What happens to performance when data gets too large for memory? R formerly limited how long a vector could be, but long vectors have lifted that limit. This means we can create objects larger than physical memory, and the operating system will spill them over into swap space.

To find out, I did a little experiment: create a vector of length n and compute on it, with n chosen to take up various fractions of the system memory. For each n I ran this code:

x <- rnorm(n)
xbar <- mean(x)
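
The full timing script isn't shown above, but it would look something like the sketch below; the particular memory fractions and the use of system.time() are my assumptions.

mem_bytes <- 8 * 2^30                    # 8 GB of physical memory
fractions <- seq(0.1, 1.5, by = 0.1)     # fractions of memory to fill (assumed values)
elapsed <- sapply(fractions, function(f) {
    n <- floor(f * mem_bytes / 8)        # a double takes 8 bytes
    t <- system.time({
        x <- rnorm(n)
        xbar <- mean(x)
    })["elapsed"]
    rm(x)
    gc()                                 # release memory before the next run
    t
})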

Spinning Disk

Here’s what happened with a conventional spinning hard disk:

Timings start off almost perfectly linear in n, which is to be expected because both rnorm() and mean() are O(n). Once the system starts swapping to disk, performance immediately drops by more than an order of magnitude. After that, timings follow a rough linear trend with quite a bit more variability.

Larger values of n were not possible because this code ran on a system with 8 GB of memory and 8 GB of swap. If you try to allocate something too big, R refuses with this message:

Error: cannot allocate vector of size 20 Gb
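
The size of that request follows from the fact that each double takes 8 bytes, so a quick back-of-the-envelope check looks like this (the n here is illustrative, not the exact value from the run above):

n <- 2.7e9                  # illustrative length, not the exact n from the run above
8 * n / 2^30                # roughly 20 GB, more than the 16 GB of RAM plus swap
object.size(numeric(1e6))   # object.size() reports the footprint of an existing object (~8 MB)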

SSD

What if we swap to an SSD (solid state drive)?

Once swapping begins, the computation becomes only about 2 to 3 times slower, rather than 10 to 20 times slower. It continues to slow down as more of the vector lives in swap, but the rate is only about three times that of staying wholly within memory. This is much better performance all around than the spinning disk.

Big thanks to our excellent UC Davis systems administrator Nehad Ismail for installing an SSD on one of our servers and configuring the swap.

bigmatrix

There are also packages, such as bigmemory, that store data in file-backed, memory-mapped matrices rather than relying on the operating system's swap.
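
Here is a minimal sketch of what that looks like with bigmemory; the dimensions and file names below are placeholders, and colmean() comes from the companion biganalytics package.

# A file-backed big.matrix lives on disk and is memory-mapped, so R only
# pages in the parts it touches. (Sketch only; file names are placeholders.)
library(bigmemory)
library(biganalytics)

x <- filebacked.big.matrix(nrow = 1e8, ncol = 1,
                           type = "double",
                           backingfile = "x.bin",
                           descriptorfile = "x.desc")
x[, 1] <- rnorm(1e8)      # fill the single column
xbar <- colmean(x)[1]     # column mean via biganalytics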

Conclusions

Performance in R hits a wall once computation starts going to disk, which is why swapping is best avoided. If you're actually working with objects this big, you need to stay well under these memory limits, since R's copy-on-modify semantics mean objects get copied right and left.
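
You can watch those copies happen with tracemem(), which is available when R is built with memory profiling enabled, as the standard CRAN binaries are:

# tracemem() reports when R duplicates an object.
x <- rnorm(1e6)
tracemem(x)
y <- x          # no copy yet: x and y share the same memory
y[1] <- 0       # copy-on-modify: tracemem prints that the vector was duplicated
untracemem(x)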

However, this does suggest that SSDs have much to offer when it comes to improving performance and enabling creative solutions. It's exciting to see such active work in projects like R's new fst package.
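
As a rough illustration of the kind of thing fst enables, you can write a data frame to disk and read back only the rows you need; the file name and data below are made up.

# fst stores data frames on disk with fast compression and random access.
library(fst)
df <- data.frame(x = rnorm(1e7))
write_fst(df, "df.fst")
part <- read_fst("df.fst", from = 1, to = 1000)   # read only the first 1000 rows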