Punchline: Watch out for lazy evaluation and .Primitive() functions when profiling R code.

While working on parallel R code, I wrote a chunked version of the covarance function. The idea is to split the original problem into several chunks that can be computed in parallel. This version is serial, but the idea is the same. The actual work is done when the cov() function included with the stats package is called on each chunk.

For an \(n \times p\) matrix \(x\) with \(n = 1,000,000\) and \(p = 10\) the chunked version was about 3 times slower than just calling cov() by itself. This surprised me, since I expected the overhead of breaking the problem into chunks to be dominated by the actual computation inside calls to cov(), thus making the chunked version only marginally slower then cov() for sufficiently large \(n\).

So I profiled it, and saw that nearly all of the time was spent inside is.data.frame():

Rprof("cov_chunked.out")
replicate(10, cov_chunked(x))
Rprof(NULL)

summaryRprof("cov_chunked.out")

$by.self
                self.time self.pct total.time total.pct
"is.data.frame"      7.72    56.35       7.72     56.35
"cov"                5.98    43.65      13.70    100.00

How could is.data.frame be so slow? I looked at it quite carefully, stepped through the debugger, traced it, counted all the calls, but didn’t understand. is.data.frame is a simple function that checks the class, and takes less than 1 microsecond. Yet the profiler shows that each call to is.data.frame() was taking hundreds of milliseconds, depending on the problem size of \(n, p\). This dependence on the problem size turned out to be a major clue.

The following line was inside an lapply function:

cov(x[, index[[1]]], x[, index[[2]]])

Looking at the cov() function itself, the first uses of arguments x and y are when it calls is.data.frame() on them. In this case cov called is.data.frame() on x[, index[[1]]], which was an unevaluated promise until that point. The subset operator [ was taking hundreds of milliseconds to allocate memory and copy large chunks of x. R’s copy on write model doesn’t apply here, because we didn’t just pass in x alone.

The profiler should have showed me the time was actually spent within [. But here’s what we see at the R level:

> `[`
.Primitive("[")

This means that [ does absolutely everything in C, bypassing R code completely. And it never showed up on the profiler.

Big thanks to Duncan Temple Lang for looking through this with me and figuring it out~