using global variables with multicore
It’s generally considered poor style to use global variables inside functions. For example:
n = 1e8
x = rnorm(n)
# implicitly uses a global variable. Bad.
f_global = function() mean(x)
# explicitly uses a global variable. Acceptable.
f = function(.x = x) mean(.x)
# They do the same thing.
f_global() == f()
But if the data we’re computing on is very large then it can be expensive
to transfer the data, so we want to avoid this. The following code creates
a new forked R process where the variable x
is defined, so that both
f()
and f_global()
work.
library(parallel)
cls = makeCluster(1L, "FORK")
calls = parse(text = "
f()
f_global()
clusterCall(cls, f)
clusterCall(cls, f, x)
clusterCall(cls, f_global)
")
times = sapply(calls, function(expr)
system.time(eval(expr), gcFirst = FALSE)["elapsed"]
)
png("../assets/2018-02-20_call_times_in_cluster.png")
dotchart(times, calls, xlim = c(0, max(times)), xlab = "seconds elapsed")
dev.off()
When we call clusterCall
with x
as an argument the master process
serializes x
to the worker. We can tell none of the other versions of the
code transfer x
because the other times are all the same.
This particular machine takes around 1 second
to transfer an object x
which takes up about 1 GB of memory, for a speed
on the order of 1 GB/sec. Running this same code with a remote worker
exacerbates the problem, because the network must transfer the data.
If we look at the source for R’s parallel package we see that the
additional arguments are gathered up into list(...)
and sent to the
worker through sendCall()
which calls postNode()
which calls
sendData()
which calls sendData.SOCKnode()
which finally calls
serialize()
to transfer the data.
sendData.SOCKnode <- function(node, data) serialize(data, node$con)
Thoughts
On a side note, I was pleased that dotchart()
accepted the code
expressions. It shows we can use normal R style with code objects. Just be
careful.
I was a little surprised that clusterCall(cls, f)
didn’t transfer x
.
But maybe I shouldn’t be surprised, because to know that x
“should” be
transferred clusterCall
would have had to inspect the signature for f
and analyze the arguments.