column selection at read
Summary

R’s `data.table` package can efficiently select columns when reading flat
text files. For this data set it’s about 3 times faster than piping the bash
`cut` command into R and reading with `scan()`.
Data
Consider the `CargoDescriptions2015.csv` data set. This is a 5.3 GB file with
5 columns describing goods which arrived in the United States via shipping
containers in 2015.
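Before committing to a full read, one quick sanity check is to ask `fread` for
just the first few rows. This is a sketch; apart from `container_number`, which
appears below, it assumes nothing about the columns beyond what the file's
header declares.

```{R, eval = FALSE}
# Peek at the first few rows only; fread stops reading after `nrows`,
# so this returns almost instantly even on a 5.3 GB file.
head_rows <- data.table::fread("~/data/CargoDescriptions2015.csv", nrows = 5L)
str(head_rows)
```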
data.table
Suppose we want to count the number of unique containers. The naive way is to
read in the entire data set using the high performance, multi-threaded `fread`
function from `data.table`.
```{R, eval = FALSE}
system.time(cargo <- data.table::fread("~/data/CargoDescriptions2015.csv"))
```
This takes 2 minutes, 35 seconds on a 2016 MacBook Pro with 8 cores and uses
5.5 GB of memory in the R process. Since we're already using `data.table`,
let's use it to answer the original question.
```{R}
system.time(num_container <- data.table::uniqueN(cargo[, "container_number"]))
nrow(cargo) / num_container
```
This takes between 9 and 10 seconds. We see that on average each container appears about 5 times.
Can we do this faster with `data.table` by reading in only a subset of the
columns? To realistically time this we first need to remove this file from the
file cache.
```
$ vmtouch -ve ~/data/CargoDescriptions2015.csv
Evicting CargoDescriptions2015.csv
Files: 1
Directories: 0
Evicted Pages: 1383045 (5G)
Elapsed: 0.24117 seconds
```
Now let's see how long `data.table` takes to read just the columns that we
need.
```{R, eval = FALSE}
system.time(containers <- data.table::fread("~/data/CargoDescriptions2015.csv"
    , select = "container_number", nThread = 4L))
# Mon Sep 25 09:56:55 PDT 2017
#
#    user  system elapsed
# 107.664   5.942  32.428
#
#    user  system elapsed
# 110.753   6.008  33.163

system.time(num_container <- data.table::uniqueN(containers))
```
Selecting the column at the data load step reduces the load time from 2.5 minutes to about 32 seconds. This brings the total program run time from 165 seconds to around 40 seconds.
Subsequent timings of this read in the same R session are much faster, even after forcing the file out of cache. I believe the reason is that on subsequent reads all the unique strings have already been stored in R’s internal string / integer table. This would be simple enough to verify experimentally.
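A rough way to check that (a sketch, not part of the original timings): in a
fresh R session, time the same selective read twice, evicting the file from the
OS cache before each read. If the second read is still much faster, the speedup
comes from state inside the R process rather than from the file cache.

```{R, eval = FALSE}
# Sketch: run `vmtouch -ve` on the file in the shell before each call.
t1 <- system.time(data.table::fread("~/data/CargoDescriptions2015.csv"
    , select = "container_number"))
# ... evict the file from the OS cache again here ...
t2 <- system.time(data.table::fread("~/data/CargoDescriptions2015.csv"
    , select = "container_number"))
rbind(first = t1, second = t2)
```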
The moral of the story is that the exact same result can be computed much more efficiently if we know ahead of time which columns are necessary.
disk speed
Side note: this requires a full scan of a file that’s around 5 GB, and it happened in 33 seconds even though the file was not cached in memory. This means my laptop’s SSD can load data into memory at a rate of at least 5 GB / 33 seconds, or about 150 MB / second. This sounds reasonable, but can I verify it?
```
Mon Sep 25 10:06:12 PDT 2017

$ vmtouch -ve ~/data/CargoDescriptions2015.csv
Evicting /Users/clark/data/CargoDescriptions2015.csv
Files: 1
Directories: 0
Evicted Pages: 1383045 (5G)
Elapsed: 0.38839 seconds

$ vmtouch -t ~/data/CargoDescriptions2015.csv
Files: 1
Directories: 0
Touched Pages: 1383045 (5G)
Elapsed: 7.9652 seconds
```
Just came back to this and found that `vmtouch` can load this file into cached memory in 8 seconds, implying a rate of 625 MB / second.
base R
For comparison, let’s measure the same operation in base R. We can select the
columns we need through the `colClasses` parameter to `read.table`.
```{R, eval = FALSE}
# 607 seconds
system.time(cargo <- read.table("~/data/CargoDescriptions2015.csv"
    , header = TRUE, sep = ",", quote = "\"", comment.char = ""
    , colClasses = c("numeric", "character", "integer", "integer", "character")
))

# 493 seconds
system.time(containers <- read.table("~/data/CargoDescriptions2015.csv"
    , header = TRUE, sep = ",", quote = "\"", comment.char = ""
    , colClasses = c("NULL", "character", "NULL", "NULL", "NULL")
))

# 2.96 seconds. 3X faster in base R than data.table.
system.time(num_container <- length(unique(containers[, 1])))
```
These read times suggest that perhaps it would be desirable to just transform
regular R code into `data.table` or `iotools` if IO is the bottleneck.
with shell preprocessing
Another way to potentially make R faster is to use the shell to select the column of interest. This approach is less robust than the others because it is unlikely to handle corner cases well, such as escape characters and quoted strings containing the delimiter character.
Running on the above data set I see issues with the value `"SACRAMENTO,"`,
which gets incorrectly cut to `"SACRAMENTO`.
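To see why, here is a tiny made-up row (not from the actual file) where the
quoted field contains the delimiter; `cut` splits on every comma, while a
CSV-aware parser keeps the field intact.

```{R}
line <- '1,ABCD1234567,"SACRAMENTO, CA DISTRIBUTION CENTER",2015'

# What a naive comma split (the behaviour of `cut -d , -f 3`) produces:
strsplit(line, ",", fixed = TRUE)[[1]][3]
# [1] "\"SACRAMENTO"

# What a CSV-aware parser produces for the same field:
read.csv(text = line, header = FALSE, stringsAsFactors = FALSE)[[3]]
# [1] "SACRAMENTO, CA DISTRIBUTION CENTER"
```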
I’m expecting 43754131 observations.
```{R, eval = FALSE}
system.time(containers <- read.table(
    pipe("cut -d , -f 2 ~/data/CargoDescriptions2015.csv")
    , header = TRUE, sep = ",", quote = "\"", comment.char = ""
    , colClasses = "character"
))
# Didn't work:
#
# Warning message:
# In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
#   EOF within quoted string
# > dim(containers)
# [1] 8294588       1

system.time(containers2 <- scan(
    pipe("cut -d , -f 2 ~/data/CargoDescriptions2015.csv")
    , what = character(), skip = 1, sep = "\n"
))
# Read 43754131 items
#    user  system elapsed
# 147.888   2.216  97.881

length(containers2)
```
The second command seems to work, with the caveats above. This takes base R
down from 493 seconds to 98 seconds, a factor of 5. It should also use much
less memory because R only ever sees the column that it needs. `data.table`
is still 3 times faster than this and much more robust.
If I’m going to do something like this automatically I need to be able to
create these `cut` commands programmatically. It would be better to rely on
as few external dependencies as possible, and bash `cut` is not available on
Windows. What if I used Python? That’s another external dependency, but it
wouldn’t surprise me if it was as fast here as the `cut` command. Now I’m
curious.
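As an aside, generating such a `cut` command from R is simple enough. The
helper below is a hypothetical sketch (the function name is mine, not from the
post) and inherits all of `cut`’s quoting limitations; the Python timing
follows after it.

```{R, eval = FALSE}
# Hypothetical helper: build a shell command extracting one column with cut.
cut_column_cmd <- function(path, col, delim = ",") {
    sprintf("cut -d %s -f %d %s",
            shQuote(delim), as.integer(col), shQuote(path.expand(path)))
}

cmd <- cut_column_cmd("~/data/CargoDescriptions2015.csv", 2)
containers2 <- scan(pipe(cmd), what = character(), skip = 1, sep = "\n")
```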
```{R, eval = FALSE}
system.time(containers2 <- scan(
    pipe("python ~/data/select_column.py ~/data/CargoDescriptions2015.csv 2")
    , what = character(), skip = 1, sep = "\n"
))
# Read 43754131 items
#    user  system elapsed
# 188.837   2.993 138.850
```
Of course introducing a dependency on another programming language is far from ideal. I wonder if there’s a way to pipe R commands together to get pipeline parallelism? Judging by the user and elapsed times we did get some pipeline parallelism here, for both the bash and Python approaches: the reported user time, which folds in the child process’s CPU time, exceeds the elapsed time, and that can only happen when R and the preprocessing command run concurrently.
What is the limiting factor here, the preprocessing or something else? I can find the best case performance by creating another file containing only the column of interest and then reading that.
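One way to produce that single-column file, assuming the `containers` table
from the `fread` call above is still around (the post doesn’t say how the file
was created):

```{R, eval = FALSE}
# Write only the container_number column back out as its own CSV.
data.table::fwrite(containers, "~/data/container_number.csv")
```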
```{R, eval = FALSE}
N = 43754131

system.time(containers2 <- scan("~/data/container_number.csv"
    , what = character(), skip = 1, sep = "\n", n = N
))
# Read 43754131 items
#    user  system elapsed
#  70.371   0.517  70.899

# scan() seems to have three parameters similar to `n`
system.time(containers2 <- scan("~/data/container_number.csv"
    , what = character(), skip = 1, sep = "\n", n = N, nlines = N
))
# Read 43754131 items
#    user  system elapsed
#  70.721   0.537  71.273
```
The times above are consistent even when I remove the file from cache. This
makes sense based on what I saw above with `data.table`. So around 70 seconds
is a lower bound for how fast this load can happen with `scan`. The bash `cut`
and Python versions take 1.4 to 2 times longer than this in elapsed time, and
use 2 to 3 times as much CPU time.