There are three popular ways to manipulate and analyze data frames in R:
These packages are not mutually exclusive, ie one might mix two of them. We can refer to them as domain specific languages (DSL)’s because they all offer distinct syntax to perform operations that resemble SQL type queries on tables. They go beyond SQL since they support general R functions.
I’m interested in these DSL’s for the purpose of code analysis and code transformations, for example creating parallel code from serial. To analyze the code we need to precisely understand the semantics when each function uses nonstandard evaluation (NSE).
This post shows all the different common ways to do a basic row selection operation using these three packages. If I missed any please let me know. I borrowed the example from the dplyr documentation. It’s equivalent to this SQL:
SELECT * FROM flights WHERE month=1 AND day=1;
Base R lets us build a logical vector that we can then use to filter the rows. We can build the vector in a few different ways.
library(nycflights13) flights_df = as.data.frame(flights) month_var = "month" day_var = "day" # standard evaluation: condition = flights_df[, month_var] == 1 & flights_df[, day_var] == 1 # NSE in `$` condition = flights_df$month == 1 & flights_df$day == 1 # NSE using `with` to scope inside the data frame condition = with(flights_df, month == 1 & day == 1)
Once we have the logical vector for the filter we can use it directly, or we can use nonstandard evaluation in subset.
# standard evaluation f11 = flights_df[condition, ] # NSE includes column names in scope f11 = subset(flights_df, month == 1 & day == 1) f11 = subset(flights_df, cond)
data.table resembles base R for this task. It uses standard evaluation in the special case when there is a single symbol in the first argument.
library(data.table) flights_dt = data.table(flights) # standard evaluation f11 = flights_dt[condition, ] # NSE includes column names in scope f11 = flights_dt[month == 1 & day == 1, ]
dplyr provides the largest number of ways to construct this query.
library(dplyr) flights_tbl = flights # standard evaluation f11 = filter(flights_tbl, condition) # NSE includes column names in scope f11 = filter(flights_tbl, month == 1 & day == 1) # NSE includes column names in scope and using `&` on args f11 = filter(flights_tbl, month == 1, day == 1) # NSE explicitly scoping inside flights_tbl f11 = filter(flights_tbl, .data$month == 1, .data$day == 1)
Computing on the language
It is possible to directly construct the code that uses NSE given the variable names by manipulating language objects. This seems like a complicated, indirect way of accomplishing the desired goal. Here it is for those who are curious:
call = substitute(flights_dt[col1 == 1 & col2 == 1, ] , list(col1 = as.symbol(month_var), col2 = as.symbol(day_var))) f11 = eval(call) # When eval() appears, think DANGER!
library(rlang) # tidyeval framework with quoting # https://stackoverflow.com/questions/24569154/use-variable-names-in-functions-of-dplyr month_quo = quo(month) day_quo = quo(day) f11 = filter(flights_tbl, (!!month_quo) == 1, (!!day_quo) == 1) # tidyeval framework with a string month_quo = new_quosure(as.symbol(month_var)) day_quo = new_quosure(as.symbol(day_var)) f11 = filter(flights_tbl, (!!month_quo) == 1, (!!day_quo) == 1)
The basic queries using NSE to refer to column names in a data frame are the easiest to analyze, because all the logic lives in just one call. So we need to know that the assumption that they are column names is correct. It’s possible to check this. If we compute the logical subset condition ahead of time then we need to look around other parts of the code to see how a variable was defined.
In general data.table’s approach is to use special symbols, while dplyr uses special functions. I’m not sure if one has an advantage for code analysis.
On a broader note, if we restrict ourselves to the class of operations on tables that all three of these approaches do well and easily then it may be possible to convert them to and from a common intermediate representation that captures the desired semantics. With the right representation we might be able to even go beyond the R language into say Python pandas and SQL. dplyr already does this in some sense by generating SQL directly. I explore this a bit more in a follow on post: representing semantics for data analysis code.
I think I need to understand more of the theory of databases to know if this is feasible or not. Maybe rquery offers some path forward in building abstract queries from R that we can optimize.
Here are the versions of the software used here:
> sessionInfo() R version 3.4.3 (2017-11-30) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.3 LTS ... other attached packages:  bindrcpp_0.2 rlang_0.1.2 dplyr_0.7.2 data.table_1.10.0  nycflights13_0.2.2