Friday, May 6, 2016

Accessing R Objects By Name

At a recent R user group meeting, the discussion at one point focused on two of the possibly lesser known (or lesser appreciated?) functions in the base package: get and assign. The former takes a string argument and fetches the object whose name is contained in the string. The latter does the opposite, assigning an existing object to a variable whose name is a string argument.

I’ve actually used get once or twice, though not often. As an example of a use case, suppose that you create a Shiny interactive application that lets a user select one of the 104 data sets in the “datasets” package that comes with R, and then torture it in some way. In the user interface, you might give the user a list of names (or descriptions) of the data sets in a select list. Let’s say that the user chooses Fisher’s iris data set, and the name “iris” is returned in the variable input$ds. In the server code, you want to create a variable (we’ll call it df) containing the data frame to be analyzed. You can do that with df <- get(input$ds). After executing that line, df will contain the Fisher iris data.

I have a harder time finding a use for assign, but one of the moderators at the meeting said he uses it (and get) regularly to create new names for objects. As an example (mine, not his, so I’ll take the blame for it), you might decide for some reason that you want the 104 data frames in the “datasets” package renamed as “foo1” through “foo104”. His approach would be to use assign to do this. Given the same task, my first impulse would be to create a list associating “foo1” with the first data frame and so on.

We got into a discussion of whether a dynamically expanding list would be slower than using assign. I decided to do a little experiment, documented below. Spoiler alert: For 100 or so objects, there’s not a measurable difference.

The experiment

As noted above, the “datasets” package, which ships with R and is automatically put in the search path by default (i.e., you don’t have to load it explicitly with library or require), contains 104 data frames. I’ll assign each of them a new name, “foo1” through “foo104”, and then access them, first with assign/get and then with a list.

Setup

The first step is to load the list of database names into variable data.sets and then create the new names (in foonames).

data.sets <- ls("package:datasets")              # list of data set names
foonames <- paste0("foo", seq_along(data.sets))  # creates "foo1" ...

Next, I’ll count them (and put the result in count, for use later in loop constructs).

count <- length(data.sets)                       # count = 104

Timing an empty loop

To make sure the timing measurements that follow do not get too polluted by the time R needs to execute a simple “for” loop 104 times, I’ll separately time an empty loop first. The identity function is about as close as I could find to a “no-op” function in R. The last column in the output (“elapsed”), measured in seconds, is what is of interest.

system.time(for (i in 1:count) identity(i))
##    user  system elapsed
##       0       0       0

So, on my fairly modest desktop PC, the looping itself consumes negligible time.

Using get and assign

I’ll now store the 104 data frames in a list, using the get function, and then assign new names to the list entries using the list. The reason for doing this, rather than just using get and assign together in a single line, is that I want to separate the timing of assign and the timing of get.

temp <- as.list(1:count)                                          # create a list of length 104
system.time(for (i in 1:count) (temp[[i]] <- get(data.sets[i])))  # put each data frame in the list
##    user  system elapsed
##    0.02    0.00    0.02

Note that datasets[i] is a string containing the name of the i-th data frame. According to the output, 104 calls to get took negligible time.

Next, I’ll assign each data frame to the corresponding “foo” variable.

system.time(for (i in 1:count) assign(foonames[i], temp[[i]]))    # assign each data frame a new name
##    user  system elapsed
##   0.001   0.000   0.001

Once again, the time is negligible.

Last, I’ll fetch each data frame using its new name.

system.time(for (i in 1:count) get(foonames[i]))                  # fetch each data frame by its "foo" name
##    user  system elapsed
##       0       0       0

This too takes negligible time. (Methinks a pattern is emerging.)

One more thing: just to make sure the code does what I claim, let’s print the fifth data frame and “foo5” and see if they match.

data.sets[[5]]    # get the name of the fifth data frame in the list
## [1] "anscombe"
anscombe          # list that data frame explicitly
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
get("foo5")       # access foo5 using the get method
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
foo5              # access foo5 by name
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Everything matches, so the code is behaving as expected.

Using a list

Before trying the list method, I will clear memory and recreate the list of foo names to avoid any corruption by results of the previous steps.

rm(list = ls())                                  # empty the environment
data.sets <- ls("package:datasets")              # list of data set names
foonames <- paste0("foo", seq_along(data.sets))  # creates "foo1" ...
count <- length(data.sets)                       # count = 104

My preferred approach starts by creating a list of entries with the form fooname = data frame:

my.list <- list()                                                              # create an empty list
system.time(for (i in 1:count) my.list[[foonames[i]]] <- get(data.sets[[i]]))  # fill the list with data frames
##    user  system elapsed
##   0.001   0.000   0.001

Again, negligible time was consumed making the list. Next, I’ll time accessing the 104 data frames.

system.time(for (i in 1:count) my.list[[foonames[i]]])  # access each data frame from the list
##    user  system elapsed
##       0       0       0

Finally, I’ll look at “foo5” to make sure it is what we expect (the Anscombe data base).

my.list[["foo5"]]  # is this Anscombe?
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Yes, it is.

Conclusion

For something on the order of 100 objects, using get and assign and using an ordinary list for them seem to work equally well, with no meaningful difference in execution time.

This may not generalize to really large numbers of objects, but individually naming (and accessing by name) tens or hundreds of thousands of objects (or more) strikes me as a fairly low likelihood scenario.

Source code

Reproducible research seems to be a popular topic these days. I generated this post (give or take a little massaging of HTML to fit the blog style) in RStudio using an R Markdown file, which you can download here. If you “knit” it in RStudio on your own machine, you will get timing results for your setup.