Monday, May 30, 2016

Java "Deep Learning" Library

If you are a Java (or Scala) (or maybe Clojure?) programmer interested in analytics, and in particular machine learning, you should take a look at Deeplearning4j (DL4J). Quoting their web site:
Deeplearning4j is the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J is designed to be used in business environments on distributed GPUs and CPUs. Skymind is its commercial support arm.
In essence, DL4J is a library for building "deep" neural networks, where "deep" apparently means more than one hidden layer. The software is open source (Apache 2.0 license). There is some free support available (from the user community, and possibly the developers). Commercial users needing serious support apparently can buy a support contract from Skymind.

Now, I don't know much about neural networks, although I do have some interest. Right now, I don't have time to put DL4J through its paces. Someday, if I find the time, I'll probably post some observations here. Meanwhile, I did download and install DL4J, along with the examples they provide, and I ran several of the examples successfully (and none unsuccessfully).

Their web site is very well done, with more links to general resources, documentation and tutorials than I think I've seen for any open source project before. So if the notion of coding serious (not just toy) neural nets in a JVM language appeals to you, I think you'd be well served to check them out.

Saturday, May 28, 2016

Rounding Error

I recently experienced the eight-bazillionth iteration of the following scenario. Someone posts a request for help on a forum. They are doing some sort of MIP model, or possibly something even more advanced (Benders decomposition, column generation, ...), and things are not going well. In some cases they think they have made a coding error; more commonly they think the solver (CPLEX in my personal experience, since that's the only one whose forums I watch) has a bug in it. After substantial back and forth, the key problem emerges: their binary variable came out 1.00000 E-9 or 0.99999999, which is "clearly" neither 0 nor 1.

At this point, my reaction is always the same (add your favorite expletives after every few words): "Don't they know about rounding error???"

In fairness, though, they may not, and perhaps the blame lies on the shoulders of faculty (possibly including me, in my former life). One of the courses I took during my first term of graduate school (in math) was numerical analysis. This is back when mainframes were just making the transition from vacuum tubes to semiconductors, and "user input" was a euphemism for "deck of punch cards". It's also a time when memory was at a premium and, thus, single precision ruled the realm.

(BEGIN TANGENT) I suspect that some readers who are (much) younger than I, and who program, will have noticed that pretty much everything of a floating point nature is automatically done double-precision these days, and may be wondering why single-precision float types still exist. They're a holdover from my youth, and I suspect exist almost exclusively for backward compatibility. (END TANGENT)

I don't recall why I picked numerical analysis as an elective -- whimsy, I think -- but it ended up being one of the most useful courses I took in grad school. The first, and most important, lesson from it was that computers do math inexactly. This should be immediately obvious if you stop to think about it. I learned in childhood that 0.3333... = 1/3, where the ... means "repeated an infinite number of times". Stop the decimal expansion after a finite number of digits, and you get something slightly less than 1/3. Now pretend for the moment that your computer encoded numbers in decimal rather than binary, and that you used the expression 1/3 in a program. The compiler would convert 1/3 to a decimal representation, but not 0.3333... -- because that would require an infinite amount of memory to store, and the computer only has a finite amount. So the computer has to truncate the decimal expansion, and as a result what the computer stores for 1/3 is not actually 1/3. (Now go back to using binary. The same logic applies.)

I'm not going to repeat an entire course in numerical analysis here (too much typing), but another lesson I learned was that, when programming, I should never do strict equality comparisons on anything that was not stored as an integer. Here's a small example that I just ran in R (not to pick on R -- it was just handy):
> sqrt(9) - 2.99999999999999999 == 0
[1] TRUE

To a mathematician, that relation is patently false, but to the program the decimal expansion is "close enough" to 3.

My prof shared one anecdote that has stuck with me for almost half a century now. He told of an engineer who coded his own routine for inverting a matrix and was not particularly careful about dealing with rounding errors (and "stiff" matrices, a topic for another time). The engineer then published a paper which included, at one point, the inverse of a particular matrix, computed by his program, with each entry in the matrix printed to some absurd number of decimal places (five, I think). All the digits to the right of the decimal points were wrong. So were the first couple of digits immediately left of the decimal points. You can ignore the limitations of computer arithmetic precision, but (queue sinister background music) those limitations will not ignore you.

Where I'm going with this is that I learned about the limitations of computer math somewhat accidentally, because I just happened to take numerical analysis. I suspect that some OR/IE programs require it, and I'm pretty sure others (most?) don't. In the interests of reducing frustration for a lot of people (of whom I'm the only one I care about), and perhaps avoiding incorrect results being published, I think we should pressure anyone teaching computational topics in OR (particularly, but not exclusively, integer programming) to provide some minimal coverage of rounding and truncation errors, the IEEE standards for single- and double-precision values, etc. If they cannot (or will not) teach it themselves, they can at least require that the student go learn it as a prerequisite.

Please! I'm begging you!

Friday, May 6, 2016

Accessing R Objects By Name

At a recent R user group meeting, the discussion at one point focused on two of the possibly lesser known (or lesser appreciated?) functions in the base package: get and assign. The former takes a string argument and fetches the object whose name is contained in the string. The latter does the opposite, assigning an existing object to a variable whose name is a string argument.

I’ve actually used get once or twice, though not often. As an example of a use case, suppose that you create a Shiny interactive application that lets a user select one of the 104 data sets in the “datasets” package that comes with R, and then torture it in some way. In the user interface, you might give the user a list of names (or descriptions) of the data sets in a select list. Let’s say that the user chooses Fisher’s iris data set, and the name “iris” is returned in the variable input$ds. In the server code, you want to create a variable (we’ll call it df) containing the data frame to be analyzed. You can do that with df <- get(input$ds). After executing that line, df will contain the Fisher iris data.

I have a harder time finding a use for assign, but one of the moderators at the meeting said he uses it (and get) regularly to create new names for objects. As an example (mine, not his, so I’ll take the blame for it), you might decide for some reason that you want the 104 data frames in the “datasets” package renamed as “foo1” through “foo104”. His approach would be to use assign to do this. Given the same task, my first impulse would be to create a list associating “foo1” with the first data frame and so on.

We got into a discussion of whether a dynamically expanding list would be slower than using assign. I decided to do a little experiment, documented below. Spoiler alert: For 100 or so objects, there’s not a measurable difference.

The experiment

As noted above, the “datasets” package, which ships with R and is automatically put in the search path by default (i.e., you don’t have to load it explicitly with library or require), contains 104 data frames. I’ll assign each of them a new name, “foo1” through “foo104”, and then access them, first with assign/get and then with a list.

Setup

The first step is to load the list of database names into variable data.sets and then create the new names (in foonames).

data.sets <- ls("package:datasets")              # list of data set names
foonames <- paste0("foo", seq_along(data.sets))  # creates "foo1" ...

Next, I’ll count them (and put the result in count, for use later in loop constructs).

count <- length(data.sets)                       # count = 104

Timing an empty loop

To make sure the timing measurements that follow do not get too polluted by the time R needs to execute a simple “for” loop 104 times, I’ll separately time an empty loop first. The identity function is about as close as I could find to a “no-op” function in R. The last column in the output (“elapsed”), measured in seconds, is what is of interest.

system.time(for (i in 1:count) identity(i))
##    user  system elapsed
##       0       0       0

So, on my fairly modest desktop PC, the looping itself consumes negligible time.

Using get and assign

I’ll now store the 104 data frames in a list, using the get function, and then assign new names to the list entries using the list. The reason for doing this, rather than just using get and assign together in a single line, is that I want to separate the timing of assign and the timing of get.

temp <- as.list(1:count)                                          # create a list of length 104
system.time(for (i in 1:count) (temp[[i]] <- get(data.sets[i])))  # put each data frame in the list
##    user  system elapsed
##    0.02    0.00    0.02

Note that datasets[i] is a string containing the name of the i-th data frame. According to the output, 104 calls to get took negligible time.

Next, I’ll assign each data frame to the corresponding “foo” variable.

system.time(for (i in 1:count) assign(foonames[i], temp[[i]]))    # assign each data frame a new name
##    user  system elapsed
##   0.001   0.000   0.001

Once again, the time is negligible.

Last, I’ll fetch each data frame using its new name.

system.time(for (i in 1:count) get(foonames[i]))                  # fetch each data frame by its "foo" name
##    user  system elapsed
##       0       0       0

This too takes negligible time. (Methinks a pattern is emerging.)

One more thing: just to make sure the code does what I claim, let’s print the fifth data frame and “foo5” and see if they match.

data.sets[[5]]    # get the name of the fifth data frame in the list
## [1] "anscombe"
anscombe          # list that data frame explicitly
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
get("foo5")       # access foo5 using the get method
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
foo5              # access foo5 by name
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Everything matches, so the code is behaving as expected.

Using a list

Before trying the list method, I will clear memory and recreate the list of foo names to avoid any corruption by results of the previous steps.

rm(list = ls())                                  # empty the environment
data.sets <- ls("package:datasets")              # list of data set names
foonames <- paste0("foo", seq_along(data.sets))  # creates "foo1" ...
count <- length(data.sets)                       # count = 104

My preferred approach starts by creating a list of entries with the form fooname = data frame:

my.list <- list()                                                              # create an empty list
system.time(for (i in 1:count) my.list[[foonames[i]]] <- get(data.sets[[i]]))  # fill the list with data frames
##    user  system elapsed
##   0.001   0.000   0.001

Again, negligible time was consumed making the list. Next, I’ll time accessing the 104 data frames.

system.time(for (i in 1:count) my.list[[foonames[i]]])  # access each data frame from the list
##    user  system elapsed
##       0       0       0

Finally, I’ll look at “foo5” to make sure it is what we expect (the Anscombe data base).

my.list[["foo5"]]  # is this Anscombe?
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Yes, it is.

Conclusion

For something on the order of 100 objects, using get and assign and using an ordinary list for them seem to work equally well, with no meaningful difference in execution time.

This may not generalize to really large numbers of objects, but individually naming (and accessing by name) tens or hundreds of thousands of objects (or more) strikes me as a fairly low likelihood scenario.

Source code

Reproducible research seems to be a popular topic these days. I generated this post (give or take a little massaging of HTML to fit the blog style) in RStudio using an R Markdown file, which you can download here. If you “knit” it in RStudio on your own machine, you will get timing results for your setup.