Friday, June 6, 2014

Reproducibility and Java Collections

I've been immersed in Java coding for a research project, and I keep tripping over unintentional randomness in the execution of my code. Coincidentally, I happened to read a blog post today titled "Some myths of reproducible computational research", by C. Titus Brown, a faculty member (one of those hybrid species: Computer Science and Biology) at Michigan State University, my former employer. The post is worth reading. I've seen quite an uptick in online discussions of the importance of reproducible research, sharing code, sharing data, etc., to which I will add one more virtue (specific to computational research): it's a darn sight easier to debug a program whose behavior is deterministic than to debug an inherently stochastic one.

That brings me to a couple of tripwires that have been snagging me on the recent project (and some other projects, for that matter). First, any code that uses parallel threads is capable of injecting some unwanted randomness. A given thread will consume different amounts of time in different runs, due to uncontrollable (and hence, for our purposes, "random") interference by external processes that from time to time are swapped into the CPU core running the given thread. If your program has multiple interdependent threads, the timing of when one thread gets a message from another thread will be different on each run unless you do some serious (and IMHO seriously tedious) magic to synchronize them. I generally shy away from coding multiple threads myself, the exception being programs with a graphical user interface, where the GUI has its own set of scheduling/event threads, and really lengthy computations need to be done in "worker" threads to avoid paralyzing the GUI. Even then, when I'm using a multithreaded library such as CPLEX, execution times vary in unpredictable ways.
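To make the thread-timing point concrete, here is a minimal sketch (mine, not code from the project): two worker threads append their results to a shared list, and the interleaving depends entirely on OS scheduling, so the final order can differ from run to run even though each thread is itself deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ThreadOrderDemo {
    // Run two threads that each append five tagged items to a shared list.
    static List<String> run() throws InterruptedException {
        List<String> results = Collections.synchronizedList(new ArrayList<>());
        Thread a = new Thread(() -> {
            for (int i = 0; i < 5; i++) results.add("A" + i);
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < 5; i++) results.add("B" + i);
        });
        a.start(); b.start();
        a.join(); b.join();
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        // Same ten elements every time, but their interleaving is not
        // guaranteed to be reproducible across runs.
        System.out.println(run());
    }
}
```

The contents of the list are always the same ten strings; it is only the order that is at the mercy of the scheduler, which is exactly the kind of "randomness" that makes debugging painful.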

The second tripwire has to do with Java collections. As a mathematician, I tend to think in terms of matrices, vectors and sets, and not so much in terms of lists and maps. I'll often code a collection of objects as a HashSet, both because I think of them as a (mathematical) set rather than a list, and because the Set interface in Java has one key property not shared by the List interface: adding an object to a Set that already contains it does not add a duplicate entry. Unfortunately, a key shortcoming (IMHO) of HashSet is clearly spelled out in its documentation:
It makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time.
To give one example of what this implies for reproducibility (and debugging), I coded a heuristic for constructing a spanning tree that takes each node from a set (as in HashSet) of unattached nodes and links it to a randomly selected node from the tree under construction. The selection of the node already in the tree to which the new node will be linked (by an edge) is done using a pseudorandom number stream with an explicit, user-specified seed. By repeating the seed, I could reproduce the choices of those nodes, if only the rest of the heuristic were deterministic. Unfortunately, because the iteration order of a HashSet can vary irreproducibly from run to run, I get a different tree each time I run the code, even with the same seed value. So I need to change that HashSet to a list of some sort ... and I need to remember this lesson the next time I'm tempted to use a HashSet.
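Here is a toy illustration of the iteration-order trap (the Node class and names are hypothetical, not the actual heuristic code). A class that does not override hashCode() gets the default identity-based hash, which can differ between JVM runs, so a HashSet of such objects may iterate in a different order each run; a LinkedHashSet, by contrast, always iterates in insertion order.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class IterationOrderDemo {
    static class Node {
        final int id;
        Node(int id) { this.id = id; }
        @Override public String toString() { return "n" + id; }
        // No hashCode()/equals() override: identity hash, which can
        // differ from one JVM run to the next.
    }

    // Insert n nodes into a LinkedHashSet and report the iteration order.
    static String linkedOrder(int n) {
        Set<Node> linked = new LinkedHashSet<>();
        for (int i = 0; i < n; i++) linked.add(new Node(i));
        return linked.toString();
    }

    public static void main(String[] args) {
        Set<Node> hashed = new HashSet<>();
        for (int i = 0; i < 6; i++) hashed.add(new Node(i));
        System.out.println("HashSet order:       " + hashed); // may vary per run
        System.out.println("LinkedHashSet order: " + linkedOrder(6)); // always n0..n5
    }
}
```

If the heuristic drew its "random" node by walking the set's iterator, the HashSet version would be at the mercy of hash ordering no matter how carefully the pseudorandom seed was fixed, while the LinkedHashSet version would be reproducible.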

6 comments:

  1. Isn't there some kind of ordered map in Java? A quick search suggests LinkedHashSet, but I haven't fully read through the documentation.

    ReplyDelete
    Replies
    1. Geoffrey De Smet posted a comment on G+ (https://plus.google.com/u/0/+PaulRubin/posts/DQpP89wkLNp) indicating that LinkedHashSet and LinkedHashMap produce reproducible results without large performance penalty.

      TreeMap and TreeSet are ordered in the sense of sort-order. I imagine there's a performance penalty for the sorting, but perhaps it's not too bad.

      Delete
  2. I've used TreeSet a bit for this sort of thing, substituting it for HashSet when faced with similar problems. While I haven't actually run any explicit benchmarks, the performance of applications dealing with sets of thousands to hundreds of thousands of records didn't seem to get worse. At least, it didn't degrade to the point where I investigated why.

      Delete
    3. Ryan: Thanks, that's useful to know. I did a global search-and-replace on two projects, replacing HashSet and HashMap with LinkedHashSet and LinkedHashMap. Everything that worked before still worked, and I did not notice any performance penalty -- but, unlike you, I wasn't working with any large data instances.

      Delete
  2. Thanks for sharing your experience on that. We ran into similar issues with a Python tool we developed for generating and customizing optimization models. It used collections, which was convenient but didn't preserve the order in which we added items. When iterating through the constraint collection to build the models, it entered the constraints in a different order each time. Thus, running the same model with CPLEX multiple times would result in a different optimal basis, leading in turn to significantly different run times.

    ReplyDelete
    Replies
    1. Marc-Andre: I find it a little bit reassuring that the same thing can happen with collections in Python. I was half expecting an "I told you so" from the Pythonistas (i.e., this is what I get for resisting the Python wave and sticking with Java). :-)

      Delete

If this is your first time commenting on the blog, please read the Ground Rules for Comments. In particular, if you want to ask an operations research-related question not relevant to this post, consider asking it on OR-Exchange.