Monday, May 13, 2019

Randomness: Friend or Foe?

I spent a chunk of the weekend debugging some code (which involved solving an optimization problem). There was an R script to setup input files and a Java program to process them. The Java program included both a random heuristic to get things going and an integer program solved by CPLEX.

Randomness in algorithms is both a blessing and a curse. Without it, genetic algorithms would reduce to identical clones of one individual engaging in a staring contest, simulation models would produce absurdly tight confidence intervals (thanks to the output have zero variance) and so on. With it, though, testing and debugging software gets tricky, since you cannot rely on the same inputs, parameter settings, hardware etc. to produce the same output, even when the program is function correctly. If I rerun a problem and get different results, or different performance (e.g., same answer but considerably faster or slower), is that an indication of a bug or just luck of the draw?

Somewhere, back in the Dark Ages, I was told that the answer to all this was to set the seed of the pseudorandom number stream explicitly. This was after the introduction of pseudorandom number generators, of course. Before that, random numbers were generated based on things like fluctuations in the voltage of the power supplied to the mainframe, or readings from a thermocouple stuck ... well, never mind. Today hardware random number generators are used mainly in cryptography or gambling, where you explicitly do not want reproducibility. With pseudorandom numbers, using the same seed will produce the same sequences of "random" values, which should hypothetically make reproducibility possible.

I think I originally came across that in the context of simulation, but it also applies in optimization, and not just in obviously random heuristics and metaheuristics. CPLEX has a built-in pseudorandom number generator which is used for some things related to branching decisions when solving integer programs (and possibly other places too). So explicitly setting a seed for both the random number generator used by my heuristic and CPLEX should make my code reproducible, right?

Wrong. There are two more wildcards here. One is that my Java code uses various types of collections (HashMaps, HashSets, ...) and also uses Streams. Both of those can behave in nondeterministic ways if you are not very careful. (Whether a Stream is deterministic may boil down to whether the source is. I'm not sure about that.) The other, and I'm pretty sure this bites me in the butt more often than any other aspect, is the use of parallel threads. My code runs multiple copies of the heuristic in parallel (using different seeds), and CPLEX uses multiple threads. The problem with parallel threads is that timing is nondeterministic, even if the sequence of operations in each thread is. The operating system will interrupt threads willy-nilly to use those cores for other tasks, such as updating the contents of the display, doing garbage collection in Java, pinging the mail server or asking Skynet whether it's time to terminate the user). If there is any communication between the processes running in parallel threads in your code, the sequencing of the messages will be inherently random. Also, if you give a process a time limit and base it on "wall clock" time (which I'm prone to doing), how far the process gets before terminating will depend on how often and for how long it was interrupted.

Limiting everything to a single thread tends not to be practical, at least for the problems I find myself tackling. So I'm (somewhat) reconciled to the notion that "reproducibility" in what I do has to be interpreted somewhat loosely.

4 comments:

  1. FICO Xpress MIP solver has some guarantees on the reproducibility of a solution path:

    https://community.fico.com/s/blog-post/a5Q8000000082YvEAI/fico1154

    ReplyDelete
  2. Thanks for the link. CPLEX also has a "deterministic parallel search mode" that can be set via a parameter (along with parameters for making various time limits deterministic). I don't think I've ever tried the "deterministic parallel search mode", mainly because my desire for consistency is usually dwarfed by my desire for speed. It would be interesting to know whether results from either CPLEX or Xpress were 100% reproducible with the right settings (assuming same problem inputs, same hardware etc.).

    ReplyDelete
  3. Fun fact: the random number generator in .NET produces more odd numbers than even. So just relying on that might skew your simulation results:
    https://github.com/dotnet/corefx/issues/23298

    ReplyDelete
    Replies
    1. Well, that's a bit unsettling at minimum. Given my experience with other M$ software in the past, though, I can't say I'm shocked. I guess it could be worse. A long, long time ago I was using a pseudorandom number generator ... can't recall in what language ... and someone published a report that it cycled. I tested, and sure enough it cycled (after 100 or so observations, I think).

      Delete

Due to intermittent spamming, comments are being moderated. If this is your first time commenting on the blog, please read the Ground Rules for Comments. In particular, if you want to ask an operations research-related question not relevant to this post, consider asking it on Operations Research Stack Exchange.