I spent a chunk of the weekend debugging some code (which involved solving an optimization problem). There was an R script to setup input files and a Java program to process them. The Java program included both a random heuristic to get things going and an integer program solved by CPLEX.
Randomness in algorithms is both a blessing and a curse. Without it, genetic algorithms would reduce to identical clones of one individual engaging in a staring contest, simulation models would produce absurdly tight confidence intervals (thanks to the output have zero variance) and so on. With it, though, testing and debugging software gets tricky, since you cannot rely on the same inputs, parameter settings, hardware etc. to produce the same output, even when the program is function correctly. If I rerun a problem and get different results, or different performance (e.g., same answer but considerably faster or slower), is that an indication of a bug or just luck of the draw?
Somewhere, back in the Dark Ages, I was told that the answer to all this was to set the seed of the pseudorandom number stream explicitly. This was after the introduction of pseudorandom number generators, of course. Before that, random numbers were generated based on things like fluctuations in the voltage of the power supplied to the mainframe, or readings from a thermocouple stuck ... well, never mind. Today hardware random number generators are used mainly in cryptography or gambling, where you explicitly do not want reproducibility. With pseudorandom numbers, using the same seed will produce the same sequences of "random" values, which should hypothetically make reproducibility possible.
I think I originally came across that in the context of simulation, but it also applies in optimization, and not just in obviously random heuristics and metaheuristics. CPLEX has a built-in pseudorandom number generator which is used for some things related to branching decisions when solving integer programs (and possibly other places too). So explicitly setting a seed for both the random number generator used by my heuristic and CPLEX should make my code reproducible, right?
Wrong. There are two more wildcards here. One is that my Java code uses various types of collections (HashMaps, HashSets, ...) and also uses Streams. Both of those can behave in nondeterministic ways if you are not very careful. (Whether a Stream is deterministic may boil down to whether the source is. I'm not sure about that.) The other, and I'm pretty sure this bites me in the butt more often than any other aspect, is the use of parallel threads. My code runs multiple copies of the heuristic in parallel (using different seeds), and CPLEX uses multiple threads. The problem with parallel threads is that timing is nondeterministic, even if the sequence of operations in each thread is. The operating system will interrupt threads willy-nilly to use those cores for other tasks, such as updating the contents of the display, doing garbage collection in Java, pinging the mail server or asking Skynet whether it's time to terminate the user). If there is any communication between the processes running in parallel threads in your code, the sequencing of the messages will be inherently random. Also, if you give a process a time limit and base it on "wall clock" time (which I'm prone to doing), how far the process gets before terminating will depend on how often and for how long it was interrupted.
Limiting everything to a single thread tends not to be practical, at least for the problems I find myself tackling. So I'm (somewhat) reconciled to the notion that "reproducibility" in what I do has to be interpreted somewhat loosely.