## Thursday, July 1, 2021

### Smallest Pairwise Difference in R

In a recent blog post, OR blogger Erwin Kalvelagen included some R code for a genetic algorithm, including an objective function that takes as input a vector $x\in\Re^n$ and returns the smallest absolute pairwise difference in its elements. The actual objective calls for the smallest difference in consecutive values, but Erwin correctly notes that the smallest absolute difference in any pair will automatically occur in consecutive values, so the "all pairs" approach yields the same objective value (and is considerably friendlier in the MIP and MINLP models he proposes).

I did a little experimenting and confirmed his opinion that the GA is unlikely to be competitive with the MIP model. That led me to a tangent involving the way the objective function for the GA is coded. Erwin used the obvious approach: two nested for loops to find all pairs of values once, avoiding comparing $x_i$ with $x_j$ and then (redundantly) $x_j$ with $x_i$, and avoiding comparing $x_i$ with itself. This approach does $O(n^2)$ work. An alternative approach is to first sort the vector $x$, which takes $O(n \log n)$ work, and then compute consecutive differences and their minimum value ($O(n)$). I put together a little R code to compare the two, and unsurprisingly the method with sorting is faster (and much faster when $n$ gets big).
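For concreteness, here is a minimal sketch of the two approaches just described. The function names `min_diff_loops` and `min_diff_sorted` are made up for illustration (this is not the code from Erwin's post or from the notebook), and both functions assume a numeric vector with at least two elements.

```r
# Illustrative sketch only; function names are hypothetical.
# Both functions assume length(x) >= 2.

# Nested loops: visit each unordered pair (i, j) with i < j exactly once,
# which is n(n-1)/2 comparisons, i.e. O(n^2) work.
min_diff_loops <- function(x) {
  n <- length(x)
  best <- Inf
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d <- abs(x[i] - x[j])
      if (d < best) best <- d
    }
  }
  best
}

# Sort first (O(n log n)), then take the minimum of the
# consecutive differences (O(n)).
min_diff_sorted <- function(x) {
  min(diff(sort(x)))
}

x <- c(7.5, 1.0, 4.2, 9.9, 4.0)
min_diff_loops(x)   # closest pair is (4.0, 4.2)
min_diff_sorted(x)  # same value as the loop version
```

The sorted version works because, once the values are in order, the closest pair must be adjacent, so only the $n-1$ consecutive gaps need to be checked.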

There is another wrinkle to this. I've seen a number of articles and comments online asserting that explicit looping in R (as with Erwin's nested for loops) is inefficient, and should be avoided at all costs in favor of using vectorized functions (where the looping is presumably coded in C or C++ and baked into the compiled function). I've also seen contrary opinions saying that the concern about looping is overstated. There is also, I think, a middle ground: even if explicit loops are inefficient, that's likely only a concern if you are looping over very large objects, or looping many times, or both. Erwin's test case has $n=50$, which I suspect is not large enough to be worth worrying about the efficiency or inefficiency of looping (even though the GA will evaluate the objective function repeatedly).

So to test this, I cobbled together an R notebook (which you can find here) that tests three versions of the objective function (nested loops, sorting, and a vectorized version described below) on vectors of various dimensions. As I thought, the sorted method dominates. Even at $n=50$ it's competitive with looping (although looping appears to be slightly faster), but at $n=2{,}000{,}000$ the sorting method takes about one third of a second on my PC (using R 4.1), whereas I estimate the nested loop method would take about 10 days (!).
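A comparison along these lines can be sketched with base R's `system.time`. This is not the notebook's code: the function names, the seed, and the (deliberately small) sizes are my own choices, picked so that the loop version finishes quickly.

```r
# Illustrative timing harness; names, seed and sizes are hypothetical choices.
min_diff_loops <- function(x) {
  n <- length(x)
  best <- Inf
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d <- abs(x[i] - x[j])
      if (d < best) best <- d
    }
  }
  best
}

min_diff_sorted <- function(x) min(diff(sort(x)))

set.seed(123)
for (n in c(50L, 500L, 2000L)) {
  x <- runif(n)
  # system.time() evaluates the expression in the calling environment,
  # so the assignments below are visible after the call.
  t_loops  <- system.time(r_loops  <- min_diff_loops(x))["elapsed"]
  t_sorted <- system.time(r_sorted <- min_diff_sorted(x))["elapsed"]
  # The two methods should agree on the answer.
  stopifnot(isTRUE(all.equal(r_loops, r_sorted)))
  cat(sprintf("n = %5d  loops: %.3fs  sorted: %.3fs\n", n, t_loops, t_sorted))
}
```

At small $n$ a single `system.time` call is too coarse to separate the two methods reliably, so repeated runs (for instance with the `microbenchmark` package) would be a better way to compare them there.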

The third method uses the (vectorized) outer product operator in R. It computes all $n^2$ absolute differences, whereas the nested loops calculate the $n(n-1)/2$ unique (nontrivial) differences, so in theory it does about twice the work the nested for loops do. Despite that, it is faster than the nested loops (though not nearly as fast as sorting). For $n$ around 5,000 to 10,000, the outer product approach seems to be about 10x faster than the nested loops.
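One way to write the outer product version is sketched below. The name `min_diff_outer` is hypothetical, and overwriting the diagonal with `Inf` is my own way of discarding the zero self-differences; the notebook may handle that differently.

```r
# Illustrative sketch only; the function name is hypothetical.
min_diff_outer <- function(x) {
  # n x n matrix whose (i, j) entry is |x_i - x_j|
  d <- abs(outer(x, x, "-"))
  # Every diagonal entry is |x_i - x_i| = 0, so mask the diagonal
  # before taking the minimum.
  diag(d) <- Inf
  min(d)
}

x <- c(7.5, 1.0, 4.2, 9.9, 4.0)
min_diff_outer(x)  # closest pair is (4.0, 4.2)
```

Besides doing roughly twice the arithmetic, this approach stores the full $n \times n$ matrix of doubles, which at $n = 10{,}000$ is about 800 MB, so memory becomes a constraint well before the sorted method notices the size.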

So I take away two conclusions from this. The first is confirmation of a bit of wisdom that I believe I heard from Princeton computer scientist Robert Sedgewick in a MOOC based on his book (with Kevin Wayne) "Algorithms": sorting is computationally cheap, so if you think it might help with a calculation, try it. The second is confirmation of the assertion that, even in the latest version (4.1) of R, explicit looping is probably slower, and quite possibly by a noticeable amount, than using vectorized methods ... when there are vectorized methods available.

If you are curious about the code for the alternative methods, it's embedded in the notebook.
