## Thursday, July 1, 2021

### Smallest Pairwise Difference in R

In a recent blog post, OR blogger Erwin Kalvelagen included some R code for a genetic algorithm, including an objective function that takes as input a vector $x\in\Re^n$ and returns the smallest absolute pairwise difference among its elements. The actual objective calls for the smallest difference between consecutive values, but Erwin correctly notes that the smallest absolute difference over all pairs will automatically occur between consecutive values (in sorted order), so the "all pairs" approach yields the same objective value (and is considerably friendlier in the MIP and MINLP models he proposes).

I did a little experimenting and confirmed his opinion that the GA is unlikely to be competitive with the MIP model. That led me to a tangent involving the way the objective function for the GA is coded. Erwin used the obvious approach: two nested for loops that visit each pair of values exactly once, avoiding both the redundant comparison of $x_j$ with $x_i$ after comparing $x_i$ with $x_j$ and the trivial comparison of $x_i$ with itself. This approach does $O(n^2)$ work. An alternative approach is to first sort the vector $x$, which takes $O(n \log n)$ work, and then compute the consecutive differences and their minimum value ($O(n)$ work). I put together a little R code to compare the two, and unsurprisingly the method with sorting is faster (and much faster when $n$ gets big).
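The two approaches can be sketched as follows (the function names are mine, for illustration; this is not the exact code from Erwin's post or my notebook):

```r
# O(n^2) version: nested loops over all unordered pairs i < j,
# so each pair is compared once and no element is compared with itself.
min_diff_loop <- function(x) {
  n <- length(x)
  best <- Inf
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      d <- abs(x[i] - x[j])
      if (d < best) best <- d
    }
  }
  best
}

# O(n log n) version: sort first, then the answer is the smallest
# gap between consecutive values in the sorted vector.
min_diff_sort <- function(x) min(diff(sort(x)))
```

Both return the same value; for example, `min_diff_loop(c(3, 7, 1, 10))` and `min_diff_sort(c(3, 7, 1, 10))` both give 2.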

There is another wrinkle to this. I've seen a number of articles and comments online asserting that explicit looping in R (as with Erwin's nested for loops) is inefficient, and should be avoided at all costs in favor of using vectorized functions (where the looping is presumably coded in C or C++ and baked into the compiled function). I've also seen contrary opinions saying that the concern about looping is overstated. There is also, I think, a middle ground: even if explicit loops are inefficient, that's likely only a concern if you are looping over very large objects, or looping many times, or both. Erwin's test case has $n=50$, which I suspect is not large enough to be worth worrying about the efficiency or inefficiency of looping (even though the GA will evaluate the objective function repeatedly).

So to test this, I cobbled together an R notebook (which you can find here) that tests all three versions of the objective function on vectors of various dimensions. As I thought, the sorted method dominates. Even at $n=50$ it's competitive with looping (although looping appears to be slightly faster), but at $n=2,000,000$ the sorting method takes about one third of a second on my PC (using R 4.1), whereas I estimate the nested loop method would take about 10 days (!).

The third method uses the (vectorized) outer product operator in R. It computes all $n^2$ absolute differences, whereas the nested loops calculate the $n(n-1)/2$ unique (nontrivial) differences, so in theory it does about twice the work the nested for loops do. Despite that, it is faster than the nested loops (though not nearly as fast as sorting). For $n$ around 5,000 to 10,000, the outer product approach seems to be about 10x faster than the nested loops.
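A sketch of the outer product version (again, the function name is mine): `outer(x, x, "-")` builds the full $n \times n$ matrix of differences, and `upper.tri` then masks out the diagonal and the mirror-image duplicates so each unordered pair is considered once.

```r
# Vectorized version: computes all n^2 differences at once, then takes
# the minimum over the strictly upper triangle (the n(n-1)/2 unique,
# nontrivial pairs), excluding the zero x_i - x_i diagonal entries.
min_diff_outer <- function(x) {
  d <- abs(outer(x, x, "-"))
  min(d[upper.tri(d)])
}
```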

So I take away two conclusions from this. The first is confirmation of a bit of wisdom that I believe I heard from Princeton computer scientist Robert Sedgewick in a MOOC based on his book (with Kevin Wayne) "Algorithms": sorting is computationally cheap, so if you think it might help with a calculation, try it. The second is confirmation of the assertion that, even in the latest version (4.1) of R, explicit looping is probably slower than using vectorized methods, quite possibly by a noticeable amount ... when vectorized methods are available.

If you are curious about the code for the alternative methods, it's embedded in the notebook.

#### 2 comments:

1. Thanks for this. I am afraid I updated my blog entry, and it now shows your more efficient (and also more elegant) sorted objective. But, indeed, I admit that my first attempt was a double loop over all possible combinations of two points. Another, unrelated question I am still pondering: why is ga not working very well on this problem? Can we predict this, or do we just have to try?

1. I'm curious myself about why the GA is not competitive (at least on small problems). It likely has something to do with the way the solution is encoded. The encoding you chose is the obvious one, and really about the only one I can see. I vaguely recall reading something, back in the early days of GAs, that emphasized the importance of an encoding that captured the right information in the right way. It may be that, with this encoding, crossover does not work well: gluing the first half of one good solution to the second half of another too often produces a poor solution, because the minimum difference drops due to the placement of the last few points in the first part relative to the first few points in the second part, regardless of how well the rest of the points are spaced.
