I thought I would follow up on my June 29 post, "
Does Computer Science Help with OR?", by giving a quick example of how exposure to fundamentals of computer science recently helped me.
A current research project involves optimization models containing large numbers of what are basically set covering constraints, constraints of the form $$\sum_{i\in S} x_i \ge 1,$$ where the $x_i$ are binary variables and $S$ is some subset of the set of all possible indices. The constraints are generated on the fly (exactly how is irrelevant here).
In some cases, the same constraint may be generated more than once, since portions of the code run in parallel threads. Duplicates need to be weeded out before the constraints are added to the main integer programming model. Also, redundant constraints may be generated. By that, I mean we may have two cover constraints, summing over sets $S_1$ and $S_2$, where $S_1 \subset S_2$. When that happens, the first constraint implies the second one, so the second (weaker) constraint is redundant and should be dropped.
So there comes a "moment of reckoning" where all the constraints generated by all those parallel threads get tossed together, and duplicate or redundant ones need to be weeded out. That turns out to be a rather tedious, time-consuming operation, which brings me to how the constraints are represented. I'm coding in Java, which has various implementations of a
Set interface to represent sets. The coding path of least resistance would be to toss the indices for each constraint into some class implementing that interface (I generally gravitate to
HashSet). The
Set interface defines an
equals() method to test for equality and a
containsAll() method to test whether another set is a subset of a given set. So this would be pretty straightforward to code.
The catch lies in performance. I have not found it documented anywhere, but I
suspect that adding elements to a
HashSet is $O(n)$ while checking subset status or equality is $O(n^2)$, where $n$ is the number of possible objects (indices). The reason I say $O(n^2)$ for the latter two operations is that, in the worst case, I suspect that Java takes each object from one subset and compares it to every object in the other set until it finds a match or runs out of things to which to compare. That means potentially $O(n)$ comparisons for each of $O(n)$ elements of the first set, getting us to $O(n^2)$.
A while back, I took the excellent (and free) online course "
Algorithms, Part 1", offered by a couple of faculty from my alma mater Princeton University. I believe it was
Robert Sedgewick who said at one point (and I'm paraphrasing here) that sorting is cheap, so if you have any inkling it might help, do it. The binary variables in my model represent selection or non-selection of a particular type of object, and I assigned a complete ordering to them in my code. By "complete ordering" I mean that, given two objects $i$ and $j$, I can tell (in constant time) which one is "preferable". Again, the details do not matter, nor does the plausibility (or implausibility) of the order I made up. It just matters that things are ordered.
So rather than just dump subscripts into
HashSets, I created a custom class that stores them in a
TreeSet, a type of Java set that maintains sort order using the ordering I created. The custom class also provides some useful functions. One of those functions is
isSubsetOf(), which does pretty much what it sounds like:
A.isSubsetOf(B) returns true if set $A$
is a subset of set $B$
and false if not.
In the
isSubsetOf() method, I start with what are called iterators for the two sets $A$
and $B.$ Each starts out pointing to the smallest member of its set, "smallest" defined according to the ordering I specified. If the smallest member of $B$ is bigger than the smallest member of $A$, then the first element of $A$ cannot belong to $B$, and we have our answer: $A\not\subseteq B$. If the smallest element of $B$ is smaller than the smallest element of $A$, I iterate through element of $B$ until either I find a match to the smallest element of $A$ or run out of elements of $B$ (in which case, again, $A\not\subseteq B$). Suppose I do find a match. I bump the iterator for $A$ to find the second smallest element of $A$, then iterate through subsequent members of $B$ (picking up where I left off in $B$, which is important) until, again, I get a match or die trying. I keep doing this until I get an answer or run out of elements of $A$. At that point, I know that $A\subseteq B$.
What's the payoff for all this extra work? Since I look at each element of $A$ and each element of $B$ at most once, my i
sSubstOf() method requires $O(n)$ time, not $O(n^2)$ time. Using a
TreeSet means the contents of each set have to be sorted at the time of creation, which is $O(n\log n)$, still better than $O(n^2)$. I actually did code it both ways (
HashSet versus my custom class) and timed them on one or two moderately large instances. My way is in fact faster. Without having a bit of exposure to computer science (including the Princeton MOOC), though, it would never have occurred to me that I could speed up what was proving to be a bottleneck in my code.