A current research project involves optimization models containing large numbers of what are basically set covering constraints, constraints of the form $$\sum_{i\in S} x_i \ge 1,$$ where the $x_i$ are binary variables and $S$ is some subset of the set of all possible indices. The constraints are generated on the fly (exactly how is irrelevant here).

In some cases, the same constraint may be generated more than once, since portions of the code run in parallel threads. Duplicates need to be weeded out before the constraints are added to the main integer programming model. Also, redundant constraints may be generated. By that, I mean we may have two cover constraints, summing over sets $S_1$ and $S_2$, where $S_1 \subset S_2$. When that happens, the first constraint implies the second one, so the second (weaker) constraint is redundant and should be dropped.

So there comes a "moment of reckoning" where all the constraints generated by all those parallel threads get tossed together, and duplicate or redundant ones need to be weeded out. That turns out to be a rather tedious, time-consuming operation, which brings me to how the constraints are represented. I'm coding in Java, which has various implementations of a Set interface to represent sets. The coding path of least resistance would be to toss the indices for each constraint into some class implementing that interface (I generally gravitate to HashSet). The Set interface defines an equals() method to test for equality and a containsAll() method to test whether another set is a subset of a given set. So this would be pretty straightforward to code.

The catch lies in performance. I have not found it documented anywhere, but I

*suspect*that adding elements to a HashSet is $O(n)$ while checking subset status or equality is $O(n^2)$, where $n$ is the number of possible objects (indices). The reason I say $O(n^2)$ for the latter two operations is that, in the worst case, I suspect that Java takes each object from one subset and compares it to every object in the other set until it finds a match or runs out of things to which to compare. That means potentially $O(n)$ comparisons for each of $O(n)$ elements of the first set, getting us to $O(n^2)$.

A while back, I took the excellent (and free) online course "Algorithms, Part 1", offered by a couple of faculty from my alma mater Princeton University. I believe it was Robert Sedgewick who said at one point (and I'm paraphrasing here) that sorting is cheap, so if you have any inkling it might help, do it. The binary variables in my model represent selection or non-selection of a particular type of object, and I assigned a complete ordering to them in my code. By "complete ordering" I mean that, given two objects $i$ and $j$, I can tell (in constant time) which one is "preferable". Again, the details do not matter, nor does the plausibility (or implausibility) of the order I made up. It just matters that things are ordered.

So rather than just dump subscripts into HashSets, I created a custom class that stores them in a TreeSet, a type of Java set that maintains sort order using the ordering I created. The custom class also provides some useful functions. One of those functions is isSubsetOf(), which does pretty much what it sounds like: A.isSubsetOf(B) returns true if set $A$ is a subset of set $B$ and false if not.

In the isSubsetOf() method, I start with what are called iterators for the two sets $A$ and $B.$ Each starts out pointing to the smallest member of its set, "smallest" defined according to the ordering I specified. If the smallest member of $B$ is bigger than the smallest member of $A$, then the first element of $A$ cannot belong to $B$, and we have our answer: $A\not\subseteq B$. If the smallest element of $B$ is smaller than the smallest element of $A$, I iterate through element of $B$ until either I find a match to the smallest element of $A$ or run out of elements of $B$ (in which case, again, $A\not\subseteq B$). Suppose I do find a match. I bump the iterator for $A$ to find the second smallest element of $A$, then iterate through subsequent members of $B$ (picking up where I left off in $B$, which is important) until, again, I get a match or die trying. I keep doing this until I get an answer or run out of elements of $A$. At that point, I know that $A\subseteq B$.

What's the payoff for all this extra work? Since I look at each element of $A$ and each element of $B$ at most once, my isSubstOf() method requires $O(n)$ time, not $O(n^2)$ time. Using a TreeSet means the contents of each set have to be sorted at the time of creation, which is $O(n\log n)$, still better than $O(n^2)$. I actually did code it both ways (HashSet versus my custom class) and timed them on one or two moderately large instances. My way is in fact faster. Without having a bit of exposure to computer science (including the Princeton MOOC), though, it would never have occurred to me that I could speed up what was proving to be a bottleneck in my code.

Hi Paul: nice read. One quick question: why does "throw it to the solver" strategy not work? Wouldn't most of the current gen solvers have internal routines for weeding out duplicate, or weaker constraints through presolve?

ReplyDeleteRegards.

The name "TreeSet" sounds like you are storing the indices in a tree somehow. I think a sorted array (int[]) would be even faster due to memory layout.

ReplyDeleteI'm not sure this is a good example of the usefulness of computer science. If you just go by the observation that hash table insertion is O(n) in the worst case, you should never use them for anything; but of course this worst case only holds with carefully crafted malicious inputs, and in expectation it is O(1). So you should only expect the TreeSet to be asymptotically faster if there are for some reason a lot of hash collisions in HashSet. I suspect what you observe is mostly an artifact of the actual implementations in Java and it might as well be different with some other implementation.

ReplyDeleteWhere computer science might be more useful is by observing, at a higher level, that "keep a family of sets while avoiding adding subset" is a well-known problem known as "finding extremal sets". Unfortunately, in this case, no fundamentally faster algorithm than trying all pairs is known, but there are good heuristics (see e.g. https://arxiv.org/abs/1508.01753). It might be that this is overkill for your application, though.