Tuesday, July 13, 2021

Big-M versus Goals (Part II)

This is a continuation of my previous post, in which I discussed the common modeling issue of using a binary variable to enforce or relax a linear constraint. The standard approach to this uses what is commonly known as a "big M" constraint, but there are alternatives. The two that I mentioned in the previous post were a version of Benders decomposition frequently termed "combinatorial Benders cuts" (CBC) and an approach specific to CPLEX using disjunctive "goals". The previous post illustrated this using a mixed integer linear program (MILP) for the two-group discriminant problem.

I've used both "big M" and CBC before, but I had never used goals, and I was curious about the relative performance of the three methods. So I coded the discriminant problem from the previous post in Java (with a slightly different objective function, which is irrelevant to the discussion at hand) and ran the various models using four data sets that I have left over from a paper I wrote 30-ish years ago. The data sets relate to breast cancer (444+239), diabetes (500+268), liver disease (145+200) and (just to prove I don't have a morbid fascination with disease) radar returns (126+225). The numbers in parentheses are the sizes of the two samples.

I ran the problems using CPLEX 20.1 with a 10 minute time limit and default values for all parameters. For the CBC model, I used generic callbacks with a local copy of the LP subproblem for each thread (which will relate to memory use). The table below shows, for each combination of data set and model, the run time (in seconds), the final objective value (lower is better), the final bound (bigger is better) and the approximate peak memory use (in gigabytes, omitted for cases where the memory use was "trivial").

Data Model Time (s.) Objective Bound Memory (GB)
Cancer Big M 2.2 0.0124 0.0124 ---
CBC 6.5 0.0124 0.0124 ---
600 0.0124 0.0056 1.0
Diabetes Big M 600 0.2057 0.0614 0.2
CBC 600 0.3966 0.0598 2.9
600 0.2239 0.0038 4.1
Liver Big M 600 0.2521 0.1243 0.2
CBC 600 0.3244 0.1013 7.6
600 0.2639 0.0169 2.8
Radar Big M 91 0.0190 0.0190 ---
CBC 139 0.0186 0.0186 0.2
600 0.0190 0.0066 0.2


Before forging ahead, I'll note one small anomaly in the results for the radar data set. Both the "big M" and CBC methods got "provably optimal" solutions (lower bound = upper bound), but the CBC solution was slightly better (misclassifying six observations, versus seven for "big M"). I think this is just an issue with tolerance settings (constraint satisfaction tolerance, integrality tolerance or a combination of the two).

I've said on more than one occasion that the only valid generalization about integer programs is that there are no other valid generalizations about integer programs. That won't stop me from looking for possible patterns in the table, but it is important to keep in mind not only the small number of problems tried (four) and the specific nature (two-group discriminant analysis), but also the fact that I have fairly tight, data-based values of $M_i$ for the "big M" models. In two data sets, "big M" and CBC both got proven optimal solutions (with CBC being slower but not particularly slow). Both times, the goal approach also found an optimal solution (or optimal-ish in the radar case), but was nowhere getting the bound close to the optimal value. In the other two cases, goals did better on the primal side (found a better incumbent solution) but CBC did better on the dual side (found a tighter bound). Since CBC is busy hacking off infeasible candidate solutions and the goal approach is trying to satisfy goals, I suspect this pattern might generalize a bit.

On the two more difficult data sets (diabetes and liver disease), both CBC and goals used much more memory than "big M" did. In both cases, portions of the node file were being compressed and possibly shipped off to disk by CBC and goals, whereas "big M" never got to the point of consuming very much memory. I'll confess that I was expecting goals to be much more of a memory hog than CBC, since the goal "stack" gets replicated at every node, but CBC used much more memory than goals on the liver disease data set. Although I did not keep count, my guess is that indicates CBC was cranking out a huge number of Benders cuts. I think the memory use can be mitigated (somewhat) by making the cuts purgeable, but I did not bother to test that.

My last take-away: on the easy problems "big M" was fastest, and on the challenging problems "big M" had both the best incumbent value and the tightest bound. Again, this particular set of problems benefits from fairly tight $M_i$ values. So, henceforth, I will likely stick to "big M" whenever I think I have good values for $M$.

Thursday, July 8, 2021

Big-M versus Goals

Much has been written (including by yours truly) about the trials and tribulations of "big M" integer programming models. A common use of "big M" is to let binary variables turn constraints on or off. So, for instance, $$a^\prime x \le b + Mz\quad (1)$$with $x$ a vector of continuous variables and $z$ a binary variable is intended to enforce $a^\prime x \le b$ when $z=0$ and not enforce it when $z=1$.

Large values of $M$ can contribute to weak relaxations (leading to slow progress on the bound), "leakage" (where a value of $z$ can be close enough to 0 that the solver considers it 0 to within rounding error while making $Mz$ big enough to relax the constraint), and various numerical problems. Small values of $M$ may make the constraint binding when it is supposed to be relaxed. Still, for most people the "big M" approach is the only way they know to model certain problem features.

One alternative, at least in some cases, is to use "combinatorial Benders decomposition" [1], in which the constraints in question are either enforced or ignored in a subproblem depending on the values of $z$ in the master problem. The good news is that this eliminates any worries about choosing $M$, since no coefficients $M$ are needed. The bad news is that (a) Benders decomposition is a bit advanced for many users, (b) it may require custom programming (as opposed to making the model in a high-level language and letting the modeling environment pass it to the solver), and (c) Benders decomposition is an "outer approximation", so the best bound may be a bit leisurely in converging to the optimum.

There is another alternative available, at least with CPLEX. Recent versions of CPLEX support "goals". The user's manual does a masterful job of not actually defining what a goal is -- according to the CPLEX 20.1 manual, goals are things that "allow you to take control of the branch & cut search procedure used by IBM ILOG CPLEX to solve MIP problems". Basically, a goal is an alternative form of a constraint (I think), which rather than explicitly appearing in the constraint matrix is put in a stack of goals, passed to nodes when they are created, and somehow used to influence the creation of child nodes (I think).

The tie-in to today's topic is that one type of goal provided by CPLEX is an "or" goal, which is what it sounds like: a disjunction of two or more constraints or goals. So an alternative to writing constraint (1) with the dreaded $M$ would be to use an "or" goal $$a^\prime x \le b \mathrm{\quad OR \quad} z=1.\quad (2)$$

I was curious about how well this would work, so I tried to do a comparison between "big-M" and goal-based models for a two-group discriminant problem. The gist of the model is as follows. We have as data a sample of vectors $x_i\in \mathbb{R}^n$ from two groups. Let $G_0$ and $G_1$ denote the indices belong to the first and second groups respectively. We want to find coefficients $w\in \mathbb{R}^n$, $w_0 \in \mathbb{R}$ for a linear function $f(x) = w^\prime x + w_0$ such that $f(x) \lt 0$ predicts membership of $x$ in the first group and $f(x) \gt 0$ predicts membership in the second group.

The specific model I started with (from some research I did in my much younger days) includes one more variable $d\ge \delta$ (where $\delta$ is some small positive constant) and binary variables $z_i$ signaling whether an observation is correctly ($z_i =0$) or incorrectly ($z_i=1$) classified. Variable $d$ captures the minimum absolute score of a correctly classified observation, which in essence represents the amount of separation between (correct) scores for the two groups. If $d$ is too small, you may end up classifying observations positive or negative based on what amounts to rounding error, hence the lower bound on $d$.

The "big M" version is as follows: $$\begin{align*} \min\quad\sum_{i}z_{i}-\epsilon d\\ \mathrm{s.t.}\quad w^{\prime}x_{i}+w_{0}+d & \le\phantom{-}M_{i}z_{i}\quad i\in G_{0}\\ w^{\prime}x_{i}+w_{0}-d & \ge-M_{i}z_{i}\quad i\in G_{1}\\ -1\le w & \le1\\ w_{0} & \quad \mathrm{free}\\ d & \ge\delta\\ z_{i} & \in\left\{ 0,1\right\} \quad\forall i. \end{align*}$$The model minimizes the number of misclassifications with a secondary criterion of maximizing separation. The coefficient $\epsilon$ is chosen to keep the objective contribution small enough that the solver is not tempted to make unnecessary misclassifications just to boost the value of $d$. Putting bounds on $w$ prevents huge coefficients for the classifier (which again could result in decisions being made based on rounding error). The model has been shown to work correctly.

The goal version of the model keeps the bounds on the variables and objective function but replaces all the "big-M" constraints with disjunctions of the form $w^\prime x_i +w_0 +d \le 0$ or $z_i=1$ for $i\in G_0$ and similarly for $i\in G_1$. In other words, "classify this observation correctly or pay the price for misclassifying it". I coded both models in Java and ran a test case, expecting both to produce an optimal classifier but unsure which would be faster. There was an unpleasant surprise waiting for me: CPLEX declared the goal-based model unbounded! It was right. You can satisfy all the disjunctions by declaring all the observations misclassified ($z_i = 1$ for all $i$). That lets you choose an arbitrarily large value for $d$, large enough that $\epsilon d$ is arbitrarily bigger than the sum of the $z$ variables, making the objective arbitrarily negative.

This is not a problem with the "big M" model, because no matter how large you make $M_i$, you still have a finite bound on the left side of each constraint. The fix was to come up with a defensible upper bound for $d$ and add it to the goal model, making the goal model bounded. With that fix in place, both models arrived at optimal solutions, in what was comparable time for the one test case I have run so far.

So the takeaway here is that if you want to use disjunctions to avoid "big M", you may need to take extra care to ensure that your model is bounded.

[1] Codato, G. and Fischetti, M. Combinatorial Benders' Cuts for Mixed-Integer Linear Programming. Operations Research 54(4), 2006, 756-766.

Monday, July 5, 2021

Vertex Numbering via GA

A question on OR Stack Exchange asks how to number vertices in a layered graph so that the endpoints of edges have similar numbers. Note that "numbering" the vertices here means numbering within layers (so that, for instance, every layer has a vertex numbered 1). We will assume that every node has a unique ID before we launch into the numbering process. The author of the question chose the sum of squared differences of endpoint labels for all edges as the objective function to minimize. The following image shows an example of a layered graph with a (probably suboptimal) numbering scheme. The numbers on the edges are their contributions to the objective function.

example graph with three layers

The prolific Rob Pratt noted that the problem can be modeled as an assignment problem with a quadratic objective function, using binary variables. That model produces an exact solution, given sufficient time and memory.

Note that numbering the nodes within a layer is equivalent to picking one of the possible permutations of the node IDs. The author of the question indicated receptiveness to a metaheuristic, so I decided to try coding a random key genetic algorithm (RKGA) for the problem. I've mentioned RKGAs before (for instance, here and here). As I understand it, they were originally designed for sequencing / scheduling problems, where things need to be permuted optimally, so an RKGA seemed like a natural choice. I coded both the integer programming (IP) model and the RKGA in Java, using CPLEX 20.1 as the IP solver and Watchmaker Framework 0.7.1 for the GA. The Watchmaker Framework has not been under active development for quite a few years, but it works well.

To use an RKGA, you need to come up with a coding for a "chromosome" (candidate solution) and a mechanism for decoding the chromosome into a solution to the original problem (in this case separate vertex permutations for each graph layer) such that the decoded chromosome is always feasible. I chose as my chromosome representation a double-precision vector with elements between 0 and 1, having one element per vertex in the original graph. Double-precision is probably overkill, but I'm in the habit of using double-precision rather than single-precision, so it was the path of least resistance. To decode the chromosome, I first had to chop it up into smaller vectors (one per layer) and then extract the sort index of each smaller vector. So, using the image above as an example, a chromosome would be a double vector with 2 + 5 + 3 = 10 components. If the last three components were (0.623, 0.021, 0.444) the sort indices would be (3, 1, 2), yielding the vertex numbering for the last layer in the image. To convert a double vector into a sort index vector, I used a Java library (ValueOrderedMap, described in this post) that I wrote some time back.

A GA can "stagnate", meaning cease to improve on the best known solution. One obvious reason for stagnation is that it cannot improve on an optimal solution, but stagnation can occur for other reasons. On small test problems, the GA tended to stagnate rather quickly, so I set a stagnation limit and put the GA in a loop that would restart it up to nine times or until a specified time limit (five minutes ... I wasn't very ambitious). On larger test problems, it was not as quick to stagnate, but it eventually did.

I used the GA's best solution as a starting solution for the IP model. On smaller test problems, CPLEX occasionally closed the gap to zero within five minutes but usually did not. On larger problems, CPLEX struggled to get the best bound above zero (100% gap) within five minutes. So I suspect the IP model is prone to loose bounds. An example of a "small" test problem is one with 29 vertices spread across five layers and 35 edges. (I worked mainly with sparse examples, to keep the size of the IP model down.)

Interestingly, in none of the trials did CPLEX improve on the initial solution provided by the GA. So it might be that the GA was consistently finding an optimum, although I cannot prove that. (On the few smaller instances where CPLEX reached provable optimality, the optimal solution was indeed the GA solution.) Note that, while the GA solution appeared optimal, that does not mean that each GA run produced an optimum. On the 29 vertex example mentioned above, CPLEX found a provable optimum with objective value 111. The GA ran 10 times (using about 92 seconds total), and of the 10 runs two stagnated at 111, five at 112, and three at 114. So even if the GA is prone to finding optimal solutions, it may require multiple starts to do so.

Thursday, July 1, 2021

Smallest Pairwise Difference in R

 In a recent blog post, OR blogger Erwin Kalvelagen included some R code for a genetic algorithm, including an objective function that takes as input a vector $x\in\Re^n$ and returns the smallest absolute pairwise difference in its elements. The actual objective calls for the smallest difference in consecutive values, but Erwin correctly notes that the smallest absolute difference in any pair will automatically occur in consecutive values, so the "all pairs" approach yields the same objective value (and is considerably friendlier in the MIP and MINLP models he proposes).

I did a little experimenting and confirmed his opinion that the GA is unlikely to be competitive with the MIP model. That led me to a tangent involving the way the objective function for the GA is coded. Erwin used the obvious approach: two nested for loops to find all pairs of values once, avoiding comparing $x_i$ with $x_j$ and then (redundantly) $x_j$ with $x_i$, and avoiding comparing $x_i$ with itself. This approach does $O(n^2)$ work. An alternative approach is to first sort the vector $x$, which takes $O(n \log n)$ work, and then compute consecutive differences and their minimum value ($O(n)$). I put together a little R code to compare the two, and unsurprisingly the method with sorting is faster (and much faster when $n$ gets big).

There is another wrinkle to this. I've seen a number of articles and comments online asserting that explicit looping in R (as with Erwin's nested for loops) is inefficient, and should be avoided at all costs in favor of using vectorized functions (where the looping is presumably coded in C or C++ and baked into the compiled function). I've also seen contrary opinions saying that the concern about looping is overstated. There is also, I think, a middle ground: even if explicit loops are inefficient, that's likely only a concern if you are looping over very large objects, or looping many times, or both. Erwin's test case has $n=50$, which I suspect is not large enough to be worth worrying about the efficiency or inefficiency of looping (even though the GA will evaluate the objective function repeatedly).

So to test this, I cobbled together an R notebook (which you can find here) that tests all three versions of the objective function on vectors of various dimensions. As I thought, the sorted method dominates. Even at $n=50$ it's competitive with looping (although looping appears to be slightly faster), but at $n=2,000,000$ the sorting method takes about one third of a second on my PC (using R 4.1), whereas I estimate the nested loop method would take about 10 days (!).

The third method uses the (vectorized) outer product operator in R. It computes all $n^2$ absolute differences, whereas the nested loops calculate the $n(n-1)/2$ unique (nontrivial) differences, so in theory it does about twice the work the nested for loops do. Despite that, it is faster than the nested loops (though not nearly as fast as sorting). For $n$ around 5,000 to 10,000, the outer product approach seems to be about 10x faster than the nested loops.

So I take away two conclusions from this. The first is confirmation of a bit of wisdom that I believe I heard from Princeton computer scientist Robert Sedgewick in a MOOC based on his book (with Kevin Wayne) "Algorithms". Sorting is computationally cheap, so if you think it might help with a calculation, try it. The second is confirmation of the assertion that, even in the latest version (4.1) of R, explicit looping is probably slower, and quite possibly by a noticeable amount, than using vectorized methods ... when there are vectorized methods available.

If you are curious about the code for the alternative methods, it's embedded in the notebook.