Sunday, December 16, 2018

Of Typewriters and Permutations (V)

Okay, this is the last post on the subject. I promise! If you're coming into this movie on  the last reel, you may need to skim the last few posts to see what it's about. I'm trying not to repeat myself too much.

To summarize where we are at: Hardmath123 posted a solution (generated by a heuristic) that might or might not be optimal; a constraint programming model gets that solution very quickly and fails to improve on it to the extent of my experience (I let it run four hours); and the MIP1 model was the least worst of the various optimization models I tried, but it had rather loose bounds.

Rob Pratt, in a comment to the second post, suggested a new constraint that does improve the bound performance of MIP1. Recall that the variables in MIP1 are $x_{i,j}$ (binary, does symbol $i$ go in slot $j$), $p_i$ (the position where symbol $i$ gets parked), and $d_{ij}$ (the distance between symbols $i$ and $j$ in the layout). Rob's constraint is$$\sum_{i} \sum_{j>i} d_{ij} = \frac{n^3 - n}{6}$$where $n$ is the size of the symbol set (26 in our case). I'll skip the derivation here; it's based on the observation that the sum of the distances between all pairs of positions is constant, regardless of who fills those positions.

That got me thinking, and I came up with a set of constraints that are similarly motivated. Let $\delta_j$ be the sum of the distances from position $j$ to all other positions. Since we are using zero-based indexing, there are $j$ positions to the left of position $j$ with distances $1,2,\dots,j$ (looking right to left) and $n -1-j$ positions to the right of $j$, with distances $1,2,\dots, n-1-j$ (looking left to right). Using the well known formula for summing a sequence of consecutive integers,$$\delta_j = \frac{j(j+1)}{2}+\frac{(n-1-j)(n-j)}{2}.$$It follows that if symbol $i$ lands in position $j$, $\sum_j d_{ij} = \delta_j$. Considering all possible landing spots for symbol $i$ leads to the following constraints:$$\sum_k d_{ik} = \sum_j \delta_j x_{ij}\quad \forall i.$$A little experimentation showed that the combination of those constraints plus Rob's constraint improved MIP1 more than either did alone. That's the good news. The bad news is that I still haven't been able to prove optimality for Hardmath123's solution. I ran MIP1 with the extra constraints (using branching priorities, which help), using the Hardmath123 solution as the start and changing the CPLEX emphasis setting to 3. That last bit tells CPLEX to emphasize pushing the bound, which makes sense when you think you have the optimal solution and you're trying to get to proof of optimality. After 338 minutes (5.6 hours), the incumbent had not improved (so it's looking good as a possible optimum), but the gap was still around 14.5%, and the search tree was north of 1.25 GB and growing. To get a sense of what I might call the "futility index", it took eleven minutes to shrink the gap from 14.58% to 14.48%.

Here are a few final takeaways from all this.
  • There are often multiple ways to write a MIP model for a discrete optimization problem. They'll often perform differently, and (at least in my experience) it's hard to anticipate which will do best.
  • Constraint programming has potential for generating good incumbent solutions quickly for some discrete optimization problems, particularly those whose structure fits with the available selection of global constraints in the solver. Permutation problems will meet this criterion. CP may not be particularly helpful for proving optimality, though, since the bounding is weak.
  • Seemingly redundant constraints can help ... but there's no guarantee. (Rob suggested an additional constraint in a comment to part III. I tried it, and it slowed things down.)
  • Some days you get the bear, and some days the bear gets you.

Saturday, December 15, 2018

Of Typewriters and Permutations (IV)

I'm continuing the recent theme of solving a quadratic assignment problem that lays out the 26 letters of the English alphabet on a one-dimensional "keyboard" for an 18th century typewriter. I thought this would be the last post, but something new turned up, so there will likely be one more.

In the blog post that started all this, Hardmath123 found a solution (via a heuristic) with cost 5,499,341. Using frequency data from Nate Brixius, I get a slightly higher cost (5,510,008), which is the value I will use for it (because, hey, it's my blog). Hardmath123 suspected but could not prove that this layout is optimal. I still can't verify optimality (but maybe in the next post I will ... or maybe not).

In my previous epistles on this subject, I tried out three MIP models and a quadratic (integer) program. In five minute runs using a beta copy of the next version of CPLEX, the best I was able to do was a solution with objective value 5,686,878.

Out of curiosity, I tried a constraint programming (CP) model (using CP Optimizer, the constraint solver in the CPLEX Studio suite). Constraint programming is best suited to models with integer variables (which we have here), although it can handle floating-point variables to some extent. It is well suited to logic constraints, and in particular is popular in scheduling, but it's not my go-to tool in part because it tends to have very weak bounding.

Defining the constraint programming model for the keyboard problem (denoted "CP" in my code) is fairly straightforward, but rather different from any of the MIP/QP models. We use general integer variables rather than binary variables to determine the position assignments. So $p_i\in \{0,\dots,25\}$ is the position of symbol $i$ in the layout, $s_j \in \{0,\dots,25\}$ is the symbol at position $j$, and $d_{ik}\in \{0,\dots,25\}$ is the distance between symbols $i$ and $k$. (Recall that we are using zero-based indexing.) As before, the objective is$$\min \sum_i \sum_j f_{ij}d_{ij},$$where $f_{ij}$ is the frequency with which symbol $j$ follows symbol $i$.

Where it gets interesting is in specifying the constraints. Although CP is designed for logic constraints ("or", "and", "if-then"), it's real power comes from what they call "global constraints", specialized constraints that tie multiple variables together. Arguably the most common global constraint is "all-different", meaning that in a group of variables no two of them can take the same value. It is the key to defining permutations, and in our model we add all-different constraints for both $p$ (no two symbols have the same position) and $s$ (no two slots have the same symbol). One or the other of those is technically redundant, but redundant constraints in a CP model can help the solver, so it's probably worth including both.

Another global constraint provided by CP Optimizer is "inverse". CP models allow variables to be used to index other variables, something we cannot do directly in math programming modes. The inverse constraint, applied to two vector variables $u$ and $v$, says that $u_{v_i}=i$ and $v_{u_i}=i$. In our model, we can apply the inverse constraint to $p$ and $s$. In effect, it says that the symbol in the position occupied by symbol $i$ is $i$, and the position of the symbol in position $j$ is $j$.

Finally, we can define the distance variables using the position variables:$$d_{ij} = |p_i - p_j|\quad \forall i,j.$$ How does the CP model perform? As I expected, the bounding is pretty bad. In a five minute run, the lower bound only makes it to 447,795 (91.87% gap). The good news is that the CP model finds Hardmath123's solution with objective value 5,510,008 ... and finds it in about 22 seconds on my PC! This is using the default search strategy; I did not assign branching priorities.

The take-away here, from my perspective, is that in some problems a CP model can be a great way to generate an optimal or near-optimal incumbent, which you can use as a starting solution in an optimization model if you need to prove optimality.

Friday, December 14, 2018

Of Typewriters and Permutations (III)

This continues the discussion (okay, monologue) from the two previous posts about the problem of laying out a one-dimensional typewriter keyboard. This is not the last post in the series, but I can at least guarantee that the series is converging.

In the previous post, I gave a MIP model (named MIP1) that used binary variables $x_{ij}$ to signal whether or not symbol $i$ was placed at position $j$ on the keyboard. It also used auxiliary variables $p_i$ for the position of symbol $i$ and $d_{ij}$ for the distance between symbols $i$ and $j$.

You might wonder whether those auxiliary variables are worth having. I did, after the somewhat disappointing (to me) lack of progress in the lower bound when I ran MIP1. So I omitted the position variables in a new model, denoted MIP2 in the code. (Reminder: My Java code is available and free to play with.) I kept the distance variables because they make it easier to define the objective function.

MIP2 contains the $x$ and $d$ variables from MIP1, along with the constraints that rows and columns of the $x$ matrix sum to 1. It also retains the objective function from MIP1. To tie $d_{ij}$ to $x_{i\cdot}$ and $x_{j\cdot}$, MIP2 contains the following constraints:$$d_{ij} \ge |m-k|\left(x_{ik} + x_{jm} -1\right)\quad \forall i\neq j, \forall k\neq m.$$These constraints can be entered as "normal" constraints or, using an option in the command line, as lazy constraints (meaning they'll be held off to side and activated whenever a node LP solution violates one of them). Since the distances are symmetric, we actually need only about half of these. Rather than risking an indexing error in my code, I added constraints of the form$$d_{ij}=d_{ji}\quad \forall i, \forall j<i$$and let the CPLEX presolver take care of eliminating the redundant constraints for me.

Was MIP2 more efficient than MIP1? Nope, it was worse ... much worse. In a five minute run, the lower bound got stuck at zero, leaving the gap at 100%. (With the cut Rob Pratt suggested in a comment to the previous post, the lower bound got off zero but froze at 192,550, so that the gap was "only" 97.13% by the end of five minutes.

Having no luck with MIP2, I went back to MIP1. In an effort to improve its performance, I made one tweak: rather than putting the branching priorities on the assignment variables ($x$), I declared the position variables ($p$) to be integer (in MIP1 they were continuous) and put the branching priorities on them. Thus higher priority was given to pinning down the positions of higher usage characters. This model is designated "MIP3" in the code. It did a fair bit worse than MIP1.

As mentioned in the post by Nate Brixius (and in my first post), the underlying problem looks like a quadratic assignment problem (QAP). So I tried a model with a quadratic objective function (identified as "QP" in the code). It contains only the $x$ variables and row/column sum constraints from MIP1. The objective function is$$\min \sum_i \sum_{j\neq i} \sum_k \sum_{m\neq k} f_{ij} |k - m| x_{ik} x_{jm}.$$This model had a better incumbent than MIP2 but a worse incumbent than either MIP1 or MIP3, and the lower bound was stuck at zero. (Since there are no distance variables, Rob Pratt's cut is not applicable here.)

Here is a summary of results after five minutes, using emphasis 2 (optimality), branching priorities, and Rob's cut:

Model Incumbent Bound Gap
MIP1 5,686,878 2,970,708 47.76%
MIP2 6,708,297 192,550 97.13%
MIP3 5,893,907 1,503,828 74.49%
QP 6,056,668 0 100.00%

I'm not quite done. There's one more model to try, coming up in what will hopefully be my last post on this.

Thursday, December 13, 2018

Of Typewriters and Permutations (II)

This continues my previous post about the problem of optimally laying out a one-dimensional typewriter keyboard, where "optimally" is taken to mean minimizing the expected amount of lateral movement to type a few selected books. As I noted there, Nate Brixius correctly characterized the problem as a quadratic assignment problem (QAP). I'll in fact try out a quadratic model subsequently, but my inclination is always to try to linearize anything that can't outrun or outfight me. So I'll start by discussing a couple of mixed integer linear program formulations.

The starting point for all the math programming formulations is a matrix of binary variables $x_{ij}\in \{0,1\}$, where $x_{ij}=1$ if and only if symbol $i$ is placed in slot $j$ in the keyboard. (Consistent with Nate's post, I'll be using zero-based indexing, so symbol index $i=0$ will correspond to the letter "A" and position index $j=0$ will correspond to the left edge of the keyboard.) Since each letter needs to be placed exactly once, we need the constraints $$\sum_{j=0}^{25}x_{ij} = 1\quad \forall i.$$Similarly, each slot can only contain one character, so we need the constraints $$\sum_{i=0}^{25} x_{ij} = 1\quad \forall j.$$Each row and column of $x$ can also be declared to be a type 1 specially ordered set (SOS1), but in CPLEX that tends to be useful only if you can assign "meaningful" weights to the variables in each set. I'll return to that later.

Recall from the previous post that we have a $26\times 26$ matrix $f$, where $f_{ij}$ is the frequency with which symbol $j$ immediately follows symbol $i$. We can also get an overall frequency with which each symbol is used by summing the corresponding row and column of $f$. I'll denote the use frequency for symbol $i$ by $g_i$, where $g_i=\sum_{j=0}^{25}(f_{ij} + f_{ji})$. I'll use that for a couple of things, one of which is to eliminate a bit of symmetry. As I noted in that previous post, if we reverse any layout (treat it as listing symbols from right to left rather than from left to right), we get the same objective value as that of the original layout. We can break that symmetry by selecting one symbol and arbitrarily requiring it to be in the left half of the keyboard. Although it probably does not make much difference to the solver which symbol we use, I'll selecting the most frequently used symbol. So let $g_{i^*}=\max_i g_i$ (breaking ties arbitrarily). We will add the constraints $x_{i^*j}=0$ for $j=13,\dots,25$.

Back to the SOS1 stuff. When you declare an SOS1 constraint in CPLEX, CPLEX wants weights. It uses the weights to do some branching on the sets. Branching at a node typically means selecting an integer variable and splitting its domain to create two children (generally by rounding the value of the variable in the node solution up or down). With an SOS1 constraint, CPLEX can partition the set of variables involved into two subsets. In either child node, one subset of variables is fixed at zero and the other subset remains unfixed. The weights are used to help CPLEX decide how to split the set of variables. Here, we can try declaring each column of $x$ to be an SOS1 using the cumulative frequencies. So we tell CPLEX for each $j$ that $(x_{0,j}, x_{1,j},\dots,x_{25,j})$ is an SOS1 with corresponding weights $(g_0, g_1,\dots, g_{25})$. In the code I posted, using SOS1 constraints is optional.

Another option in the code is to assign branching priorities to the variables. This encourages CPLEX to branch on variables with higher priorities before branching on variables with lower priorities. If you were laying out the keyboard heuristically, you would probably want to put high usage symbols ("Hello, 'e'!") toward the center of the keyboard, where they would be easy to reach, and lower usage symbols ("q"?) toward the edges. So I assigned to each variable $x_{ij}$ the priority $g_i \times \min(j, 25-j)$. Again, this is an option in the code.

If you're still awake at this point, you'll realize that I have not yet specified an objective function, which is where the linearization is needed. In my first MIP model ("MIP1" in the code), I did this by introducing a bunch of auxiliary variables. First, for each $i$ let $p_i\in [0,25]$ denote the position (slot) that symbol $i$ occupies. We define $p$ with the constraints $$p_i =\sum_{j=0}^{25} j \times x_{ij} \quad \forall i.$$Note that the $p_i$ do not need to be declared integer-valued. Armed with them, I can define another set of continuous variables $d_{ij}\in [0,25]$ for all $i$ and $j$, where $d_{ij}$ will be the distance between symbols $i$ and $j$ in the layout. Since $d_{ij}=|p_i - p_j|$ and we cannot use absolute values explicitly, we do the standard linearization of the absolute value function, adding the constraints $$d_{ij}\ge p_i - p_j \quad \forall i,j$$and$$d_{ij}\ge p_j - p_i \quad \forall i,j.$$(Yes, I know that when $i=j$ this gives me two copies of $d_{ii} \ge 0$. I'll let the presolver take care of that.) Now we have a very simple, clean expression of the objective function:$$\min \sum_{i=0}^{25}\sum_{j=0}^{25} f_{ij} d_{ij}.$$
How does this model do? I ran it for five minutes on a decent desktop PC (using four threads). I included both the branching priorities and the SOS1 constraints, but the CPLEX presolver eliminated all the SOS1 constraints as "redundant". It did that even if skipped the branching priorities, which irks me a bit. Someday maybe I'll figure out why it's ignoring those carefully defined SOS1 weights. At any rate,  I did the five minute run with MIP emphasis 2, which emphasizes proving optimality. After five minutes, the incumbent solution had objective value 5,706,873. That's a bit worse than the solution Hardmath123 got in the original post. (Speaking of which, Hardmath123 quoted an objective value of 5,499,341 and posted a layout. I get a value of 5,510,008 for that solution. It may be that Nate's frequency data, which I'm using, differs slightly from the frequency data Hardmath123 used.)

Unfortunately, after five minutes the gap was still 56.55%, and closing very slowly. (The last two minutes of that five minute run only closed the gap from about 57.5% to 56.5%.) I'm pretty sure the actual optimal value will be a lot closer to 5.5 million that to the last lower bound in the five minute run (2,479,745). So we're contending with a somewhat weak bound.

Update: A longer run, using MIP emphasis 3 (which focuses on improving the lower bound), still had a gap of 33% after four hours.

The incumbent was found after a bit less than four minutes (which will be relevant as I explore other models, in future posts). Still more to come on this ...

Wednesday, December 12, 2018

Of Typewriters and Permutations (I)

This is going to be the first of a few posts on the same problem. My efforts are based on a blog post by Nate Brixius (@natebrix) titled "Optimizing 19th Century Typewriters", which in turn is based on a post titled "Tuning a Typewriter" by "Hardmath123". As the image in Hardmath123's post shows, an old (very old!) Crown typewriter used a "keyboard" that had all the symbols in a single row. You shifted the row left or right to get to the symbol you wanted next, then pressed a key to enter it. There was a second key, the equivalent of "shift" on today's keyboards, that let you strike an alternate symbol perched above the primary symbol at each position in the lone row.

If we ignore numerals and punctuation marks (which both Hardmath123 and Nate did), the problem of finding an optimal layout for the keyboard amounts to finding the best permutation of the 26 letters A through Z. The criterion we are all adopting for "best" is requiring the least movement, in some average sense, to get to the next letter. This requires some estimate of how often each possible transition occurs. Hardmath123 and Nate estimated those transition frequencies from three books in the Project Gutenberg library. Since Nate was kind enough to share his frequency data with me, I'm indirectly using those same books.

To be consistent with Nate's post, I'll use zero-based indexing, so "A" will have index 0 and "Z" will have index 25. Similarly, the leftmost slot on the keyboard will have index 0 and the rightmost will have index 25. The metric we have all adopted is to minimize $\sum_{i=0}^{25} \sum_{j=0}^{j=25} f_{ij} |p_i - p_j|$, where $f_{ij}$ is the frequency with which symbol $j$ immediately follows symbol $i$ and $p_i$ is the position in which symbol $i$ is located (so that $|p_i - p_j|$ is the distance shifted moving from symbol $i$ to symbol $j$). Note that $i = j$ is perfectly fine, since a letter can follow itself.

A brute force solution would require $26!/2 \approx 2\times 10^{26}$ layouts to be evaluated. The division by 2 is because the problem is bilaterally symmetric: given any layout, the reverse of that layout will have the same objective value. (This relies implicitly on the fact that we do not differentiate between moving $n$ spaces to the left and moving $n$ spaces to the right.) Hardmath123 applied a heuristic that I would characterize as a version of 2-opt (with restarts) and found a solution with cost 5,499,341. Nate ran a heuristic by a third party and matched Hardmath123's result "within seconds". He also modeled the problem as a quadratic assignment problem (more on that later) and ran a solver he'd written himself, back in the day ... but did not let it run to completion, and did not say how good the incumbent was when he terminated the run.

I tried several models, in an attempt (as yet unsuccessful) to get a provably optimal solution. All the models used a beta copy of version 12.9 of IBM's CPLEX Studio, but perform similarly with the current production version (12.8). In subsequent posts, I'll document the various models I tried and give some results. Meanwhile, you are welcome to play with my Java code, the repository for which is publicly available. In addition to my Java code and the usual miscellany (README file, license file, ...), you'll find a text file (cro26_data.txt) with the frequencies being used, which Nate kindly is letting me share. To run the code, you will need a recent version of the CPLEX Studio suite, and you will need to put all three major pieces (CPLEX, CPOptimizer and OPL) on your library and class paths. You'll also need the open source Apache Commons CLI library, which I use to process command line options. Once you have the code installed and linked up, you can run the program with the command line option -h or --help (that's a double dash before "help") to get a list of what can/should be specified as options.

More to come ...

Friday, November 9, 2018

Stepwise Regression Code Revisited

I've added a few more tweaks to the stepwise regression code I published back in 2011. (If you wish, see here for the original post and here for a subsequent update.) The code does stepwise regression using F tests (or, equivalently, p-values of coefficients), which is a bit old fashioned but apparently how it is still taught some places. The latest update supplies default values for the alpha-to-enter and alpha-to-leave values. The default values (0 and 1 respectively) are consistent consistent with forward and backward stepwise. For forward stepwise, you would start with a bare-bones initial model, set your alpha-to-enter, and omit alpha-to-leave. For backward stepwise, you would start with a large initial model, set alpha-to-leave and omit alpha-to-enter. Both are demonstrated in the notebook.

The update also allows you to use the R shortcut of typing "." in a formula (meaning "all variables except the dependent variable"). The "." shortcut only works if you specify the data source as an argument to the function. You cannot use "." while omitting the data argument and relying on having the data source attached. Again, there are demonstrations in the notebook.

The code is free to use under a Creative Commons license. It comes in the form of an R notebook, which both defines the stepwise() function and does some demonstrations. From that web page, you should be able to download the notebook file using the select control labeled "Code" in the upper right corner. You can also get the files from my Git repository. The Git repository also has an issue tracker, although I think you will need to create an account in order to add an issue.

Monday, October 29, 2018

Pseudocode in LyX

Fair warning: This post is for LyX users only. When I'm writing a paper or presentation in LaTeX (using LyX, of course) and want to include a program chunk or algorithm in pseudocode, I favor the algorithmicx package (and specifically the algpseudocode style). There being no intrinsic support for the package in LyX, I have to enter the various commands as raw LaTeX (or, in LyX parlance, "ERT" -- "Evil Red Text", so named for historical reasons). Unfortunately, I do this seldom enough that I do not remember said commands, so each sojourn into pseudocode involves finding and opening the documentation file for algorithmicx.

I finally decided to fix that by writing a pseudocode module for LyX. The beta version (which seems pretty stable to me) is available under a GPLv2 license from my Github repository. Besides a "gitignore" file (which you should, as the name suggests, ignore), there is the module file and a combination user guide/demonstration LyX document. Installation instructions are included in the latter. You will, of course, need to install the algorithmicx package before using the module.

Should you decide to try it, there's an issue tracker in the repository where you can report bugs or feature requests. I hope someone else can get some use out of this, but even if not, the development time will pay itself back the next time I need to write some pseudocode.

Sunday, October 28, 2018

B.S.-ing Precisely

In a recent blog post titled "Excessive Precision", John D. Cook points out the foolishness of articulating results to an arbitrarily high degree of precision when the inputs are themselves not that precise. To quote him:
Excessive precision is not the mark of the expert. Nor is it the mark of the layman. It’s the mark of the intern.
He also mentions that it is typically difficult to assess the precision of the output even if we know the precision of the input. This is in part a matter of possible nonlinearity (and quite possibly opaqueness, as in "black box") in the mechanism that transforms inputs into outputs. It can also be caused by the inherent imprecision of floating point arithmetic (rounding error, truncation error, spiteful quantum effects, ...).

In operations research, there are myriad other sources of imprecision. Your conceptual model of the system is an approximation of the actual system. You may have linearized things that are not linear, or used "convenient" nonlinear representations (polynomials, exponential functions) for things that are nonlinear in a goofy way. If you are like me, you will have ignored randomness entirely, because the work is hard enough even when you pretend determinism. (Also, I confuse easily.) If you did bake in some explicit consideration of randomness, you likely approximated that. (How many things in life are really normally distributed?) There's also the matter of the passage of time. Did you make all the coefficients, distributions etc. time-varying? Didn't think so (and don't really blame you). Throw in estimation error for the model parameters and you have a witches' brew of imprecision. (It's almost Halloween, so I had to work in at least one spooky metaphor.)

This brings to mind two running questions that I encounter with discrete optimization models. I doubt either has a definitive answer. The first question is whether there is any benefit to running the solver all the way to "proven optimality" (gap nearly zero) if everything is so approximate. My personal preference is to do it when it doesn't take too long (why not?) but not bother if the gap is closing at a painfully slow rate.

The second question is whether to bother using a MIP model and solver at all, or just run some metaheuristic (preferably one with a cool biologically-inspired name, such as "banana slug foraging optimization"). After all, if you are just getting an answer to an approximation of the actual problem, why not get an approximate answer to the approximation? My personal inclination is to solve the model exactly when possible, in the hope that the model is at least "close" to reality and thus an optimal solution to the model will be "close" to an optimal solution to the real problem. At the same time, I recognize that there is a lot of hoping going on there, so if getting an exact solution is painful, I'll switch to a heuristic or metaheuristic and hope that the answer I get proves decent in practice.

Either way, I'm not going to claim the "optimal" value of some variable is 2.31785 with any degree of certainty. It's about 2.3 (if we're lucky). On the bright side, a lot of the problems I deal with have binary variables. There's not a lot of concern there about artificial precision; the only concern is artificial confidence in the solution.

Friday, September 21, 2018

Coordinating Variable Signs

Someone asked me today (or yesterday, depending on whose time zone you go by) how to force a group of variables in an optimization model to take the same sign (all nonpositive or all nonnegative). Assuming that all the variables are bounded, you just need one new binary variable and a few constraints.

Assume that the variables in question are $x_1,\dots,x_n$ and that they are all bounded, say $L_i \le x_i \le U_i$ for $i=1,\dots,n$. If we are going to allow variables to be either positive or negative, then clearly we need $L_i < 0 < U_i$. We introduce a new binary variable $y$ and, for each $i$, the constraints$$L_i (1-y) \le x_i \le U_i y.$$If $y$ takes the value 0, every original variable must be between its lower bound and 0 (so nonpositive). If $y$ takes the value 1, every original variable must be between 0 and its upper bound (so nonnegative).

Note that trying to enforce strictly positive or strictly negative rather than nonnegative or nonpositive is problematic, since optimization models abhor strict inequalities. The only work around I know is to change "strictly positive" to "greater than or equal to $\epsilon$" for some strictly positive $\epsilon$, which creates holes in the domains of the variables (making values between 0 and $\epsilon$ infeasible).

Wednesday, September 12, 2018

Choosing "Big M" Values

I seem to bring up "big M" models a lot, so apologies if I end up repeating myself in places here. Not long ago, someone passed along highlights of a "big M" type model to me and asked if he could somehow reformulate to get rid of $M$. I did not see any good way to do it, but I did offer up suggestions about choosing values for $M$, and I thought that might make a decent blog post.

Just for clarity, what I'm referring to is an integer or mixed-integer program (I'll stick to linear objective and constraints) in which binary variables are used to, in loose terms, turn constraints on or off. So a representative constraint might look like the following:$$\sum_i a_i x_i \le b + M(1-y)$$with the $a_i$ coefficients, the $x_i$ variables (discrete, continuous or a mix), $y$ a binary variable and $M$ a "large" coefficient. Think of $M$ as a stunt double for infinity. The notion is that if $y=1$, the constraint is "turned on" and reduces to$$\sum_i a_i x_i \le b.$$If $y=0$, the constraint is "turned off" and reduces to$$\sum_i a_i x_i \le b + M \approx \infty,$$which should be satisfied by any choices for $x$ the solver might make. There are other variations of "big M" constraints, but I'll stick with this one for illustrative purposes.

The perils of $M$

Choosing a value for $M$ can be a tricky business. Suppose first that we choose $M$ small enough that, when $y=0$ and the constraint should be "off",$$\sum_i a_i x_i^* \gt b + M$$for some solution $x^*$ that should be optimal but now appears infeasible to the solver. The result is an incorrect solution. So let's refer to a value for $M$ as correct if it is large enough that no potentially optimal solution violates the constraint when $y=0$. ("Correct" is my term for this. I don't know if there is an "official" term.) Correctness is essential: choose an incorrect (too small) value for $M$ and you risk getting an incorrect solution.

Since it is frequently not obvious how large $M$ needs to be in order to be guaranteed correct, people tend to err on the side of caution by choosing really massive values for $M$. That brings with it a different set of problems (ones often ignored in introductory text books). First, branch-and-bound (or branch-and-cut) solvers tend to rely on the continuous relaxation of subproblems (dropping integrality constraints but keeping the functional constraints). Large values of $M$ make for weak relaxations (producing very loose bounds).

Suppose, for instance, that we are solving a model that both builds roads and routes traffic along those roads. $y$ represents the decision to build a particular road ($y=1$) or not ($y=0$). If we build the road, a cost $c$ is incurred (represented by the term $cy$ in the objective function) and we can send unlimited traffic along it; if not, there is no cost but no traffic is committed. In our constraint, the left side is traffic on the road and $b=0$, so traffic there can either be up to $M$ (if built) or 0 (if not built). Now suppose, for example, that the user chooses $M=10^9$ and the solver, in a continuous relaxation of some subproblem, sets $y=10^{-6}$. The solution to the relaxed problem pays only one millionth of $c$ for the privilege of allowing 1,000 units of traffic on this route, which basically allows it to "cheat" (and perhaps grossly underestimates the cost of any actual solution).

A related problem is limited arithmetic precision. Since double-precision floating-point arithmetic (the way the computer does arithmetic with real numbers) has limited precision, and is thus susceptible to rounding errors, solvers have to establish standards for "close enough to feasible" and "close enough to integer". Continuing the previous example, it is entirely possible that the solver might look at $y=10^{-6}$, decide that is within integrality tolerance, and treat it as $y=0$, possibly leading to what it thinks is an "optimal" solution with 1,000 units of traffic running over a road that does not exist. Oops.

Finally, note that $M$ is part of the coefficient matrix. Mixing very large coefficients (like $M$) with much smaller coefficients can create numerical instability, leading the solver to spend more time computing linear program pivots and possibly leading to totally erroneous solutions. That's too big a topic to get into here, but I'm pretty sure I've mentioned it elsewhere.

Despite all this, "big M" models are still very common in practice. There is a nice paper by Codato and Fischetti [1] that shows how a form of Bender's decomposition can be used to get rid of the $M$ coefficients. I've used it successfully (for instance, in [2]), but Bender's decomposition is an advanced technique (i.e., not for the faint of heart), and is not always supported by high-level modeling languages.

So, if we are stuck with "big M" models, what can we do to choose values of $M$ that are both correct and and tight (again, my term), meaning small enough that they avoid numerical problems and hopefully produce relaxations with bounds strong enough to do some good?

Not all $M$ are created equal

Most "big M" models have more than one constraint (and frequently many) containing a large coefficient $M$. One of my pet peeves is that authors of text books and journal articles will frequently just us $M$, with no subscripts, everywhere such a coefficient is needed. This, in turn, breeds a tendency for modelers to choose one large value for $M$ and use it everywhere.

Back when manuscripts were produced on typewriters, it was a bit of a pain in the ass to put in subscripts, so I can see how the trend would have started. (Quick tangent: Nate Brixius recently did a blog post with a picture of a typewriter, for those too young to be familiar with them. I'll link it here for the benefit of younger readers ... and also note that the one pictured is much more modern than the one I learned on, which was not electric.) Today, when people routinely use LaTeX to write manuscripts, there's not much excuse for omitting one measly subscript here and there. Anyway, my point is that it is usually better to use a different, customized value of $M$ for each constraint that needs one.

Customized how?

In some cases, model context will provide you an obvious choice for $M$. For example, suppose you are selecting warehouse locations and planning shipments from warehouses to customers. A typical "big M" constraint will look like the following (slightly different from our previous constraint but clearly related:$$\sum_j x_{ij} \le M_i y_i.$$Here variable $x_{ij}$ indicates the amount shipped from warehouse $i$ to customer $j$, binary variable $y_i$ is 1 if we use that warehouse and 0 if we do not, and the intent of the constraint is to say that if we do not use (and pay for) a warehouse, we cannot ship anything from it. One obvious choice for $M_i$ is the capacity of the warehouse. A better choice might be the smaller of that capacity or the maximum volume of customers demands that might plausibly be assigned to that warehouse. The latter might be based, for example, on knowing that the company would not ship anything to customers outside a certain distance from the warehouse.

In other cases, there may be some less obvious way to extract suitable (correct and hopefully tight) values for the $M_i$. A while back, I was working on mixed-integer programming models for two group discriminant analysis, and in one paper ([3]) I needed to pick values for $M$. Without going into gory details, I was able to normalize the coefficients of the (linear) discriminant function under construction and then compute valid coefficients $M_i$ by looking at the euclidean distances between pairs of observations. I don't claim that the $M_i$ I came up with were the tightest possible, but they were correct and they produced faster solution of the models than what I got with some arbitrarily large values I initially tried.

Finally, you can actually solve subproblems to get correct and (hopefully) tight $M$ values. Whether this is feasible depends on how many you need and how large and scary your model is. Going back to my first example, the trick would work something like this. Relax all integrality constraints. Drop the constraint in question. If you have any other "big M" constraints for which you have not yet computed tight values of $M$, pick something safely large for those coefficients, trying to avoid numerical problems but not worrying about tightness. Now switch the objective to maximizing $\sum_i a_i x_i - b$. The optimal objective value is your value for $M$. It is clearly correct: any feasible solution to the MIP model is a feasible solution to the LP relaxation, and so the value of $\sum_i a_i x_i - b$ cannot exceed your choice of $M$ (regardless of whether $y$ is 0 or 1). Repeat for each additional constraint containing a "big M" coefficient (switching from minimizing to maximizing if the nature of the constraint warrants it).

The process may be time-consuming, but at least you are solving LPs rather than MIPs. It is also a bit trickier than I made it sound, at least when multiple $M_i$ are involved. You have to guess values for any $M_i$ for which you have not yet solved, and guessing big values for them may result in a looser relaxation than necessary, which in turn may result in an inflated value for the $M_i$ you are currently choosing. You should definitely get correct choices for the $M$ coefficients; it's just the tightness that is in question. Even so, the values you get for the $M_i$ just might be tight enough to save you more time in the MIP solution than it costs to solve the LPs.


[1] Codato, G. and Fischetti, M. Combinatorial Benders' Cuts for Mixed-Integer Linear Programming. Operations Research, 2006, Vol. 54(4), pp. 756-766.

[2] Bai, L. and Rubin, P. A. Combinatorial Benders Cuts for the Minimum Tollbooth Problem. Operations Research, 2009, Vol. 57(6), pp. 1510-1522.

[3] Rubin, P. A. Heuristic Solution Procedures for a Mixed-Integer Programming Discriminant Model. Managerial and Decision Economics, 1990, Vol. 11(4), pp. 255-266.

Saturday, August 25, 2018

Adding Items to a Sequence

A question posed on OR-Exchange in 2017 asked the following: Given a tour of nodes, how does one best add two new nodes while respecting the ordering of the original tour. Specifically, the author began with a tour 0 - 1 - 2 - 4 - 6 - 0 (where node 0 is a depot) and wanted to add new stops 3 and 5 in such away that, in the revised tour, stop 1 still came before stop 2, stop 2 before stop 4, etc.

This problem can arise not just in vehicle routing but in many sorts of sequencing problems (such as scheduling jobs for production). Of course, preserving the original ordering to the extent possible is not always a concern, but it might be if, for instance, the existing stops are customers who have been promised somewhat general time windows for delivery. In any event, we'll just take the question as a given.

The answer I posted on OR-X made the somewhat charitable (and, in hindsight, unwarranted) assumption that the two new stops would be inserted by breaking two previous arcs, rather than consecutively (for instance, ... - 2 - 3 - 5 - 4 - ...). So I'll post an answer without that assumption here. In fact, I'll post three variants, one specific to the case of adding exactly two stops and the other two more general.

First, let me articulate some common elements. I'll denote the set of original nodes by $N_1$, the set of nodes to be added by $N_2$, and their union by $N=N_1 \cup N_2$. All three approaches will involve setting up integer programming models that will look for the most part like familiar routing models. So we will have binary variables $x_{ij}$ that will take the value 1 if $j$ immediately follows $i$ in the new tour. We will have constraints ensuring that every node is entered and exited exactly once:$$\sum_{j\in N} x_{ij} = 1\quad \forall i\in N\\ \sum_{i \in N} x_{ij} = 1 \quad\forall j\in N.$$The objective function will be some linear combination of the variables (sum of distances covered, sum of travel times, ...), which I will not worry about here, since it is no different from any sequencing model.

The first new wrinkle is that we do not define a variable for every pair of nodes. We create $x_{ij}$ only for the following combinations of subscripts: \begin{align*} i & \in N_{2},j\in N_{2},i\neq j\\ i & \in N_{1},j\in N_{2}\\ i & \in N_{2},j\in N_{1}\\ i & \in N_{1},j\in N_{1},(i,j)\in T \end{align*} where $T$ is the original tour. Thus, for example, we would have $x_{24}$ but not $x_{42}$, nor $x_{26}$. The rationale is straightforward: if we add an arc between two original nodes that were not successors on the original tour, we will force an order reversal. For instance, suppose we replace the arc 2 - 4 with, say, 2 - 6. Node 4 now must appear either before node 2 or after node 6, and either way the order has not been preserved.

Version 1

The first variant makes explicit use of the fact that we have only two new nodes. We add one subtour elimination constraint, to prevent the new nodes from forming a subtour: $x_{35}+x_{53}\le 1.$ Now consider how many different ways we could insert the two new nodes. First, we could break two links in the original tour, inserting 3 in the void where the first link was and 5 in the void where the second link was. Since the original tour had five links there are $\binom{5}{2}=10$ distinct ways to do this. Similarly, we could break two links but insert 5 first and 3 later. There are again ten ways to do it. Finally, we could break one link and insert either 3 - 5 or 5 - 3 into the void. With five choices of the link to break and two possible orders, we get another ten results, for a grand total of 30 possibly new tours.

With that in mind, consider what happens if node 3 is inserted after original node $i$, breaking the link between $i$ and its original successor $j$. (In our model, this corresponds to $x_{i3}=1$.) If this is a single node insertion, then we should have $j$ follow node 3 ($x_{3j}=1$). If it is a double insertion ($i$ - 3 - 5 - $j$), we should have $x_{35}=x_{5j}=1$. We can capture that logic with a pair of constraints for each original arc:
\[ \left.\begin{aligned}x_{i3}-x_{3j} & \le x_{35}\\ x_{i3}-x_{3j} & \le x_{5j} \end{aligned} \right\} \forall(i,j)\in T. \] We could do the same using node 5 in place of node 3, but it is unnecessary. If node 3 is correctly inserted by itself, say between $i$ and $j$, and node 5 is inserted after original node $h$, then the original successor $k$ of $h$ needs a new predecessor. That predecessor cannot be $h$, nor can it be any other original node (given our reduced set of variables), nor can it be node 3 (which now precedes $j$). The only available predecessor is 5, giving us $h$ - 5 - $k$ as expected.

You might wonder how this accommodates a 5 - 3 insertion, say after node $i$. The original successor $j$ of $i$ needs a new predecessor, and 3 is the only eligible choice, so we're good.

I tested this with a small Java program, and it did in fact find all 30 valid revised tours (and no invalid ones).

Version 2

Version 2, which can be applied to scenarios with any number of new nodes, involves building a standard sequencing model with subtour elimination constraints. The only novel element is the reduced set of variables (as described above). A blog is no place to explain sequencing models in their full glory, so I'll just assume that you, the poor suffering reader, already know how.

Version 3

In version 3, we again build a sequencing model with the reduced set of variables, but this time we use the Miller-Tucker-Zemlin method of eliminating subtours rather than adding a gaggle of subtour elimination constraints. The MTZ approach generally results in smaller models (since the number of subtours, and hence the potential number of subtour constraints, grows combinatorially with the number of nodes), but also generally produces weaker relaxations.

The Wikipedia page for the TSP shows the MTZ constraints, although for some reason without labeling them as such. Assume a total of $n$ nodes (with consecutive indices), with node $0$ being the depot. The MTZ approach adds continuous variables $u_i, \,i\in \{1,\dots,n\}$ with bounds $0\le u_i \le n-1$. It also adds the following constraints for all eligible arcs $(i,j)$ with $i\neq 0$:$$u_i - u_j + n x_{ij} \le n-1.$$You can think of the $u_i$ variables as counters. The MTZ constraints say that if we go from any node $i$ (other than the depot) to any node $j$ (including the depot), the count at node $j$ has to be at least one larger than the count at node $i$. These constraints preclude any subtours, since a subtour (one starting and ending any place other than the depot) would result in the count at the first node of the subtour being larger than itself.

As I mentioned, the MTZ formulation has a somewhat weaker LP relaxation than a formulation with explicit subtour elimination constraints, so it is not favored by everyone. In our particular circumstance, however, it has an additional virtue: it gives us a relatively painless way to enforce the order preservation requirement. All we need do is insert constraints of the form$$u_j \ge u_i + 1\quad\forall (i,j)\in T.$$This forces the counts at the original nodes to increase monotonically with the original tour order, without directly impacting the counts at the new nodes.

Tuesday, July 31, 2018

NP Confusion

I just finished reading a somewhat provocative article on the CIO website, titled "10 reasons to ignore computer science degrees" (when hiring programmers). While I'm not in the business of hiring coders (although I recent was hired as a "student programmer" on a grant -- the Universe has a sense of humor), I find myself suspecting that the author is right about a few points, overstating a few and making a few that are valid for some university CS programs but not for all (or possibly most). At any rate, that's not why I mention it here. What particularly caught my eye was the following paragraph:
It’s rare for a CS major to graduate without getting a healthy dose of NP-completeness and Turing machines, two beautiful areas of theory that would be enjoyable if they didn’t end up creating bad instincts. One biologist asked me to solve a problem in DNA sequence matching and I came back to him with the claim that it was NP-complete, a class of problems that can take a very long time to solve. He didn’t care. He needed to solve it anyway. And it turns out that most NP-complete problems are pretty easy to solve most of the time. There are just a few pathological instances that gum up our algorithms. But theoreticians are obsessed with the thin set that confound the simple algorithms, despite being rarely observed in everyday life.
Any of my three regular readers will know that I periodically obsess/carp about NP-fixation, so I'm sympathetic to the tenor of this. At the same time, I have a somewhat mixed reaction to it.
  • "... NP-complete, a class of problems that can take a very long time to solve." This is certainly factually correct, and the author thankfully said "can" rather than "will". One thing that concerns me in general, though, is that not everyone grasps that problems in class P, for which polynomial time algorithms are known, can also take a very long time to solve. One reason, of course, is that "polynomial time" means run time is a polynomial function of problem size, and big instances will take longer. Another is that $p(n)=n^{1000}$ is a polynomial ... just not one you want to see as a (possibly attainable) bound for solution time. There's a third factor, though, that I think many people miss: the size of the coefficients (including a constant term, if any) in the polynomial bound for run time. I was recently reading a description of the default sorting algorithm in a common programming language. It might have been the one used in the Java collections package, but don't quote me on that. At any rate, they actually use two different sorting algorithms, one for small collections (I think the size cutoff might have been around 47) and the other for larger collections. The second algorithm has better computational complexity, but each step requires a bit more work and/or the setup is slower, so for small collections the nominally more complex algorithm is actually faster.
  • "He didn’t care. He needed to solve it anyway." I love this. It's certainly true that users can ask coders (and modelers) for the impossible, and then get snippy when they can't have it, but I do think that mathematicians (and, apparently, computer scientists) can get a bit too locked into theory. <major digression> As a grad student in math, I took a course or two in ordinary differential equations (ODEs), where I got a taste for the differences between mathematicians and engineers. Hand a mathematician an ODE, and he first tries to prove that it has a solution, then tries to characterize conditions under which the solution is unique, then worries about stability of the solution under changes in initial conditions or small perturbations in the coefficients, etc., ad nauseum. An engineer, faced with the same equation, tries to solve it. If she finds the solution, then obviously one exists. Depending on the nature of the underlying problem, she may or may not care about the existence of multiple solutions, and probably is not too concerned about stability given changes in the parameters (and maybe not concerned about changes in the initial conditions, if she is facing one specific set of initial conditions). If she can't solve the ODE, it won't do her much good to know whether a solution exists or not.</major digression> At any rate, when it comes to optimization problems, I'm a big believer in trying a few things before surrendering (and trying a few optimization approaches before saying "oh, it's NP-hard, we'll have to use my favorite metaheuristic").
  • "And it turns out that most NP-complete problems are pretty easy to solve most of the time. There are just a few pathological instances that gum up our algorithms." I find this part a bit misleading. Yes, some NP-complete problems can seem easier to solve than others, but the fundamental issue with NP-completeness or NP-hardness is problem dimension. Small instances of problem X are typically easier to solve than larger instances of problem X (although occasionally the Universe will mess with you on this, just to keep you on your toes). Small instances of problem X are likely easier to solve than large instances of problem Y, even if Y seems the "easier" problem class. Secondarily, the state of algorithm development plays a role. Some NP-complete problem classes have received more study than others, and so we have better tools for them. Bill Cook has a TSP application for the iPhone that can solve what I (a child of the first mainframe era) would consider to be insanely big instances of the traveling salesman problem in minutes. So, bottom line, I don't think a "few pathological instances" are responsible for "gum[ming] up our algorithms". Some people have problem instances of a dimension that is easily, or at least fairly easily, handled. Others may have instances (with genuine real-world application) that are too big for our current hardware and software to handle. That's also true of problems in class P. It's just that nobody ever throws their hands up in the air and quits without trying because a problem belongs to class P.
In the end, though, the article got me to wondering two things: how often are problems left unsolved (or solved heuristically, with acceptance of a suboptimal final solution), due to fear of NP-completeness; and (assuming that's an actual concern), would we be better off if we never taught students (other than those in doctoral programs destined to be professors) about P v. NP, so that the applications programmers and OR analysts would tackle the problems unafraid?

Tuesday, July 17, 2018

Selecting Box Sizes

Someone posted an interesting question about box sizes on Mathematics Stack Exchange. He (well, his girlfriend to be precise) has a set of historical documents that need to be preserved in boxes (apparently using a separate box for each document). He wants to find a solution that minimizes the total surface area of the boxes used, so as to minimize waste. The documents are square (I'll take his word for that) with dimensions given in millimeters.

To start, we can make a few simplifying assumptions.
  • The height of a box is not given, so we'll assume it is zero, and only consider the top and bottom surfaces of the box. For a box (really, envelope) with side $s$, that makes the total area $2s^2$. If the boxes have uniform height $h$, the area changes to $2s^2 + 4hs$, but the model and algorithm I'll pose are unaffected.
  • We'll charitably assume that a document with side $s$ fits in a box with side $s$. In practice, of course, you'd like the box to be at least slightly bigger, so that the document goes in and out with reasonable effort. Again, I'll let the user tweak the size formula while asserting that the model and algorithm work well regardless.
The problem also has three obvious properties.
  • Only document sizes need be considered as box sizes, i.e. for every selected size at least one document should fit "snugly".
  • The number of boxes you need at each selected size equals the number of documents too large to fit in a box of the next smaller selected size but capable of fitting in a box of this size.
  • You have to select the largest possible box size (since that is required to store the largest of the documents).
What interests me about this problem is that it can be a useful example of Maslow's Hammer: if all you have is a hammer, every problem looks like a nail. As an operations researcher (and, more specifically, practitioner of discrete optimization) it is natural to hear the problem and think in terms of general integer variables (number of boxes of each size), binary variables (is each possible box size used or not), assignment variables (mapping document sizes to box sizes) and so on. OR consultant and fellow OR blogger Erwin Kalvelagen did a blog post on this problem, laying out several LP and IP formulations, including a network model. I do recommend your reading it and contrasting it to what follows.

The first thought that crossed my mind was the possibility of solving the problem by brute force. The author of the original question supplied a data file with document dimensions. There are 1166 documents, with 384 distinct sizes. So the brute force approach would be to look at all $\binom{383}{2} = 73,153$ or $\binom{383}{3} = 9,290,431$ combinations of box sizes (in addition to the largest size), calculate the number of boxes of each size and their combined areas, and then choose the combination with the lowest total. On a decent PC, I'm pretty sure cranking through even 9 million plus combinations will only need a tolerable amount of time.

A slightly more sophisticated approach is to view the problem through the lens of a layered network. There are either three or four layers, representing progressively larger selected box sizes, plus a "layer 0" containing a start node. In the three or four layers other than "layer 0", you put one node for each possible box size, with the following restrictions:
  • the last layer contains only a single node, representing the largest possible box, since you know you are going to have to choose that size;
  • the smallest node in each layer is omitted from the following layer (since layers go in increasing size order); and
  • the largest node in each layer is omitted from the preceding layer (for the same reason).
Other than the last layer (and the zero-th one), the layers here will contain 381 nodes each if you allow four box sizes and 382 if you allow three box sizes. An arc connects the start node to every node in the first layer, and an arc connects every node (except the node in the last layer) to every node in the next higher layer where the head node represents a larger size box than the tail node. The cost of each arc is the surface area for a box whose size is given by the head node, multiplied by the number of documents too large to fit in a box given by the tail node but small enough to fit in a box given by the head node.

I wanted to confirm that the problem is solvable without special purpose software, so I coded it in Java 8. Although there are plenty of high quality open-source graph packages for Java, I wrote my own node, arc and network classes and my own coding of Dijkstra's shortest path algorithm just to prove a point about not needing external software. You are welcome to grab the source code (including the file of document sizes) from my Git repository if you like.

I ran both the three and four size cases and confirmed that my solutions had the same total surface areas that Erwin got, other than a factor of two (I count both top and bottom; he apparently counts just one of them). How long does it take to solve the problem using Dijkstra's algorithm? Including the time reading the data, the four box version takes about half a second on my decent but not workstation caliber PC. The three box version takes about 0.3 seconds, but of course gives a worse solution (since it is more tightly constrained). This is single-threaded, by the way. Both problem set up and Dijkstra's method are amenable to parallel threading, but that would be overkill given the run times.

So is it wrong to take a fancier modeling approach, along the lines of what Erwin did? Not at all. There are just trade-offs. The modeling approach produces more maintainable code (in the form of mathematical models, using some modeling language like GAMS or AMPL) that are also more easily modified if the use case changes. The brute force and basic network approaches I tried requires no extra software (so no need to pay for it, no need to learn it, ...) and works pretty well for a "one-off" situation where maintainability is not critical.

Mainly, though, I just wanted to make a point that we should not overlook simple (or even brute force) solutions to problems when the problem dimensions are small enough to make them practical ... especially with computers getting more and more powerful each year.

Friday, July 6, 2018

Mint 19 Upgrade: Adventures #1-3

I use my laptop as the "canary in the coal mine" when it comes to do operating system upgrades, since there's nothing awesomely important on it. So today I tried upgrading from Linux Mint 18.3 to 19.0. Note that I used the upgrade path, rather than downloading the installer, burning it to a bootable disk, then installing from there. In hindsight, that might have been the faster approach. The upgrade took over an hour, and that's before any debugging.

The case of the not-so-missing library file

I hit the first of what will no doubt be several adventures when I reinstalled RStudio desktop and discovered it would not run. Despite the installer saying that all dependencies were satisfied, when I tried to run it from a command line I was told that a library file ( could not be found.

I'll skip over another hour or so of pointless flailing and cut to the chase scene. It turns out that actually was installed on my laptop, as part of the libgl1-mesa-glx package. It was hiding in plain sight in /usr/lib/x86_64-linux-gnu/mesa/. Somehow, that folder had not made it onto the system library path. (I have no idea why.) So I ran the command

sudo ldconfig /usr/lib/x86_64-linux-gnu/mesa

and that fixed the problem.

Editor? We don't need no stinkin' editor

Next up, I couldn't find a text editor! Note that LibreOffice was installed, and was the default program to open text (.txt) files. Huh?? Poking around, I found nano, but xed (the default text editor in Mint 18) and gedit (the previous default editor) were not installed (even though xed was present before the upgrade).

Fixing this was at least (to quote a math prof I had in grad school) "tedious but brutally straightforward". In the software manager, I installed xed ... and xreader, also MIA. For whatever reason, the other X-Apps (xviewer, xplayer and pix) were already installed (as they all should have been).

The mystery of the launcher that wouldn't launch

Mint has a utility (mintsources) that lets you manage the sources (repositories, PPAs etc.) that you use. There is an entry for it in the main menu, but clicking that entry failed to launch the source manager. On the other hand, running the command ("pkexec mintsources") from a terminal worked just fine.

I found the original desktop file at /usr/share/applications/mintsources.desktop (owned by root, with read and write permissions but not execute permission). After a bunch of messing around, I edited the menu entry through the menu editor (by right-clicking the menu entry and selecting "Edit properties"), changing "pkexec mintsources" to "gksudo mintsources". That creating another version at ~/.local/share/applications/mintsources.desktop. After right-clicking the main menu button and clicking "Reload plugins", the modified entry worked. I have no idea why that works but "pkexec mintsources" does not, even though it does from a terminal. I tried editing back to "pkexec", just in case the mere act of editing was what did the trick, but no joy there. So I edited back to "gksudo", which seems to be working ... for now ... until the gremlins return from their dinner break.

Update: No sooner did I publish this than I found another instance of the same problem. The driver manager would not launch from the main menu. I edited "pkexec" to "gksudo" for that one, and again it worked. I guess "pkexec" is somehow incompatible with the Mint menu (at least on my laptop).

I'll close for now with a link to "Solutions for 24 bugs in Linux Mint 19".

Thursday, July 5, 2018

Firefox Ate My Bookmarks

This morning, I "upgraded" Firefox 60.0.2 to 61.0.0 on my desktop computer (Linux Mint). When I started the new version, it came to life with the correct default tabs and pages, no menu bar (my reference), and with the bookmark tool bar visible ... but completely empty. Toggling the menu option to display it was unproductive. I restored the most recent backup of the bookmarks, but the tool bar remained empty.

So I killed Firefox, started it in safe mode (no change), then killed it again and restarted it normally. This time the bookmark tool bar was populated with the correct bookmarks and folders. (I don't know if passing through safe mode was necessary. Maybe it just needed another restart after the restoration operation.) Unfortunately, my problems were not yet over. Although I had the correct top-level stuff in the bookmark tool bar, the various folders only had about three items each, regardless of how many were supposed to be in each folder (and, trust me, it was typically more than three).

When you go to restore bookmarks in Firefox, it will show you a list of backup files (I think it keeps the fifteen most recent) and how many items each contains. My recent backups were all listed with 18 to 21 items. Fortunately, I also have Firefox (not yet upgraded) on my laptop (running the same version of Linux Mint), with the same bookmarks. On the laptop, recent backups have 444 items. So either the upgrade messed up the backup files or Firefox 61.0.0 has trouble reading backups from version 60. Heck, maybe Firefox 60 screwed up making the automatic backups on my desktop (but, somehow, not on my laptop).

The laptop proved my savior. I manually backed up the bookmarks on it to a file, parked that file on Dropbox just in case, copied it to the desktop and manually restored it. For the moment, at least, I have all my bookmarks back.

In case you're reading this because you're in the same boat, here are the steps to do manual backups. Of course, this will only help if you have your bookmarks intact somewhere. If you're thinking of upgrading Firefox but haven't pulled the trigger yet, you might want to make a manual backup for insurance.

Start with the "hamburger menu" (the button three horizontal parallel lines). From there, click the "Library" option, then "Bookmarks", then "Show All Bookmarks" (at the very bottom). That opens up a window titled "Library". Click the "Import and Backup" drop-down menu, then either "Backup" or "Restore" depending on your intent. Backup will give you a typical file saving dialog. Restore will give you a list of your recent backups and an option at the bottom to select a file. Use that option to navigate to a manual backup.

Once again, software saves me from having a productive morning. :-(

By the way, this bug has already been reported:

Tuesday, July 3, 2018

Usefulness of Computer Science: An Example

I thought I would follow up on my June 29 post, "Does Computer Science Help with OR?", by giving a quick example of how exposure to fundamentals of computer science recently helped me.

A current research project involves optimization models containing large numbers of what are basically set covering constraints, constraints of the form $$\sum_{i\in S} x_i \ge 1,$$ where the $x_i$ are binary variables and $S$ is some subset of the set of all possible indices. The constraints are generated on the fly (exactly how is irrelevant here).

In some cases, the same constraint may be generated more than once, since portions of the code run in parallel threads. Duplicates need to be weeded out before the constraints are added to the main integer programming model. Also, redundant constraints may be generated. By that, I mean we may have two cover constraints, summing over sets $S_1$ and $S_2$, where $S_1 \subset S_2$. When that happens, the first constraint implies the second one, so the second (weaker) constraint is redundant and should be dropped.

So there comes a "moment of reckoning" where all the constraints generated by all those parallel threads get tossed together, and duplicate or redundant ones need to be weeded out. That turns out to be a rather tedious, time-consuming operation, which brings me to how the constraints are represented. I'm coding in Java, which has various implementations of a Set interface to represent sets. The coding path of least resistance would be to toss the indices for each constraint into some class implementing that interface (I generally gravitate to HashSet). The Set interface defines an equals() method to test for equality and a containsAll() method to test whether another set is a subset of a given set. So this would be pretty straightforward to code.

The catch lies in performance. I have not found it documented anywhere, but I suspect that adding elements to a HashSet is $O(n)$ while checking subset status or equality is $O(n^2)$, where $n$ is the number of possible objects (indices). The reason I say $O(n^2)$ for the latter two operations is that, in the worst case, I suspect that Java takes each object from one subset and compares it to every object in the other set until it finds a match or runs out of things to which to compare. That means potentially $O(n)$ comparisons for each of $O(n)$ elements of the first set, getting us to $O(n^2)$.

A while back, I took the excellent (and free) online course "Algorithms, Part 1", offered by a couple of faculty from my alma mater Princeton University. I believe it was Robert Sedgewick who said at one point (and I'm paraphrasing here) that sorting is cheap, so if you have any inkling it might help, do it. The binary variables in my model represent selection or non-selection of a particular type of object, and I assigned a complete ordering to them in my code. By "complete ordering" I mean that, given two objects $i$ and $j$, I can tell (in constant time) which one is "preferable". Again, the details do not matter, nor does the plausibility (or implausibility) of the order I made up. It just matters that things are ordered.

So rather than just dump subscripts into HashSets, I created a custom class that stores them in a TreeSet, a type of Java set that maintains sort order using the ordering I created. The custom class also provides some useful functions. One of those functions is isSubsetOf(), which does pretty much what it sounds like: A.isSubsetOf(B) returns true if set $A$ is a subset of set $B$ and false if not.

In the isSubsetOf() method, I start with what are called iterators for the two sets $A$ and $B.$ Each starts out pointing to the smallest member of its set, "smallest" defined according to the ordering I specified. If the smallest member of $B$ is bigger than the smallest member of $A$, then the first element of $A$ cannot belong to $B$, and we have our answer: $A\not\subseteq B$. If the smallest element of $B$ is smaller than the smallest element of $A$, I iterate through element of $B$ until either I find a match to the smallest element of $A$ or run out of elements of $B$ (in which case, again, $A\not\subseteq B$). Suppose I do find a match. I bump the iterator for $A$ to find the second smallest element of $A$, then iterate through subsequent members of $B$ (picking up where I left off in $B$, which is important) until, again, I get a match or die trying. I keep doing this until I get an answer or run out of elements of $A$. At that point, I know that $A\subseteq B$.

What's the payoff for all this extra work? Since I look at each element of $A$ and each element of $B$ at most once, my isSubstOf() method requires $O(n)$ time, not $O(n^2)$ time. Using a TreeSet means the contents of each set have to be sorted at the time of creation, which is $O(n\log n)$, still better than $O(n^2)$. I actually did code it both ways (HashSet versus my custom class) and timed them on one or two moderately large instances. My way is in fact faster. Without having a bit of exposure to computer science (including the Princeton MOOC), though, it would never have occurred to me that I could speed up what was proving to be a bottleneck in my code.

Friday, June 29, 2018

Does Computer Science Help with OR?

Fair warning: tl/dr.

After reading a blog post yesterday by John D. Cook, "Does computer science help you program?", I decided to throw in my two cents (convert to euros at your own risk) on a related topic: does computer science (which I will extend to including programming) help you as an OR/IE/management science/analytics professional? What follows is a mix of opinion and experience, and hence cannot be tested rigorously. (In other words, it's a blog post.)


The obvious starting point is programming. Do you need to know how to program to work in OR (and related fields -- to cut down on typos, I'm going to use "OR" as an umbrella acronym)? In many (most?) cases, yes! Some people with boring jobs may only have to shove data into a statistics program with a graphical user interface and push buttons, or write optimization models in a high level language (technically a form of coding, but I'll discount that) and summon the optimizer, but many will have to do tasks either too complicated for high level development environments, too quirky, or just not suited to one. Unless you magically find yourself endowed with limitless programming support, sooner or later (most likely sooner) you are likely to need to do some coding. (Note to faculty: even if you have hot and cold running students, don't assume that means adequate programming support. I taught simulation coding to doctoral students in a previous life. One of them wrote a program that gave "spaghetti code" a whole new meaning. It looked like a street map of downtown Rome after an earthquake.)

I've done a ton of coding in my time, so perhaps I'm a bit biased. Still, I cannot recall the last OR project I did (research, application, teaching tool or whatever) that did not involve significant coding.


John D. Cook once did a blog post (I don't have the link at hand) about how many languages an analyst or applied mathematician needed to know. I forget the details, but the gist was "one for every type of programming task you do". So here's my toolkit:
  • one scripting language, for quick or one-off tasks (R for me; others may prefer Python or whatever);
  • one language suited for data manipulation/analysis/graphic (R again for me);
  • one language for "substantial" computational tasks (Java for me, C++ for a lot of people, Julia for some recent adopters, MATLAB or equivalent for some people, ...); and
  • one language for dealing with databases (SQL for me, although "SQL" is like saying "Indian" ... there are a lot of "dialects").
In case you're wondering how the first two bullets differ (since I use R for both), here's a recent example of the first bullet. A coauthor-to-be and I received a data set from the author of an earlier paper. One file was a MATLAB script with data embedded, and another looked like the output of a MATLAB script. We needed to extract the parts relevant to our work from both, smush them together in a reasonable way, and reformulate the useful parts to the file format our program expects. That does not exactly call for an object-oriented program, and using a script allowed me to test individual bits and see if they did what I expected (which was not always the case).

Parallel computation

I went a long time knowing hardly anything about this, because I was using earlier computing devices where parallelism was out of the question. Today, though, it is pretty easy to find yourself tackling problems where parallel threads or parallel processes are hard to avoid. This includes, but is not limited to, writing programs with a graphical user interface, where doing heavy number crunching in the main thread will freeze the GUI and seriously frustrate the user. I just finished (knock on virtual wood) a Java program for a research project. During the late stages, while tweaking some code related to a heuristic that runs parallel threads and also uses CPLEX, I managed to (a) fill up main memory once (resulting in disk thrashing and essentially paralyzing not only the program interface but the operating system interface ... meaning I couldn't even kill the program) and (b) crash the Java virtual machine (three times, actually). So, trust me, understanding parallel processing can really be important.

"Theoretical" Computer Science

This is what John Cook was after in his informal poll. Here, my answer is that some parts are very useful, others pretty much not useful at all, and maybe some stuff in between.

Data structures

As a graduate student (in math, on the quarter system), I took a three course sequence required for junior year computer science majors. One course concentrated on how to write operating systems. It's perhaps useful to have a basic knowledge of what an operating system does, but I'm pretty sure an OR person can get that from reading computer magazines or something, without taking a course in it. I've totally forgotten what another one of the courses covered, which suggests that maybe it was not crucial to my professional life.

The third course focused on data structures, and while much of what I learned has probably become obsolete (does anyone still use circular buffers?), it's useful both to know the basics and to understand some of the concerns related to different data structures. More than once, while hacking Java code, I've had to give some thought to whether I would be doing a lot of "give me element 7" (which favors array-like structures) versus "give me the smallest element" (favoring tree-like structures) versus "the same thing is going to get added a zillion times, and you only need to keep one copy" (set interfaces, built over who knows what kind of structure).


Another computer science topic you need to know is computational complexity, for a variety of reasons. The first is that, at least if you work in optimization, you will be bombarded by statements about how this problem or that is NP-ridiculous, or this algorithm or that is NP-annoying. (The latter is total b.s. -- NP-hardness is a property of the problem, not the algorithm. Nonetheless, I occasionally see statements like that.) It's important to know both what NP-hardness is (a sign that medium to large problem instances might get pretty annoying) and what it is not (a sign of the apocalypse, or an excuse for skipping the attempt to get an exact solution and immediately dragging out your favorite metaheuristic). You should also understand what a proof of NP-hardness entails, which is more than just saying "it's an integer program".

Beyond NP silliness, though, you need to understand what $O(\dots)$ complexity means, and in particular the fact that an $O(n^2)$ algorithm is slower than an $O(n \log(n))$ algorithm for large $n$, but possibly faster for small $n$. This can help in choosing alternative ways to do computational tasks in your own code.

Finite precision

This may turn up in a numerical analysis class in a mathematics program, or in a computer science course, but either way you really need to understand what rounding and truncation error are, how accurate double-precision floating point arithmetic is, etc. First, this will help you avoid embarrassment by asking questions like "Why does the solver say my 0-1 variable is 0.999999999999, which is clearly an error?". Second, it will give you an appreciation for why doing things with "stiff" matrices can create remarkably goofy results from very straightforward looking problems. That, in turn, will help you understand why "big M" methods are both a boon and a curse in optimization.

I may have more to say on this at a later date, but right now my computer is running out of electrons, so I'll quit here.

Tuesday, June 19, 2018

Callback Cuts That Repeat

The following post is specific to the CPLEX integer programming solver. I have no idea whether it applies to other solvers, or even which other solver have cut callbacks.

Every so often, a user will discover that a callback routine they wrote has "rediscovered" a cut it previously generated. This can be a bit concerning at first, for a few reasons: it seems implausible, and therefore raises concerns of a bug someplace; it represents repeated work, and thus wasted effort; and it suggests at least the possibility of getting stuck in a loop, the programmatic equivalent of "Groundhog Day". (Sorry, couldn't resist.) As it happens, though, repeating the same cut can legitimately happen (though hopefully not too often).

First, I should probably define what I mean by callbacks here (although I'm tempted to assume that if you don't know what I mean, you've already stopped reading). If you want to get your geek on, feel free to wade through the Wikipedia explanation of a callback function. In the context of CPLEX, what I'm referring to is a user-written function that CPLEX will call at certain points in the search process. I will focus on functions called when CPLEX is generating cuts to tighten the bound at a given node of the search tree (a "user cut callback" in their terminology) or when CPLEX thinks it has identified an integer-feasible solution better than the current incumbent (a "lazy constraint callback"). That terminology pertains to versions of CPLEX prior to 12.8, when those would be two separate (now "legacy") callbacks. As of CPLEX version 12.8, there is a single ("generic") type of callback, but generic callbacks continue to be called from multiple "contexts", including those two (solving a node LP, checking a new candidate solution).

The purpose here is to let the user either generate a bound-tightening constraint ("user cut") using particular knowledge of the problem structure, or to vet a candidate solution and, if unsuitable, issue a new constraint ("lazy constraint") that cuts off that solution, again based on problem information not part of the original IP model. The latter is particularly common with decomposition techniques such as Benders decomposition.

So why would the same user cut or lazy constraint be generated more than once (other than due to a bug)? There are at least two, and possibly three, explanations.

Local versus Global

A user cut can be either local or global. The difference is whether the cut is valid in the original model ("global") or whether it is contingent on branching decisions that led to the current node ("local"). The user can specify in the callback whether a new cut is global or local. If the cut is specified as local, it will be enforced only at descendants of the current node.

That said, it is possible that a local cut might be valid at more than one node of the search tree, in which case it might be rediscovered when those other nodes are visited. In the worst case, if the cut is actually globally valid but for some reason added with the local flag set, it may be rediscovered quite a few times.

Parallel Threading

On a system with multiple cores (or multiple processors), using parallel threads can speed CPLEX up. Parallel threads, though, are probably the most common cause for cuts and lazy constraints repeating.

The issue is that threads operate somewhat independently and only synchronize periodically. So suppose that thread A triggers a callback that generates a globally valid user cut or lazy constraint, which is immediately added to the problem A is solving. Thread B, which is working on a somewhat different problem (from a different part of the search tree), is unaware of the new cut/constraint until it reaches a synchronization point (and finds out what A and any other sibling threads have been up to). Prior to that, B might stumble on the same cut/constraint. Since A and B are calling the same user callback function, if the user is tracking what has been generated inside that function, the function will register a repetition. This is normal (and harmless).

Cut Tables

This last explanation is one I am not entirely sure about. When cuts and lazy constraints are added, CPLEX stores them internally in some sort of table. I believe that it is possible in some circumstances for a callback function to be called before the table is (fully) checked, in which case the callback might generate a cut or constraint that is already in the table. Since this deals with the internal workings of CPLEX (the secret sauce), I don't know first-hand if this is true or not ... but if it is, it is again slightly wasteful of time but generally harmless.

User Error

Of course, none of this precludes the possibility of a bug in the user's code. If, for example, the user reacts to a candidate solution with a lazy constraint that is intended to cut off that solution but does not (due to incorrect formulation or construction), CPLEX will register the user constraint, notice that the solution is still apparently valid, and give the callback another crack at it. (At least that is how it worked with legacy callbacks, and I think that is how it works with generics.) Seeing the same solution, the user callback might generate the same (incorrect) lazy constraints, and off the code goes, chasing its own tail.

Wednesday, May 30, 2018

Getting Videos to Play in Firefox

I think I've solved a lingering problem I've had playing certain videos in Firefox, and I'm posting the details here mostly so I don't forget them and partly in case anybody else is tripping over this.

I will periodically land on a web page with a video at the top. Usually I don't want to play the video, but sometimes I do ... in which case I have mixed results getting the video to play in Firefox. Typically Chrome has no such problem with the page, which makes sense given that the issue in Firefox relates to a couple of browser extensions I use. What made diagnosing the problem tricky was that (a) results were inconsistent (some videos played, some didn't) and (b) the problem resulted from a combination of two extensions, not just one particular one.

They symptom was that, when I clicked the play button on a video, the player box would turn into a black screen with a spinner that would spin indefinitely, until an error message popped up saying that the video player had hit a time limit waiting for the video to load. Again, this happened on some videos but not others. In particular, I never had a problem with a YouTube video. It was at least consistent in that a video that triggered the error would always trigger the error, regardless of page reloads etc.

The two Firefox extensions involved are HTTPS Everywhere (which tries to force page contents to load using the more secure HTTPS protocol than the ordinary HTTP protocol) and Disable HTML5 Autoplay (which prevents videos from automatically starting to play). The first extension is a security enhancement. (For an opinion piece on why HTTPS is important, read "Why HTTPS Matters".) Unfortunately, many web sites either do not deploy HTTPS for some content or screw up their site configuration. Regarding the second extension, I find video autoplay to be rather annoying, since it frequently involves ads or other videos that I have no interest in, and forces me to play whack-a-mole to shut them the bleep up.

I'm loathe to give up either extension, and fortunately I don't have to. What works for me is a combination of two tweaks. On a site where I get problem videos ( is the main source, in my case), I click the toolbar button for the autoplay extension, leave "Disable Autoplay" selected, but deselect "Disable Preloading". That only needs to be done once per site. With a page open containing a problem video, I then disable HTTPS Everywhere (again, by clicking its toolbar button and deselecting the first option). That should automatically cause the page to reload, and the video will play properly. After I'm done watching, I just reenable HTTPS Everywhere. This part has to be repeated for each page containing a video that will not load via HTTPS, but it's a price I'm willing to pay to preserve security.

Tuesday, May 15, 2018

Grouping Rows of a Matrix

I spent a large chunk of yesterday afternoon doing something I thought would be simple (relatively speaking) in LaTeX. I wanted to group rows of a matrix (actually, in my case, a vector) with right braces, and label the groups. An example of what I wanted is in the image below.

This seems to me to be a fairly common thing to do, and LaTeX has been around over 35 years (TeX even longer), so by now it must be pretty easy. Right? Um, not so much. I wore out Google looking for packages that would do this. Curiously, it's easy to put braces over and under things:
  • $\overbrace{[x_1,\dots,x_n]}$ [\overbrace{[x_1,\dots,x_n]}];
  • $\underbrace{[x_1,\dots,x_n]}$ [\underbrace{[x_1,\dots,x_n]}].
There are packages to let you surround matrices, arrays etc. with a variety of delimiters (not just parentheses or square brackets). Nowhere, though, could I find a command or package to do the above.

Fortunately, something pointed me in the direction of the PGF/TiKZ package, which I've used in the past for doing drawings. It's an incredible tool in terms of both what it can do and the outstanding quality of its manual. Because it does so many things, I've never really gotten to know all its capabilities, and in particular its ability to do matrices in a picture environment.

Here is the code to do my illustration. You need to load the TiKZ package and two of its libraries in your document preamble, as follows:

\usetikzlibrary{matrix, decorations.pathreplacing}

The code for the drawing is:

 \matrix (vec) [matrix of math nodes, left delimiter = {[}, right delimiter = {]}] {
f_1 \\
\vdots \\
f_{a} \\
f_{a + 1} \\
\vdots \\
f_{b} \\
f_{b + 1} \\
\vdots \\
f_{c} \\
\node (a) at (vec-1-1.north) [right=20pt]{};
\node (b) at (vec-3-1.south) [right=20pt]{};
\node (c) at (vec-4-1.north) [right=20pt]{};
\node (d) at (vec-6-1.south) [right=20pt]{};
\node (e) at (vec-7-1.north) [right=20pt]{};
\node (f) at (vec-9-1.south) [right=20pt]{};
\draw [decorate, decoration={brace, amplitude=10pt}] (a) -- (b) node[midway, right=10pt] {\footnotesize something};
\draw [decorate, decoration={brace, amplitude=10pt}] (c) -- (d) node[midway, right=10pt] {\footnotesize something else};
\draw [decorate, decoration={brace, amplitude=10pt}] (e) -- (f) node[midway, right=10pt] {\footnotesize something silly};

The name of the matrix ("vec") is arbitrary. The amplitude for the brace (10pt) and the offsets (10pt and 20pt) are matters of taste.

If you happen to know a faster way of doing this, please do share in a comment.