Terminology
First, I need to clarify terminology. Any reference to "big M" means the incorporation into a model of coefficients (usually represented by $M$) that are in a sense surrogates for $\infty$. The phrase "big M method" is frequently used to cover two somewhat different modeling techniques.- One eliminates the first phase (finding a feasible solution) of the two-phase simplex method by merging the phase I and phase II objective functions into a single function, penalizing the artificial variables with very large coefficients ($M$). It's an example of using a penalty function, and requires a sufficiently large value of $M$ to ensure that violating a constraint is never optimal.
- The other provides a way for binary variables to turn constraints on or off. Consider, for example, a lockbox problem in which the number $x_{ij}$ of checks from customer $i$ assigned to lockbox location $j$ cannot exceed some capacity limit $M$ if the location is used (binary variable $y_j=1$), but should be zero if the location is not used ($y_j=0$). This is formulated as $$x_{ij}-My_j\le 0.$$
Rounding error
My first course in graduate school was in numerical analysis, and it came as a bit of a shock to my system. We covered rounding, truncation and representation errors rather extensively -- very disconcerting stuff to a pure mathematician. One would like to think that computers do arithmetic accurately. One would like to believe that politicians are honest and competent. One is destined to be disappointed both times.I won't go into a lengthy introduction to the aforementioned types of error (which I'll somewhat cavalierly lump together under the name "rounding error"). Let me instead give a small example. The code is in Python (which represents all floating point numbers in double precision), but it could be any language.
x = 1.0e10 y = 1.0e-9 z = x - y w = z - x print x, y, z, w print (z == x) print (w == -y), (w == 0)The output of the first print line is:
1.00000000000000e10 1.00000000000000e-9 1.00000000000000e10 0.000000000000000Note that z looks suspiciously like x (it should be strictly smaller), and w (which should be -y) is zero. That could just be limited precision in the output. The next three prints, however, tell the tale. The first (z==x) and third (w==0) should be false, while the second (w==-y) should be true. What we actually get:
True False TrueWhoops!
This leads me to a few truisms from that long-ago numerical analysis course:
- Addition or subtraction of numbers introduce rounding errors, and addition or subtraction of numbers with substantially different magnitudes introduce rounding headaches. (Multiplication and division can also introduce small errors, but addition and subtraction tend to be the killers. The biggest problem with multiplication or division is probably overflow or underflow.)
- Comparisons are closet subtractions. It's generally unsafe to test $x=y$ when $x$ and $y$ are floating point numbers. Instead, test $|x-y|\lt \epsilon$ for a suitably small (but not too small) $\epsilon \gt 0$.
- In particular, things that should be zero often come out not quite zero (commonly referred to as "decimal dust"). More than once I've seen someone ask why some solver gave an optimal solution to his problem that had a binary variable equal to 1.0e-9 (throw in a few more junk digits if you like). It should be either 0 or 1! Well, yes; but if the solver held out for exactly 0 or exactly 1, you'd get a lot of wrong answers due to decimal dust. So the solver accepts that "darned near 0" is 0 and "darned near 1" is 1.
$M$ in the objective or RHS
Having $M$ in the objective function (as in the first definition of "big M" above, eliminating phase I of the simplex method) or having it as the right hand side of a constraint introduces rounding error (per my first bullet above). Specifically, $M$ in the objective can cause rounding errors in reduced costs, which may confuse the solver about which columns are candidates for entry to the basis. $M$ in the right hand side can introduce rounding errors in the right hand sides of subsequent tableaus, which may cause problems selecting the variable to leave the basis. As far as I know, those issues are generally survivable, and I've not seen any crash landings caused by it. Perhaps someone who has will comment with an example of a bad outcome from $M$ on the perimeter of the model.$M$ in the constraints
The lockbox formulation illustrates the situation where $M$ shows up as a coefficient in the constraint matrix. This is where serious misadventures can occur. If $M$ creeps into the basis matrix $\textbf{B}$ (or a column that is a candidate for entry into the basis matrix), the ensuing rounding errors can cause $\textbf{B}$ to look singular (or can cause a column that should be ineligible for entry to the basis, because it would make $\textbf{B}$ singular, look eligible). The rounding errors can also cause severe loss of precision in the computation of $\textbf{B}^{-1}$. In technical terms, $M$ (or $1/M$) appearing in $\textbf{B}$ can make $\textbf{B}$ ill-conditioned.Suppose that, in a formulation with $M$ in constraints, we bump into the following candidate for a basis matrix$$\textbf{B}=\left[\begin{array}{ccc} 1 & 0 & \frac{1}{M} \\ M & 0 & 1 \\ 1 & M & 0\end{array}\right].$$Since $$\textbf{B}\ \left[\begin{array}{c}1 \\ \frac{-1}{M} \\ -M\end{array}\right]=\left[\begin{array}{c}0\\0\\0\end{array}\right]$$it is obvious that $\textbf{B}$ is singular. Right? Well, let's check the determinant of $\textbf{B}$. I did the calculations using Sage 4.7, but I have no reason to think other packages would do any better (although pathologies might well vary).
Here is the Sage code I used to compute the determinants:
B = matrix(3,3,[1.0,0.0,1.0/M,M,0.0,1.0,1.0,M,0.0]); [det(B.substitute(M=10.^k)) for k in (28, 29, 30, 31)];Here is a table of the results:$$\begin{array}{rccc}M & 10^{28} & 10^{29} \\ \det(\textbf{B}) & 0.000000000000000 & 0.000000000000000 \\ \\
M & 10^{30} & 10^{31} \\ \det(\textbf{B}) & -1.40737488355328e14 & 0.000000000000000\end{array}$$
So something inexplicable (but bad) happened when $M=10^{30}$, presumably due to loss of precision in some key computation.
You won't see exactly this glitch in a typical "big M" model, but if you play with enough "big M" formulations, sooner or later (probably sooner) you will trip over numerical instability attributable to mixing disgustingly large coefficients with considerably smaller ones.
$M$ in MIP models
As I mentioned above, one common use of "big M" formulations (illustrated by the lockbox example) is to allow a binary variable to turn a constraint on or off. Even if ill-conditioned basis matrices do not occur, large values of $M$ can cause branch-and-bound solvers to make slow progress solving the mixed-integer programming model (MIP). Consider our simple lockbox constraint: $x_{ij}-My_j\le 0$. There will be an associated penalty cost (say, $c_j y_j$) in the objective function for using location $j$. Now suppose that, at some point, the solver is considering a node where $x_{1j}=2$, $x_{2j}=5$ and $x_{ij}=0$ for other values of $i$. Logically, we know this requires $y_j=1$ and incurs cost $c_j$; but in the LP-relaxation of the node problem, $y_j=5/M$ is sufficient, incurring a cost of just $5c_j/M$. For large values of $M$, this substantially underestimates the true cost, leading to loose node bounds. Loose bounds at nodes in turn make it hard for the solver to prune nodes based on objective value, and so more nodes need to be examined, slowing the solution process.$M$ and (dubious) pedagogy
I can't write a post this long without at least one rant. Let me stipulate up front that neither text book authors nor instructors can be expected to cover every exigency. That said, many instructors and authors introduce students to "big M" formulations in the abstract, because they are mathematically neat and easy to explain, and then move on to the next topic without regard to the grubby implementation details. I confess to having done this myself when I was young and foolish.At least one thing, though, is simply notational laziness. That $M$ in my lockbox example? It should be $M_j$. There is no reason why a single value should be used for all instances of $M$ in a model, and very good reasons why the values should differ. I don't recall seeing $M$ subscripted very often, if ever. (Are text book authors charged for each subscript they use?) In certain examples, including lockbox problems, there is a natural and obvious value for $M$ (typically a capacity limit), and in those cases you see individualized values being used. More often, though, there is just this undifferentiated symbol $M$ floating throughout the model. I think that influences users to actually use a single, very large value everywhere. (I recently saw a model on a CPLEX support forum that was clearly a "big M" model with a single large value of $M$ in about half the constraints.)
Concluding thoughts...
... and if you're still with me, you're very ready for this post to conclude.- If you are going to use a "big M" formulation, look for the smallest values of $M$ that work in the context of the model. (I once wrote a paper in which I analytically derived a sufficiently large but not horribly inflated value for $M$. There are tricks for doing so.)
- $M$ does not need to be one size fits all.
- Particularly if $M$ will show up as the coefficient of a variable in a constraint, be prepared for numerical instability. Learn how to detect numerical problems and what levers your solver has for coping with instability.
- Consider alternatives to a "big M" approach (two phase simplex, Benders decomposition, ...) if you have stability or accuracy problems you cannot resolve.
Another idea is to invent a class which has two components: M's component and the common number component. Moreover, the mathematical operators concerning this new class need to be implemented. I guess this will have more work to do.
ReplyDeleteSo worth reading. Every single line of it.
ReplyDeleteExcellent!
ReplyDeleteI will add in some situation the big M formulation is a question of "laziness" i.e the model should be treated in a two stage approach, one handling feasibility issue and the other optimality.
@fbahr, @Bo: Thanks for the kind words. :-)
ReplyDelete@Bo: Good point about the two stage approach. My models are always feasible (something about the nature of the problems I work on, I guess), so I lose track of this.
@komar: I had a similar thought years ago -- either treat it similar to complex numbers (i -> M), but with different arithmetic rules, or if necessary replace scalars with polynomials in M (and 1/M). I'm not sure if defining an algebra where M is an idempotent (M^2 = M) and 1/M is treated as zero would retain sufficient precision. Pivoting would be more work, and storage would increase. Ultimately, I decided that it wasn't worth pursuing; the time might be better invested in finding ways to choose reasonable values for M (or looking for alternatives to using M). I might have been short-sighted, though. Maybe this will be someone's dissertation someday.
Very nice discussion, Paul.
ReplyDeleteA couple of points on big M in the simplex method:
- One reason you don't hear complaints is that almost nobody actually implements big M. Most simplex codes use some variant of the two-phase method.
- One can always reoptimize the big-M solution using the original objective. If M is well chosen, the big-M solution is feasible, so it can be used as a starting solution. And (with some luck) it should be at least close to optimal for the original objective. Now you essentially have a two-phase method that takes account of the original objective in phase I.
I have sometimes wondered if big M could be implemented correctly with IEEE Infinity, which IIRC satisfies your hypothetical algebra rules.
@Matt: Thanks. You're right that simplex codes don't use the penalty "big M" method, but (per Bo's comment) I think sometimes users do it in a (misguided?) attempt to establish feasibility (or reason for infeasibility) and optimality in one swell foop, rather than either solve a feasibility problem first or rely on an IIS finder in the solver.
ReplyDeleteRe IEEE infinity, you would need to use a floating point library that didn't spit up when doing arithmetic on infinity. Also, I'm embarrassed to admit that I can't recall whether IEEE standard thinks 1/infinity is zero, NaN or what.
Apparently, 1.0/Infinity == 0 (which makes sense to me). The current C standard is supposed to support IEEE 754, IIRC, as is Java.
ReplyDeleteThis is fantastic, and I'm definitely going to point people to this when the need arises. One thing I might have added is that an overly large big-M in MIP models often lets you cheat; for example, the 'integer' variable could take on a value of epsilon; it would be considered integer feasible with respect to the integrality tolerance yet the big-M might be large enough to allow some related variable to be nonzero (ex: to allow a value without incurring the associated fixed cost).
ReplyDelete@Greg: Great point. I was thinking of epsilon values for the binary variable in the relaxation solution, but of course you're right -- with M big enough, that epsilon looks integer-feasible (to within tolerance).
ReplyDeleteGreat explanation Paul.
ReplyDeleteThanks, David.
ReplyDeleteI wish all our MOSEK customers would read your post.
ReplyDelete@Erling: Thanks. That's the problem right now, though; I suspect that I'm preaching mostly to the choir.
ReplyDeleteBy the way, SCIP also features "indicator constraints", with an additional binary variable for that "either/or" constraint you had. This will then decompose into a "variable bound constraint", using a big-M that is computed from the variable bounds in the problem automatically (I think).
ReplyDeleteCPLEX also has "indicator constraints", and I wouldn't be shocked if Gurobi and/or Xpress did as well. AFAIK, there's two ways to implement indicator constraints, and I have no idea which solvers use which methods. Method 1 is as you describe: turn them into big-M constraints, with the software taking a stab at a (hopefully tight) value of M. Method 2 is to handle them in the branching logic - wait (hopefully not too long) until you branch on the binary variable, then add the constraint to one child and not to the other. As noted above, the first approach can cause the constraint to have minimal impact on bounds until the binary variable gets fixed. The second approach seems to imply that the constraint has _no_ impact on bounds until the binary variable gets fixed. The second approach does, however, avoid numerical stability problems caused by M.
ReplyDeleteYes, as Paul mentioned, indicator constraints, while removing the numerical stability issues caused by M, potentially result in a weaker relaxation that makes the MIP model more difficult to solve. However,
ReplyDeleteCPLEX does a pretty good job of preprocessing variable bounds and figuring out the smallest possible value of M it can legitimately use.
It can then select among Method 1 and Method 2 from Paul's preceding post.
If you have to use big M in the formulation, one rule I have found helpful consists of setting the optimizer's integrality tolerance to a value smaller than 1/M. That limits the amount of trickle flow that can creep in as Greg Glockner mentioned above.
And finally, note that in addition to indicator constraints, you can remove big Ms from the formulation using type 1 SOSs. If you have a nonnegative continuous variable x and binary z with
x - Mz <= 0
you can reformulate it by introducing a new binary y and
y + z = 1
{x,y} an SOS1
If z = 0, then y = 1, and the SOS1 forces x to 0.
If z = 1, then y = 0, and the SOS1 doesn't impose any restrictions on x.
Sorry, forgot to mention one thing about the SOS approach above: it has the same issue regarding the indicator constraint approach regarding the potential weakness of the relaxation.
ReplyDelete