Monday, July 11, 2011

Perils of "Big M"

A somewhat recent Twitter exchange with Bo Jensen, which I believe was triggered by a post on an optimization forum (details of which are fading rapidly from my memory), motivated this entry. It deals with optimization models (specifically linear or integer linear ones) that are constructed using what is sometimes referred to as the "big M" method.


First, I need to clarify terminology.  Any reference to "big M" means the incorporation into a model of coefficients (usually represented by $M$) that are in a sense surrogates for $\infty$. The phrase "big M method" is frequently used to cover two somewhat different modeling techniques.
  • One eliminates the first phase (finding a feasible solution) of the two-phase simplex method by merging the phase I and phase II objective functions into a single function, penalizing the artificial variables with very large coefficients ($M$). It's an example of using a penalty function, and requires a sufficiently large value of $M$ to ensure that violating a constraint is never optimal.
  • The other provides a way for binary variables to turn constraints on or off. Consider, for example, a lockbox problem in which the number $x_{ij}$ of checks from customer $i$ assigned to lockbox location $j$ cannot exceed some capacity limit $M$ if the location is used (binary variable $y_j=1$), but should be zero if the location is not used ($y_j=0$). This is formulated as $$x_{ij}-My_j\le 0.$$

Rounding error

My first course in graduate school was in numerical analysis, and it came as a bit of a shock to my system. We covered rounding, truncation and representation errors rather extensively -- very disconcerting stuff to a pure mathematician. One would like to think that computers do arithmetic accurately. One would like to believe that politicians are honest and competent. One is destined to be disappointed both times.

I won't go into a lengthy introduction to the aforementioned types of error (which I'll somewhat cavalierly lump together under the name "rounding error"). Let me instead give a small example. The code is in Python (which represents all floating point numbers in double precision), but it could be any language.
x = 1.0e10
y = 1.0e-9
z = x - y
w = z - x
print x, y, z, w
print (z == x)
print (w == -y), (w == 0)
The output of the first print line is: 
1.00000000000000e10 1.00000000000000e-9 1.00000000000000e10
 Note that z looks suspiciously like x (it should be strictly smaller), and w (which should be -y) is zero. That could just be limited precision in the output. The next three prints, however, tell the tale. The first (z==x) and third (w==0) should be false, while the second (w==-y) should be true. What we actually get:
False True

This leads me to a few truisms from that long-ago numerical analysis course:
  • Addition or subtraction of numbers introduce rounding errors, and addition or subtraction of numbers with substantially different magnitudes introduce rounding headaches. (Multiplication and division can also introduce small errors, but addition and subtraction tend to be the killers. The biggest problem with multiplication or division is probably overflow or underflow.)
  • Comparisons are closet subtractions. It's generally unsafe to test $x=y$ when $x$ and $y$ are floating point numbers. Instead, test $|x-y|\lt \epsilon$ for a suitably small (but not too small) $\epsilon \gt 0$.
  • In particular, things that should be zero often come out not quite zero (commonly referred to as "decimal dust"). More than once I've seen someone ask why some solver gave an optimal solution to his problem that had a binary variable equal to 1.0e-9 (throw in a few more junk digits if you like). It should be either 0 or 1!  Well, yes; but if the solver held out for exactly 0 or exactly 1, you'd get a lot of wrong answers due to decimal dust. So the solver accepts that "darned near 0" is 0 and "darned near 1" is 1. 

    $M$ in the objective or RHS

    Having $M$ in the objective function (as in the first definition of "big M" above, eliminating phase I of the simplex method) or having it as the right hand side of a constraint introduces rounding error (per my first bullet above). Specifically, $M$ in the objective can cause rounding errors in reduced costs, which may confuse the solver about which columns are candidates for entry to the basis. $M$ in the right hand side can introduce rounding errors in the right hand sides of subsequent tableaus, which may cause problems selecting the variable to leave the basis. As far as I know, those issues are generally survivable, and I've not seen any crash landings caused by it. Perhaps someone who has will comment with an example of a bad outcome from $M$ on the perimeter of the model.

    $M$ in the constraints

    The lockbox formulation illustrates the situation where $M$ shows up as a coefficient in the constraint matrix. This is where serious misadventures can occur. If $M$ creeps into the basis matrix $\textbf{B}$ (or a column that is a candidate for entry into the basis matrix), the ensuing rounding errors can cause $\textbf{B}$ to look singular (or can cause a column that should be ineligible for entry to the basis, because it would make $\textbf{B}$ singular, look eligible). The rounding errors can also cause severe loss of precision in the computation of $\textbf{B}^{-1}$. In technical terms, $M$ (or $1/M$) appearing in $\textbf{B}$ can make $\textbf{B}$ ill-conditioned.

    Suppose that, in a formulation with $M$ in constraints, we bump into the following candidate for a basis matrix$$\textbf{B}=\left[\begin{array}{ccc} 1 & 0 & \frac{1}{M} \\ M & 0 & 1 \\ 1 & M & 0\end{array}\right].$$Since $$\textbf{B}\ \left[\begin{array}{c}1 \\ \frac{-1}{M} \\ -M\end{array}\right]=\left[\begin{array}{c}0\\0\\0\end{array}\right]$$it is obvious that $\textbf{B}$ is singular. Right? Well, let's check the determinant of $\textbf{B}$. I did the calculations using Sage 4.7, but I have no reason to think other packages would do any better (although pathologies might well vary).

    Here is the Sage code I used to compute the determinants:
    B = matrix(3,3,[1.0,0.0,1.0/M,M,0.0,1.0,1.0,M,0.0]);
    [det(B.substitute(M=10.^k)) for k in (28, 29, 30, 31)];
    Here is a table of the results:$$\begin{array}{rccc}M & 10^{28} & 10^{29} \\ \det(\textbf{B}) & 0.000000000000000 & 0.000000000000000 \\  \\
    M & 10^{30} & 10^{31} \\ \det(\textbf{B}) & -1.40737488355328e14 & 0.000000000000000\end{array}$$
    So something inexplicable (but bad) happened when $M=10^{30}$, presumably due to loss of precision in some key computation.

    You won't see exactly this glitch in  a typical "big M" model, but if you play with enough "big M" formulations, sooner or later (probably sooner) you will trip over numerical instability attributable to mixing disgustingly large coefficients with considerably smaller ones.

    $M$ in MIP models

    As I mentioned above, one common use of "big M" formulations (illustrated by the lockbox example) is to allow a binary variable to turn a constraint on or off. Even if ill-conditioned basis matrices do not occur, large values of $M$ can cause branch-and-bound solvers to make slow progress solving the mixed-integer programming model (MIP). Consider our simple lockbox constraint: $x_{ij}-My_j\le 0$. There will be an associated penalty cost (say, $c_j y_j$) in the objective function for using location $j$. Now suppose that, at some point, the solver is considering a node where $x_{1j}=2$, $x_{2j}=5$ and $x_{ij}=0$ for other values of $i$. Logically, we know this requires $y_j=1$ and incurs cost $c_j$; but in the LP-relaxation of the node problem, $y_j=5/M$ is sufficient, incurring a cost of just $5c_j/M$. For large values of $M$, this substantially underestimates the true cost, leading to loose node bounds.  Loose bounds at nodes in turn make it hard for the solver to prune nodes based on objective value, and so more nodes need to be examined, slowing the solution process.

    $M$ and (dubious) pedagogy

    I can't write a post this long without at least one rant. Let me stipulate up front that neither text book authors nor instructors can be expected to cover every exigency. That said, many instructors and authors introduce students to "big M" formulations in the abstract, because they are mathematically neat and easy to explain, and then move on to the next topic without regard to the grubby implementation details. I confess to having done this myself when I was young and foolish.

    At least one thing, though, is simply notational laziness. That $M$ in my lockbox example? It should be $M_j$. There is no reason why a single value should be used for all instances of $M$ in a model, and very good reasons why the values should differ. I don't recall seeing $M$ subscripted very often, if ever. (Are text book authors charged for each subscript they use?) In certain examples, including lockbox problems, there is a natural and obvious value for $M$ (typically a capacity limit), and in those cases you see individualized values being used. More often, though, there is just this undifferentiated symbol $M$ floating throughout the model. I think that influences users to actually use a single, very large value everywhere. (I recently saw a model on a CPLEX support forum that was clearly a "big M" model with a single large value of $M$ in about half the constraints.)

    Concluding thoughts...

    ... and if you're still with me, you're very ready for this post to conclude.
    • If you are going to use a "big M" formulation, look for the smallest values of $M$ that work in the context of the model. (I once wrote a paper in which I analytically derived a sufficiently large but not horribly inflated value for $M$. There are tricks for doing so.)
    • $M$ does not need to be one size fits all.
    • Particularly if $M$ will show up as the coefficient of a variable in a constraint, be prepared for numerical instability. Learn how to detect numerical problems and what levers your solver has for coping with instability.
    • Consider alternatives to a "big M" approach (two phase simplex, Benders decomposition, ...) if you have stability or accuracy problems you cannot resolve.


    1. Another idea is to invent a class which has two components: M's component and the common number component. Moreover, the mathematical operators concerning this new class need to be implemented. I guess this will have more work to do.

    2. So worth reading. Every single line of it.

    3. Excellent!

      I will add in some situation the big M formulation is a question of "laziness" i.e the model should be treated in a two stage approach, one handling feasibility issue and the other optimality.

    4. @fbahr, @Bo: Thanks for the kind words. :-)

      @Bo: Good point about the two stage approach. My models are always feasible (something about the nature of the problems I work on, I guess), so I lose track of this.

      @komar: I had a similar thought years ago -- either treat it similar to complex numbers (i -> M), but with different arithmetic rules, or if necessary replace scalars with polynomials in M (and 1/M). I'm not sure if defining an algebra where M is an idempotent (M^2 = M) and 1/M is treated as zero would retain sufficient precision. Pivoting would be more work, and storage would increase. Ultimately, I decided that it wasn't worth pursuing; the time might be better invested in finding ways to choose reasonable values for M (or looking for alternatives to using M). I might have been short-sighted, though. Maybe this will be someone's dissertation someday.

    5. Very nice discussion, Paul.

      A couple of points on big M in the simplex method:

      - One reason you don't hear complaints is that almost nobody actually implements big M. Most simplex codes use some variant of the two-phase method.

      - One can always reoptimize the big-M solution using the original objective. If M is well chosen, the big-M solution is feasible, so it can be used as a starting solution. And (with some luck) it should be at least close to optimal for the original objective. Now you essentially have a two-phase method that takes account of the original objective in phase I.

      I have sometimes wondered if big M could be implemented correctly with IEEE Infinity, which IIRC satisfies your hypothetical algebra rules.

    6. @Matt: Thanks. You're right that simplex codes don't use the penalty "big M" method, but (per Bo's comment) I think sometimes users do it in a (misguided?) attempt to establish feasibility (or reason for infeasibility) and optimality in one swell foop, rather than either solve a feasibility problem first or rely on an IIS finder in the solver.

      Re IEEE infinity, you would need to use a floating point library that didn't spit up when doing arithmetic on infinity. Also, I'm embarrassed to admit that I can't recall whether IEEE standard thinks 1/infinity is zero, NaN or what.

    7. Matthew SaltzmanJuly 12, 2011 at 2:52 PM

      Apparently, 1.0/Infinity == 0 (which makes sense to me). The current C standard is supposed to support IEEE 754, IIRC, as is Java.

    8. This is fantastic, and I'm definitely going to point people to this when the need arises. One thing I might have added is that an overly large big-M in MIP models often lets you cheat; for example, the 'integer' variable could take on a value of epsilon; it would be considered integer feasible with respect to the integrality tolerance yet the big-M might be large enough to allow some related variable to be nonzero (ex: to allow a value without incurring the associated fixed cost).

    9. @Greg: Great point. I was thinking of epsilon values for the binary variable in the relaxation solution, but of course you're right -- with M big enough, that epsilon looks integer-feasible (to within tolerance).

    10. I wish all our MOSEK customers would read your post.

    11. @Erling: Thanks. That's the problem right now, though; I suspect that I'm preaching mostly to the choir.

    12. By the way, SCIP also features "indicator constraints", with an additional binary variable for that "either/or" constraint you had. This will then decompose into a "variable bound constraint", using a big-M that is computed from the variable bounds in the problem automatically (I think).

    13. CPLEX also has "indicator constraints", and I wouldn't be shocked if Gurobi and/or Xpress did as well. AFAIK, there's two ways to implement indicator constraints, and I have no idea which solvers use which methods. Method 1 is as you describe: turn them into big-M constraints, with the software taking a stab at a (hopefully tight) value of M. Method 2 is to handle them in the branching logic - wait (hopefully not too long) until you branch on the binary variable, then add the constraint to one child and not to the other. As noted above, the first approach can cause the constraint to have minimal impact on bounds until the binary variable gets fixed. The second approach seems to imply that the constraint has _no_ impact on bounds until the binary variable gets fixed. The second approach does, however, avoid numerical stability problems caused by M.

    14. Yes, as Paul mentioned, indicator constraints, while removing the numerical stability issues caused by M, potentially result in a weaker relaxation that makes the MIP model more difficult to solve. However,
      CPLEX does a pretty good job of preprocessing variable bounds and figuring out the smallest possible value of M it can legitimately use.
      It can then select among Method 1 and Method 2 from Paul's preceding post.

      If you have to use big M in the formulation, one rule I have found helpful consists of setting the optimizer's integrality tolerance to a value smaller than 1/M. That limits the amount of trickle flow that can creep in as Greg Glockner mentioned above.

      And finally, note that in addition to indicator constraints, you can remove big Ms from the formulation using type 1 SOSs. If you have a nonnegative continuous variable x and binary z with

      x - Mz <= 0

      you can reformulate it by introducing a new binary y and

      y + z = 1
      {x,y} an SOS1

      If z = 0, then y = 1, and the SOS1 forces x to 0.
      If z = 1, then y = 0, and the SOS1 doesn't impose any restrictions on x.

    15. Sorry, forgot to mention one thing about the SOS approach above: it has the same issue regarding the indicator constraint approach regarding the potential weakness of the relaxation.

    16. how to write this constraints let say x(i,j,k) is a binary variable and if x(i,j,k)=1 then A(i) >= B(j,k) ?

    17. Hi Paul,

      Thank you for this post. As, you wrote in this post that, once you wrote a paper on analytical derivation to select value for M. Could you please refer the paper here? I would like to read the paper.



      1. Dewan,

        The paper is:

        Rubin, P. A. (1990) Heuristic Solution Procedures for a Mixed-Integer Programming Discriminant Model. Managerial and Decision Economics 11, 255-266.

        Calculation of the M values starts on the right side of page 257 and ends with equation (12). The calculations are quite specific to the particular application (discriminant analysis).

      2. Thank you very much Paul.

    18. Hello Paul,

      Firstly i would like to thank you for this post.
      In my formulation, i use the big M to linearize Z=A*B knowning that B is boolean variable and (Z and A) are integer, to do this i have the following constraints:


      My question is about the imprecision of the objective function? Is it possible to get this imprecision? If you have some papers, please give them to me.


      1. You're welcome. Unfortunately, I don't know of any formal discussions of how to estimate imprecision when you introduce M. Imprecision will of course be in the variable values and indirectly, through them, in the objective.

        You want M to be the tightest a priori upper bound for A you can find. If you are worried that your M value is causing precision problems, you can try tweaking solver settings that affect numerical precision, or you can adapt the trick Ed Klotz posted in his first comment. Define a second binary variable B' and two nonnegative variables U, V, and add the constraints
        A <= Z + U,
        Z <= A,
        Z <= V,
        B + B' = 1.
        Also make {U, B} and {V, B'} SOS1 sets. This will produce the desired result with no precision adventures. Adding the extra binary variable should not materially affect solution time, since B + B' = 1 makes them an SOS1.

    19. Hi Paul,

      I wonder if is it possible to use the Big M to linearize "A/(B*C)", such as: A is a real variable, C is a parameter and B is a real variable that takes 1 or epsilon (very small value between <= to 1 and > to 0).

      The objective is: if B=1 then A/C
      if B=epsilon then 0

      Thanks a lot for this post

      Best regards

      1. You want $B$ to have only those two possible values (1 or $\epsilon$)?
        Let $z=A/(B*C)$, let $M_{1},\dots,M_{4}$ be suitably large positive
        constants, and let $y$ be a binary variable. Add the following:
        z\le A/C+M_{1}(1-y)\\
        z\ge A/C-M_{2}(1-y)\\
        z\le M_{3}y\\

    20. Ok thanks Paul,

      But in some cases Z= M i don't get the correct value for z.

      i put M1=M2=M3=M4=1000

      Do you think the problem comes with the value chosen for M ?

    21. $z=M_i$ should not be possible. You made $y$ binary, right? If $y=0$, the third and fourth constraints resolve to $0\le z\le 0$, so $z$ must be 0. If $y=1$, the first and second constraints resolve to $A/C\le z\le A/C$, so $z$ must equal $A/C$. If $M_3$ and $M_4$ are chosen too small (less than $A/C$ can be), then either $y=0$, $z=0$ is forced or the model becomes infeasible. Similarly, if $M_1$ and $M_2$ are chosen too small to allow $z=0$, then $y=1$ is forced (or the model becomes infeasible).

      I don't think the values of $M_i$ are large enough to cause serious numerical instability, so I suggest you double-check your model and make sure everything is coded correctly.

    22. Dear,
      My name is sothea, it is to see your discusion about Big M method. I use this method to linearise my problem but i dont know whether i have instability issue. In my model, i use M=1000. As i understand from Paul's command, it is not enough value to cause a numerial problem, right? I have another question about computation time. between using big M methode and indicator variable, which one is better in term of computation time? and how to set SOS in Cplex? do you have any experience to solv 1 million of coeffient nonzero of MIQP in Cplex? in fact my problem i have a model using MIQP which has about 1 million of coeffient nonzero. I tried to solve this but i never reached the integer solutions. Do you have any sugestion to handle it? Thanks in advance,

      1. "In my model, i use M=1000. As i understand from Paul's command, it is not enough value to cause a numerial problem, right?"

        That's hard to say. I've seen models with considerably larger values of $M$ and no stability problems; but I have also seen models with no coefficient larger than two or three digits and yet with stability issues. It very much depends on the specific coefficient matrix.

        between using big M methode and indicator variable, which one is better in term of computation time?

        I assume you mean indicator constraints? If you use indicator constraints, CPLEX may convert them to a big M model, with CPLEX inferring a value of $M$. So the best answer I can give you is that if you can come up with a tighter (but still valid) value of $M$ than CPLEX can, big M may be faster; but if CPLEX finds a better $M$ than you do, indicator constraints may be faster.

        how to set SOS in Cplex?

        In the Java API, use IloCplexModeler.addSOS1 or IloCplexModeler.addSOS2. In the C++ API, there are IloSOS1 and IloSOS2 classes, which are subclasses of IloConstraint. Something similar should exist for other APIs.

        do you have any experience to solv 1 million of coeffient nonzero of MIQP in Cplex?

        No, neither: I never deal with integer quadratic models, and I have never come any where near one million nonzeros. I'm happy to restrict myself to small, linear models. :-)

    23. Thank you so much for this post! It was really helpful in understanding some of the shortcomings of the "big M" method. Is there any reference (article) that further develops this subject? Thank you!

      1. You're welcome! As far as a reference goes, I've seen it come up (at least tangentially) in one or two slide shows/presentations about numerical difficulties (at least once in a presentation by a solver vendor), but I'm not aware of a comprehensive treatment in a book or journal article. There's a nice paper by Codato and Fischetti (Operations Research, 2006, 54, pp. 756-766), for example, that includes statements about big M formulations such as "[...] due to the presence of the big-M coefficients, the LP relaxation of the MIP model is typically very poor", but without proof or citation, or any discussion as to why.

        I imagine there must be a full treatment of it in print somewhere, but my personal experience was that this is one of a number of topics that is not typically discussed in text books or taught in college courses, and instead is left to the user to learn through the "school of hard knocks".

    24. Hi, thaks for the article, is there any way to get rid of Big-M when forming Mccormick envelopes ?
      x1 >= x2 - BigM*(1-w)
      x1 <= x2
      x1 <= BigM*(w)

      where x1,x2 are continuous, and w is a binary

      1. You're welcome, and not that I know of (in that order). What you definitely can do is try to find the tightest possible values for $M$. Here are three options (in descending order of the tightness of the bounds and ascending order of ease/speed of computing the bounds):

        1. Solve two MIPs, using the same constraints as the original problem (other than the McCormick envelopes). In the first, maximize $x_2 - x_1$. The objective value gives you $M$ for the first constraint. In the second, maximize just $x_1$. The objective value gives you $M$ for the third constraint.

        2. Same as (1), but use the LP relaxation (faster, but with a possible looser bound).

        3. Assuming $a_i \le x_i \le b_i$ for $i = 1,2$ with $a_i$ and $b_i$ (finite) parameters, set $M=b_2 - a_1$ in the first constraint and $M=b_1$ in the third constraint.

    25. This comment has been removed by the author.

    26. Hi Paul,

      Thank you for the valuable comments. I have used big M for linearizing multiplication of a binary and a continuos varible. When I play around with M value, I observe changes in my objective function value. What might be the problem? Is it about precision?

      1. If you make $M$ too small (meaning smaller than the optimal value of the product), obviously that can affect your objective value. Also, the integrality tolerance for the solver you are using (how close the computed value of the binary variable $x$ must be before the solver accepts it as satisfying the integrality condition) can come into play. Let's say that part of your linearization is $z\le M x$, where $z$ represents the product term and $x$ is binary, and that your integrality tolerance setting is slightly greater than $\epsilon > 0$ (meaning that any value within $\epsilon$ of an integer will be considered "integer"). Let's also assume that the optimal value of $z$ is $z^* \gt 0$. If $M\ge z^*/\epsilon$, then the solver will accept $z=z^*$ and $x=\epsilon$ as part of a feasible, and possibly optimal solution, with $x$ effectively zero but $z$ not zero due to rounding/tolerance issues.

        Two things to try would be to get a sample of basis condition numbers from the solver (to get a feel for whether $M$ is causing numerical instability), and to try fiddling with $M$ while the solver's integrality tolerance is cranked down to zero (which may cause the solver to take longer to prove optimality). It's possible that the solver's feasibility tolerance (how much rounding is allowed in constraints) may also come into play.

        Last comment: in my opinion, all variables in practical models should be bounded whenever possible (with "reasonable" bounds, not just arbitrarily large upper bounds). When linearizing the product of a binary variable and a bounded continuous variable, the bounds of the continuous variable fill the role of $M$. Using a smaller $M$ risks incorrectness, while using a larger $M$ is unnecessary (slows convergence, risks rounding or stability problems). If you do not have obvious bounds on the continuous variables, you may want to invest some time finding correct but reasonably tight bounds on them, especially if the alternative is the dreaded 10 to some large power for $M$.

    27. Thanks for a great post. Why question is why "M" and not some other letter? I am curious if you know any history behind this and its first use. Thanks!

      1. Sarah: Good question! No, I don't know the formal history of it. I can hazard a guess. Back when the method was first developed, papers were produced on typewriters (IBM Selectrics in my doctoral student days), and Greek letters were "expensive". Secretaries in our Math Dept. had to put a special die for each one, attached to a plastic handle, into the typewriter, type something (anything) causing it to strike the ribbon, then pull it out and iterate. (The hammer/die gizmos looked a bit like toothbrushes ... very strange toothbrushes.) That created a major incentive to stick with English (Latin) letters where possible.

        Add to that the informal mathematical conventions (dating back even further) that letters late in the alphabet tend to be variables, letters early in the alphabet tend to be constants, and letters in the vicinity of "i" tend to be subscripts, and your choices for the symbol in question narrow. A/a, B/b and C/c were probably already in wide use as constraint coefficients, right-hand sides and objective coefficients respectively. so the choices narrow a bit more.

        Mind you, this is guesswork on my part, and it still does not explain why "M" and not, for example, "H". The original choice of "M" was probably rather arbitrary on someone's part, and after that it was just repeated by people citing or influenced by the early use until it became a convention.

        That still leaves the question of who first introduced the method (and the letter "M").


    If this is your first time commenting on the blog, please read the Ground Rules for Comments. In particular, if you want to ask an operations research-related question not relevant to this post, consider asking it on OR-Exchange.