When teaching statistics to end users, I resist the exhortations of some of my colleagues that students must do computations by hand to “really learn” about statistics. For instance, said colleagues will argue that in order for a student to grasp the concept of correlation, they must take a set of observations of a pair of variables and (manually) center the variables, compute squares and cross-products, total the squares and cross-products, and use the totals to compute standard deviations and the Pearson correlation coefficient. To me, that is just gratuitous grunt work. Students need to know the domain of the correlation coefficient (-1 to +1), when to use it (looking for linear relationships) or not (looking for nonlinear relationships), and its interpretation (including the dangers of making a leap from correlation to causation). The actual computation can safely be left to software.

On the other hand, the widespread availability of modeling software and powerful solvers has changed the practice of optimization in a very fundamental way. Once the exclusive province of people who had taken extensive coursework in the mathematical theory behind the pertinent algorithms (not to mention formal training in numerical analysis), optimization is now increasingly being undertaken by people who apparently have minimal formal training, and who may understand the contextual domain of the problem they are trying to solve but perhaps do not understand the mathematical underpinnings of the methods being used. I've seen evidence of this in a number of forums recently. One of the symptoms is a failure to understand issues relating to rounding error and numerical stability, ranging from the subtle (poor scaling leading to incorrect answers or solver failures) to the blatant (“My binary variable came out 0.99999999999 -- what went wrong??”).
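To make the "blatant" symptom concrete, here is a quick Python sketch (my own toy code, not tied to any particular solver) of why a reported value like 0.99999999999 is, as far as the solver is concerned, an integer. Solvers work in floating point and accept anything within an integrality tolerance; the tolerance value below is a made-up stand-in for the sort of default real solvers use.

```python
# Solvers work in floating point, so a "binary" variable rarely comes back
# as exactly 0.0 or 1.0. They accept any value within an integrality
# tolerance; 1e-5 here is a hypothetical stand-in for a typical default.
INT_TOL = 1e-5

def is_integral(value, tol=INT_TOL):
    """Treat a value as integer-feasible if it is within tol of an integer."""
    return abs(value - round(value)) <= tol

x = 0.99999999999  # the "what went wrong??" value from the forums
print(is_integral(x))  # nothing went wrong: this is 1, up to rounding error
```

In other words, the solver did its job; the user just needs to round the reported value.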

A few recent posts suggested that some users either do not understand the difference between local and global optima or do not understand the role of convexity in optimization models. In particular, I cringe whenever I see nonlinear equality constraints in a model, whereas they don't seem to bother the model authors very much. (I'm not saying that nonlinear equality constraints are fundamentally wrong; I'm just saying that I'd really like not to have any in my model.) A very simple example will illustrate why.

Consider the following trivial optimization problems:\[ \begin{array}{clccclc} \textrm{(D)} & \min & x & \,\,\, & \textrm{(C)} & \min & x\\ & \textrm{s.t.} & x^{2}+y^{2}\le1 & & & \textrm{s.t.} & x^{2}+y^{2}=1\\ & & -2\le x\le2 & & & & -2\le x\le2\\ & & -2\le y\le2 & & & & -2\le y\le2\end{array}\]The (convex) feasible region of problem (D) is a disk, while the (nonconvex) feasible region of problem (C) is a circle. The solution to both problems is clearly $(x,y)=(-1,0)$.
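As a sanity check, independent of any solver, a brute-force Python sketch that samples points around the circle confirms that the minimum of $x$ on the feasible region of (C) is attained at $(-1,0)$:

```python
import math

# Sample the circle x^2 + y^2 = 1 at 3600 equally spaced angles and
# keep the point with the smallest x coordinate.
points = [(math.cos(2 * math.pi * k / 3600), math.sin(2 * math.pi * k / 3600))
          for k in range(3600)]
best = min(points, key=lambda p: p[0])
print(best)  # approximately (-1.0, 0.0)
```

Since the circle is the boundary of the disk and the objective is linear, the same point solves (D).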

Let's perform a little experiment (using AMPL to model the problems and MINOS 5.51 to solve them). We will solve both (D) and (C) with initial values $(x,y)=(\cos\theta,\sin\theta)$ where $\theta$ is an integer multiple of $\frac{\pi}{4}$. The AMPL code is as follows:

```
# problem definitions
var x >= -2 <= 2;
var y >= -2 <= 2;
minimize Obj: x;
s.t. Disk: x^2 + y^2 <= 1;
s.t. Circle: x^2 + y^2 == 1;
problem D: x, y, Obj, Disk;
problem C: x, y, Obj, Circle;

# bit used to define starting locations
param pi4 := atan(1);  # pi/4
set ANGLE := 0..7;     # from 0 to 7/4*pi

# storage for optimal solutions
param dsol {ANGLE};
param csol {ANGLE};

# solve disk problem from all starting angles
problem D;
for {t in ANGLE} {
  let x := cos(t*pi4);
  let y := sin(t*pi4);
  solve;
  let dsol[t] := x;
}

# solve circle problem from all starting angles
problem C;
for {t in ANGLE} {
  let x := cos(t*pi4);
  let y := sin(t*pi4);
  solve;
  let csol[t] := x;
}

display dsol, csol;
```

The results? For all starting angles, MINOS determines that $(x,y)=(-1,0)$ is the optimal solution to (D). That also applies for (C), with one glaring exception: when $\theta=0$, MINOS decides that the initial value $(x,y)=(1,0)$ is the optimal solution.

The figure below tells the tale. MINOS uses a reduced gradient method, which is a souped-up form of steepest descent. If allowed to, it will move in the direction pointed to by the gradient of the objective function at the current point when maximizing, and in the opposite direction when minimizing. (In this example, the gradient is a constant $(1,0)$.) When movement in that direction is infeasible, it will project the positive or negative gradient onto the hyperplane tangent to the feasible region at the current point, and search in that direction.
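The stall at $(1,0)$ is easy to reproduce without MINOS. Below is a bare-bones Python sketch (my own toy code, emphatically not MINOS's actual reduced gradient implementation) of steepest descent along the circle, parameterized by the angle $\theta$ so every iterate stays feasible. Since $x=\cos\theta$ and $d(\cos\theta)/d\theta=-\sin\theta$, the descent update is $\theta \leftarrow \theta + \alpha\sin\theta$; at $\theta=0$ the projected gradient vanishes and the method goes nowhere.

```python
import math

def descend_on_circle(theta, step=0.1, iters=2000):
    """Minimize x = cos(theta) on the unit circle by steepest descent in theta.

    The derivative of cos(theta) is -sin(theta), so the descent update is
    theta <- theta + step * sin(theta). Wherever sin(theta) = 0 the projected
    gradient vanishes and the iterates stop moving -- including at theta = 0,
    i.e., the point (1, 0).
    """
    for _ in range(iters):
        theta += step * math.sin(theta)
    return math.cos(theta), math.sin(theta)

print(descend_on_circle(math.pi / 4))  # converges to roughly (-1, 0)
print(descend_on_circle(0.0))          # stuck at (1, 0)
```

From any starting angle other than $0$, the iterates slide around the circle to $(-1,0)$; from $\theta=0$ they never leave $(1,0)$, exactly the behavior MINOS exhibited.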

The behavior of MINOS on problem (C) is correct given what's inside the black box. MINOS only guarantees local optima, and even that can only be delivered if the problem is moderately well behaved and/or the starting point is chosen well. Ipopt, another popular nonlinear solver, has similar limitations. From some recent posts that I've seen, though, I have the impression that some users are creating highly nonconvex models and expecting optimal solutions they are not destined to get. Compounding this, of course, is the fact that the solver generally has no way of knowing it has arrived at a suboptimal solution, and therefore cannot alert the user.

Perhaps modeling languages and solvers should ship with a “warning label” stating that use of the software without first reading a book or taking a course in linear/nonlinear/integer programming is inadvisable.

Speaking of books, I can recommend David Luenberger's "Linear and Nonlinear Programming" (an updated version of which has apparently been published with coauthor Yinyu Ye).

You seem to suggest that understanding the solver algorithm is important independent of problem type. For LP and MIP, I don't think you need to know the solver innards. Your example makes sense for NLP, but do you have a similar example for LP and MIP?

Fair question. For LPs, I think the answer is no. There are more and less efficient ways to model some LPs, and at times it may be useful to understand the impact of degeneracy or density, but it's hard to completely screw the pooch with an LP. LP users sometimes get confused about how to find alternate optima (or find out whether alternate optima exist), but you can answer that in terms of solver output without getting into algorithmic requirements.

For MILPs, I'm a bit more ambivalent. (Convexity issues for MINLPs are covered by the post.) Understanding how branch-and-cut works (particularly the bounding part) may help with choices about whether to use a "big M" formulation rather than decomposition, how large to make M, whether to use "indicator constraints" or do the equivalent yourself (back to "big M", but with you rather than the solver picking M), etc. The nice thing about LPs and MILPs, though, is that you automatically get convexity; so, other than maybe thinking a pure LP solver can do a MILP, it's hard to pick the wrong tool for the job.
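To illustrate the "how large to make M" point, here is a small Python sketch of a toy fixed-charge model (my own example, with made-up numbers) showing how the LP relaxation bound of a big-M formulation deteriorates as M grows, which is one reason oversized M values slow down branch-and-bound:

```python
# Toy fixed-charge model: min f*y + c*x  s.t.  x <= M*y,  x >= d,  y in {0,1}.
# Any feasible integer solution needs y = 1, so the true optimum is c*d + f.
# In the LP relaxation (0 <= y <= 1), the solver can set y = d/M, so the
# bound weakens as M grows.
f, c, d = 100.0, 2.0, 5.0  # fixed charge, unit cost, required amount (made up)

print("true MILP optimum:", c * d + f)
for M in (10.0, 100.0, 10_000.0):
    y_relax = d / M               # smallest y the big-M constraint allows
    bound = c * d + f * y_relax   # LP relaxation objective value
    print(f"M = {M:>8}: relaxation bound = {bound}")
```

With M = 10 the relaxation bound is reasonably close to the true optimum of 110; with M = 10,000 the bound collapses toward the variable cost alone, and the fixed charge is effectively invisible to the relaxation.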

Poor scaling can lead to incorrect results in an LP or MILP, as can subtle modeling issues (too subtle to call them user errors). I don't know whether to classify this as a case where you need to understand something about LP/MILP algorithms, or whether to say it's a generic modeling issue that the user should understand without regard to the solver.
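As a concrete (if solver-free) illustration of how bad scaling and rounding error interact, here is a classic numerical-analysis example in Python: Gaussian elimination without pivoting on a badly scaled 2x2 system. The exact solution is essentially $x=y=1$, but the tiny pivot produces a huge multiplier that wipes out one of the answers. (This is a textbook demonstration of why pivoting exists, not a claim about any particular LP code.)

```python
def solve2_no_pivot(a11, a12, b1, a21, a22, b2):
    """Gaussian elimination on a 2x2 system, deliberately without pivoting."""
    m = a21 / a11      # huge multiplier when the pivot a11 is tiny
    a22 -= m * a12     # 1 - 1e20 rounds to -1e20: the original a22 is lost
    b2 -= m * b1
    y = b2 / a22
    x = (b1 - a12 * y) / a11
    return x, y

# Badly scaled system: 1e-20*x + y = 1,  x + y = 2.  True solution ~ (1, 1).
print(solve2_no_pivot(1e-20, 1.0, 1.0, 1.0, 1.0, 2.0))  # x comes out 0.0
# Swapping the rows first (partial pivoting) recovers (1.0, 1.0).
print(solve2_no_pivot(1.0, 1.0, 2.0, 1e-20, 1.0, 1.0))
```

Production LP solvers pivot (and rescale) far more carefully than this, of course, but a model whose coefficients span twenty orders of magnitude is asking the arithmetic to do things it cannot do.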