Friday, March 27, 2020

A Minimum Variance Partition Problem

Someone posed a question on Mathematics Stack Exchange that I answered there. This post is to explain the model I used (and generalize a bit).

In general terms, you start with a collection $\Omega=[\omega_1, \dots, \omega_n]$ of positive integers. I choose to think of the integers as weights. The mission is to partition $\Omega$ into subcollections $S_1, \dots, S_m$ such that (a) you use as few subcollections as possible (i.e., minimize $m$), (b) the variance of their weight totals is minimal, and (c) no subcollection in the partition has total weight greater than some specified constant $L$. (If, for example, $S_1 = [\omega_4, \omega_9, \omega_{10}]$ then the weight total for $S_1$ is $\omega_4 + \omega_9 + \omega_{10}$.)  Why am I saying "collections" rather than sets? This would be a set partitioning problem were it not for the fact that $\Omega$ may contain duplicate entries and so is technically a multiset.

The term "variance" is slightly ambiguous (population variance or sample variance), but happily that will not matter here. Let $$s_j =\sum_{i : \omega_i \in S_j} \omega_i$$be the weight of subcollection $S_j$. Since we are talking about a partition here (every item in exactly one subcollection), $$\sum_{j=1}^m s_j = \sum_{i=1}^n \omega_i = n \bar{\omega}$$where $\bar{\omega}$ is the (constant) mean of all the item weights, and so the mean of the subcollection weights is $$\bar{s} = \frac{n\bar{\omega}}{m},$$which for a given number $m$ of subcollections is constant. The variance of the subcollection weights is $$\frac{1}{m}\sum_{j=1}^m (s_j - \bar{s})^2=\frac{1}{m}\sum_{j=1}^m s_j^2-\bar{s}^2.$$To minimize that, we can just minimize $\sum_j s_j^2$ in the optimization model, and then do the arithmetic to get the variance afterward.

As originally articulated, the problem has two objectives: minimize the variance of the subcollection weights and minimize the number of subcollections used. Normally, a multiobjective problem involves setting priorities or assigning weights to objectives or something along those lines. For the original question, where there are only $n=12$ elements in $\Omega$ (and thus at most $m=12$ sets in the partition), a plausible approach is to find a minimum variance solution for each possible value of $m$ and present the user with the efficient frontier of the solution space, letting the user determine the trade-off between partition size and variance that most appeals to them. For the particular example in the SE question, I was a bit surprised to find that the variance increased monotonically with the number of sets in the partition, as seen in the following table.

# of Subcollections   Minimum Variance

1 infeasible
2 infeasible
3 infeasible
4 infeasible
5 672
6 1048.22
7 1717.39
8 3824.75
9 7473.73
10 9025.80
11 9404.81
12 11667.06

So the five subcollection partition is minimal both in number of subcollections and weight variance. I would be surprised if that generalized.

The QP formulation uses two sets of variables. For $i=1,\dots,n$ and $j=1,\dots,m$, $x_{ij}\in \lbrace 0, 1\rbrace$ determines whether $\omega_i$ belongs to $S_j$ ($x_{ij}=1$) or not ($x_{ij}=0$). For $j=1,\dots,m$, $s_j \ge 0$ is the total weight of subcollection $j$. The objective function is simply $$\min_{x,s} \sum_{j=1}^m s_j^2$$which happily is convex. The first constraint enforces the requirement that the subcollections form a partition (every element being used exactly once): $$\sum_{j=1}^m x_{ij} = 1 \quad\forall i\in \lbrace 1,\dots,n\rbrace.$$For a partition to be valid, we do not want any empty sets, which leads to the following constraint:$$\sum_{i=1}^n x_{ij} \ge 1\quad\forall j\in\lbrace 1,\dots,m\rbrace.$$ The next constraint defines the weight variables: $$s_j = \sum_{i=1}^n \omega_i x_{ij}\quad\forall j\in\lbrace 1,\dots,m\rbrace.$$What about the limit $L$ on the weight of any subcollection? We can enforce that by limiting the domain of the $s$ variables to $s_j\in [0,L]$ for all $j$.

Finally, there is symmetry in the model. One source is duplicate items in $\Omega$ ($\omega_i = \omega_{i'}$ for some $i\neq i'$). That happens in the data of the SE question, but it's probably not worth dealing with in an example that small. Another source of symmetry is that, given any partition, you get another partition with identical variance by permuting the indices of the subcollections. That one is easily fixed by requiring the subcollections to be indexed in ascending (or at least nondecreasing) weight order:$$s_j \le s_{j+1}\quad\forall j\in\lbrace 1,\dots,m-1\rbrace.$$Is it worth bothering with? With the anti-symmetry constraint included, total run time for all 12 models was between one and two seconds on my PC. Without the anti-symmetry constraint, the same solutions for the same 12 models were found in a bit over 100 minutes.

If you would like to see my Java code (which requires CPLEX), you can find it at

Tuesday, March 24, 2020

Approximating Nonlinear Functions: Tangents v. Secants

In a (much) earlier post [1], I discussed using SOS2 constraints to approximate a nonlinear function in a mixed-integer linear program (MILP). In this post, and probably the next one or two, I'm going to expand on that theme a bit. (By "a bit" I mean "endlessly".)


To save you some reading, in case you end up not interested, I'll summarize the take-aways here.
  1. We can approximate nonlinear functions in a MILP model using piecewise-linear functions based on either tangents or secants. Piecewise-linear functions implicitly add either binary variables or SOS2 constraints (or maybe both).
  2. For convex functions only appearing in $\le$ constraints or concave functions only appearing in $\ge$ constraints, we can say that using tangents may produce superoptimal solutions (look optimal but are actually infeasible) but will not produce suboptimal solutions, while using secants may produce suboptimal solutions but will not produce infeasible solutions. In any other case, super- and sub-optimality are both risks.
  3. For the same two combinations in the previous paragraph, we have a third option: using a set of linear inequalities based on tangents, rather than a piecewise-linear function. Besides being somewhat easier to set up, this has a potential advantage: using a callback, we can refine the approximation on the fly (by adding additional constraints based on new tangents as they are needed).

Gory details

Suppose that we are working on a MILP, and we need to incorporate a nonlinear function of some of the variables. To keep things simple, I'll illustrate with functions $f:\mathbb{R}\rightarrow\mathbb{R}$ of a single variable, but the concepts presented will extend (not necessarily easily) to functions of multiple variables. In what follows, I will refer to the set of points $(x,f(x))$ as the surface of $f$. A tangent to the surface at $x$ is a line (more generally, hyperplane) that touches the surface at $x$ and has the same slope as the surface at that one point. A secant to the surface is a line (more generally, hyperplane) that intersects the surface at multiple points. For a function of one dimension, a chord is a segment of a secant between two consecutive intersection points. ("Consecutive" means that there are no other intersections of secant and surface between the endpoints of the chord.) I'm not sure if the term "chord" is used in higher dimensions, although the concept extends to higher dimensions.

Figure 1 shows a convex function together with a few tangents (in blue).

Figure 1: Tangents to convex function
Figure 2 shows the same function with a few secants (in green). One of the secants intersects the surface at $(-5,15)$ and $(0,0)$. The dashed line segment between those points is a chord.
Figure 2: Secants to convex function
Figures 3 and 4 do the same for a concave function.
Figure 3: Tangents to concave function

Figure 4: Secants to concave function

A useful property of convex and concave functions is that tangents of convex functions are always less than or equal to the functions and tangents of concave functions are always greater than or equal to the functions. Similar claims cannot be made for secants. A chord of a convex function will always be greater than or equal to the function, but other points on its secant can be less than the function. For instance, in Figure 2 the chord from $x=-5$ to $x=0$ lies above the function surface, but the secant to which it belongs falls below the function surface for $x<-5$ or $x>0$. It is worth noting that tangents to nonconvex functions do not stay on one side of the surface. Figure 5 shows a nonconvex function along with one tangent (in blue) and one secant (in green). Both the tangent and the secant are above the function surface some places and below it other places.
Figure 5: Nonconvex function

In what follows, when I talk about convexity of the feasible region, I will be referring to the feasible region of the continuous relaxation (all integrality restrictions relaxed). Convexity of that region is critical to algorithms using LP bounds. There are two cases of initial interest, because they preserve that convexity. If the function $f()$ is convex and appears only in constraints of the form $f(x)+\dots\le b$ (where $\dots$ is empty or contains only convex terms), or is concave and appears only in constraints of the form $f(x)+\dots\ge b$ (where $\dots$ is empty or contains only concave terms), and if the other constraints do nothing to violate convexity, then the feasible region remains convex when $f()$ is introduced. Under any other scenario, you lose convexity of the feasible region when you introduce $f()$. So for now we will only consider convex $f()$ in $\le$ constraints or concave $f()$ in $\ge$ constraints. If $f()$ is convex or concave but appears in an inequality constraint with the wrong direction, or appears in an equation constraint, or if $f()$ is neither convex nor concave, we will deal with it later (or never, whichever comes second).

For simplicity, suppose that $f()$ is convex. We introduce a new variable $z$ that is a surrogate for $f(x)$, changing each constraint of the form $f(x)+\dots\le b$ to $z+\dots\le b$. Now we have to tie $z$ to $f(x)$, but without using the explicit constraint $z=f(x)$ (due to that whole nonlinearity thing). We can approximate the relationship either with tangents or with secants. With secants, we pick an arbitrary collection of values of $x$, calculate the corresponding values of $f(x)$, define a piecewise-linear function $\hat{f}(x)$ based on those values, and set $z=\hat{f}(x)$. In Figure 2, $\hat{f}(x)$ consists of the four chords shown (with endpoints at $x\in\left\{ -10,-5,0,5,10\right\} $).

With tangents, we have a choice. Both options involve computing tangents at an arbitrary collection of values of $x$. In one version, we find where those tangents intersect and use the intersection points to define a piecewise-linear function. In Figure 1, the five tangents shown intersect at $\left\{ (-7.5,35),(-2.5,-5),(2.5,5),(7.5,65)\right\} $. We can add endpoints for where the first and last tangents exit the interval of interest (say $(-10,80)$ and $(10,120)$ for Figure 1). In the other version, we add the constraints $z\ge\ell_{j}(x)$ for $j=1,\dots,t$, where $t$ is the number of tangents computed and $\ell_{1}(x),\dots,\ell_{t}(x)$ are the linear functions corresponding to the tangents. In Figure 1, $t=5$ and for any $x$ we are constraining $z$ to be greater than or equal to the ordinates of all five dotted lines at that $x$. If $f()$ is concave and appears in the constraint $f(x)\ge b$, the only change is that the added constraints are $z\le\ell_{j}(x)$.

At this point, there are several things we can note.
  1. Calculating chords only requires the evaluation of $f(x)$ for various values of $x$ (plus some simple algebra). Calculating tangents requires computing the first derivative of $f(x)$ when $f()$ is smooth. (More generally, it involves calculating the subgradient of $f()$ at $x$.) It then requires calculating where the tangents intersect, if we are going to make a piecewise-linear function from the tangents. So if $f()$ is painful to work with, or if sorting out a lot of intersections does not appeal to you, you might prefer chords.
  2. Chords involve embedding a piecewise-linear function in a model. As discussed in [1], this can be done using type two special ordered sets (SOS2), which turns an LP into a MIP and turns a MIP into a somewhat more complicated MIP. In another post [2], I discussed a handy feature in the CPLEX APIs that allows for easy entry of a piecewise-linear function. (In the Java API, the method is IloCplexModeler.piecewiseLinear(), which in turn implements a method from the IloMPModeler  interface.) The API method still introduces either binary variables or SOS2 constraints "under the hood", but at least the user is spared the tedious business of computing them. With tangents, the first option described does the same thing, but the second option just involves adding a bunch of linear constraints. (In CPLEX, if you calculate a lot of tangents, you might consider adding the constraints as "lazy constraints".)
  3. The way the approximations affect the solution differs between tangents and secants.
    1. In Figures 1 and 3, it is easy to see that if we replace $f(x)$ with a finite set of tangents, every feasible point remains feasible, but some infeasible points appear feasible. For instance, take the constraint $f(x)\le38$ (which becomes $z\le38$) with the function in Figure 1, redrawn below in Figure 6. The horizontal dotted line (in red) shows the limit (38) imposed by the constraint. The vertical dotted line (in cyan) shows what happens when $x=-7.5$. The black point, $(-7.5,f(-7.5))$ is above the red dotted line, indicating that $x=-7.5$ is infeasible. The blue point, $(-7.5,35)$, is however feasible in the relaxed problem, since it is on or above all the selected tangents and below the constraint limit. So the solver might accept a solution containing $x=-7.5$.
    2. In Figures 2 and 4, it is likewise easy to see that if we replace $f(x)$ with a piecewise-linear function consisting of chords, solutions that are feasible in the relaxed problem will also be feasible in the original problem, but "optimal" solutions to the relaxed problem may actually be suboptimal. The secant approximation to our convex function in Figure 2 is redrawn below in Figure 7. Staying with the constraint $z\le38$, consider what happens at $x=-7$ (the dotted vertical line). The actual function value $f(-7)=35$ is below the constraint limit, so $x=-7$ satisfies the constraint. (This is the black point on the graph.) Barring exclusion due to other constraints, it would be feasible and might be part of an optimal solution. The green point $(-7,41)$, on the piecewise-linear function defined by the secants, is not feasible, however, and so the solver will reject any solution containing $x=-7$, including possibly the true optimum.
Figure 6: Infeasible point

Figure 7: Feasible point excluded

Since tangents may produce infeasible "solutions" and secants may produce suboptimal solutions, we may need to solve the problem, get a sense of where the optimal solution might be located, refine the approximation near there and repeat. Note that if we use tangents for convex or concave functions (which produces an "outer approximation" to the original problem) and the final solution is really feasible (using the actual $f()$), there is no need to refine further. Unfortunately, with secants, I see no obvious way to know that our approximation is "good enough". With functions that are neither convex nor concave, or with convex/concave functions in equations or inequalities going the "wrong" direction, tangents no longer provide an outer approximation (Figure 5), and we may need to refine and repeat.

This brings me to my last point, which pertains to the cases where tangents do provide an outer approximation and where we add all the tangents as constraints rather than turning them into a piecewise-linear function. With solvers, like CPLEX, that allow users to post new constraints during the branch-and-bound process, we can refine on the fly. Referring back to Figure 6, suppose that I am solving the problem using CPLEX, with a lazy constraint callback attached, and CPLEX identifies a solution containing $x=-7.5$. Inside the callback, I can compute $f(-7.5)$, observe that it is greater than the constraint cutoff of $38$ (and also greater than $z=35$), calculate a new tangent at the black point, and add it as a lazy constraint. This will make the proposed new incumbent infeasible. Figure 8 shows this, with the orange dashed line being the tangent at $x=-7.5$. After adding it to our set of tangents (and adding the constraint $z\ge\ell(x)$, where $\ell(x)$ is the linear function given by the new tangent), the blue point no longer gives a possible value for $z$. At $x=-7.5$, the maximum value of any tangent will be the value of $\ell(-7.5)$, and that will equal $f(-7.5)$ since we calculated the tangent there. (It's hard to see in the figure, but trust me, the orange line goes through the black point.) So the solver will update $z$ to the correct value and, in this case, reject the solution as violating the constraint $z\le38$.
Figure 8: Adding a tangent


[1] "Linearize That!"
[2] "Piecewise Linear Functions in CPLEX"

Monday, February 17, 2020

Reversing Differences

Fellow blogger Håkan Kjellerstrand posted an interesting question on OR Stack Exchange recently. Starting from a list of integers, it is trivial to compute the list of all pairwise absolute differences, but what about going in the other direction? Given the pairwise (absolute) differences, with duplicates removed, can you recover the source list (or a source list)?

We can view the source "list" as a vector $x\in\mathbb{Z}^n$ for some dimension $n$ (equal to the length of the list). With duplicates removed, we can view the differences as a set $D\subset \mathbb{Z}_+$. So the question has to do with recovering $x$ from $D$. Our first observation kills any chance of recovering the original list with certainty:
If $x$ produces difference set $D$, then for any $t\in\mathbb{R}$ the vector $x+t\cdot (1,\dots,1)'$ produces the same set $D$ of differences.
Translating all components of $x$ by a constant amount has no effect on the differences. So there will be an infinite number of solutions for a given difference set $D$. A reasonable approach (proposed by Håkan in his question) is to look for the shortest possible list, i.e., minimize $n$.

Next, observe that $0\in D$ if and only if two components of $x$ are identical. If $0\notin D$, we can assume that all components of $x$ are distinct. If $0\in D$, we can solve the problem for $D\backslash\{0\}$ and then duplicate any one component of the resulting vector $x$ to get a minimum dimension solution to the original problem.

Combining the assumption that $0\notin D$ and the observation about adding a constant having no effect, we can assume that the minimum element of $x$ is 1. That in turn implies that the maximum element of $x$ is $1+m$ where $m=\max(D)$.

From there, Håkan went on to solve a test problem using constraint programming (CP). Although I'm inclined to suspect that CP will be more efficient in general than an integer programming (IP) model, I went ahead and solved his test problem via an IP model (coded in Java and solved using CPLEX 12.10). CPLEX's solution pool feature found the same four solutions to Håkan's example that he did, in under 100 ms. How well the IP method scales is an open question, but it certainly works for modest size problems.

The IP model uses binary variables $z_1, \dots, z_{m+1}$ to decide which of the integers $1,\dots,m+1$ are included in the solution $x$. It also uses variables $w_{ij}\in [0,1]$ for all $i,j\in \{1,\dots,m+1\}$ such that $i \lt j$. The intent is that $w_{ij}=1$ if both $i$ and $j$ are included in the solution, and $w_{ij} = 0$ otherwise. We could declare the $w_{ij}$ to be binary, but we do not need to; constraints will force them to be $0$ or $1$.

The full IP model is as follows:

\[ \begin{array}{lrlrc} \min & \sum_{i=1}^{m+1}z_{i} & & & (1)\\ \textrm{s.t.} & w_{i,j} & \le z_{i} & \forall i,j\in\left\{ 1,\dots,m+1\right\} ,i\lt j & (2)\\ & w_{i,j} & \le z_{j} & \forall i,j\in\left\{ 1,\dots,m+1\right\} ,i\lt j & (3)\\ & w_{i,j} & \ge z_{i}+z_{j}-1 & \forall i,j\in\left\{ 1,\dots,m+1\right\} ,i\lt j & (4)\\ & w_{i,j} & =0 & \forall i,j\in\left\{ 1,\dots,m+1\right\} \textrm{ s.t. }(j-i)\notin D & (5)\\ & \sum_{i,j\in\left\{ 1,\dots,m+1\right\} |j-i=d}w_{i,j} & \ge 1 & \forall d\in D & (6)\\ & z_{1} & = 1 & & (7) \end{array} \]

The objective (1) minimizes the number of integers used. Constraints (2) through (4) enforce the rule that $w_{ij}=1$ if and only if both $z_i$ and $z_j$ are $1$ (i.e., if and only if both $i$ and $j$ are included in the solution).  Constraint (5) precludes the inclusion of any pair $i < j$ whose difference $j - i$ is not in $D$, while constraint (6) says that for each difference $d \in D$ we must include at least one pair $i < j$ for that produces that difference ($j - i = d$). Finally, since we assumed that our solution starts with minimum value $1$, constraint (7) ensures that $1$ is in the solution. (This constraint is redundant, but appears to help the solver a little, although I can't be sure given the short run times.)

My Java code is available from my repository (bring your own CPLEX).

Tuesday, February 11, 2020

Collections of CPLEX Variables

Recently, someone asked for help online regarding an optimization model they were building using the CPLEX Java API. The underlying problem had some sort of network structure with $N$ nodes, and a dynamic aspect (something going on in each of $T$ periods, relating to arc flows I think). Forget about solving the problem: the program was running out of memory and dying while building the model.

A major issue was that they allocated two $N\times N\times T$ arrays of variables, and $N$ and $T$ were big enough that $2N^2T$ was, to use a technical term, ginormous. Fortunately, the network was fairly sparse, and possibly not every time period was relevant for every arc. So by creating only the IloNumVar instances they needed (meaning only for arcs that actual exist in time periods that were actually relevant), they were able to get the model to build.

That's the motivation for today's post. We have a tendency to write mathematical models using vectors or matrices of variables. So, for instance, $x_i \, (i=1,\dots,n)$ might be an inventory level at each of $n$ locations, or $y_{i,j} \, (i=1,\dots,m; j=1,\dots,n)$ might be the inventory of item $i$ at location $j$. It's a natural way of expressing things mathematically. Not coincidentally, I think, CPLEX APIs provide structures for storing vectors or matrices of variables and for passing them into or out of functions. That makes it easy to fall into the trap of thinking that variables must be organized into vectors or matrices.

Last year I did a post ("Using Java Collections with CPLEX") about using what Java calls "collections" to manage CPLEX variables. This is not unique to Java. I know that C++ has similar memory structures, and I think they exist in other languages you might use with CPLEX. The solution to the memory issue I mentioned at the start was to create a Java container class for each combination of an arc that actually exists and time epoch for which it would have a variable, and then associate instances of that class with CPLEX variables. So if we call the new class AT (my shorthand for "arc-time"), I suggested the model owner use a Map<AT, IloNumVar> to associate each arc-time combination with the variable representing it and a Map<IloNumVar, AT> to hold the reverse association. The particular type of map is mostly a matter of taste. (I generally use HashMap.) During model building, they would create only the AT instances they actually need, then create a variable for each and pair them up in the first map. When getting a solution from CPLEX, they would get a value for each variable and then use the second map to figure out for which arc and time that value applied. (As a side note, if you use maps and then need the variables in vector form, you can apply the values() method to the first map (or the getKeySet() method to the second one), and then apply the toArray() method to that collection.)

Now you can certainly get a valid model using just arrays of variables, which was all that was available to me back in the Dark Ages when I used FORTRAN, but I think there are some benefits to using collections. Using arrays requires you to develop an indexing scheme for your variables. The indexing scheme tells you that the flow from node 3 to node 7 at time 4 will be occupy slot 17 in the master variable vector. Here are my reasons for avoiding that.
  • Done correctly, the indexing scheme is, in my opinion, a pain in the butt to manage. Finding the index for a particular variable while writing the code is time-consuming and has been known to kill brain cells.
  • It is easy to make mistakes while programming (calculate an index incorrectly).
  • Indexing invites the error of declaring an array or vector with one entry for each combination of component indices (that $N\times N\times T$ matrix above), without regard to whether you need all those slots. Doing so wastes time and space, and the space, as we saw, may be precious.
  • Creating slots that you do not need can lead to execution errors. Suppose that I allocating a vector IloNumVar x = new IloNumVar[20] and use 18 slots, omitting slots 0 and 13. If I solve the model and then call getValues(x), CPLEX will throw an exception, because I am asking for values of two variables (x[0] and x[13]) that do not exist. Even if I create variables for those two slots, the exception will occur, because those two variables will not belong to the model being solved. (There is a way to force CPLEX to include those variables in the model without using them, but it's one more pain in the butt to deal with.) I've lost count of how many times I've seen messages on the CPLEX help forums about exceptions that boiled down to "unextracted variables".
So my advice is to embrace collections when building models where variables do not have an obvious index scheme (with no skips).

Thursday, January 30, 2020

Generic Callback Changes in CPLEX 12.10

CPLEX 12.10 is out, and there have been a few changes to the new(ish) generic callbacks. Rather than go into them in detail (and likely screw something up), I'll just point you to the slides for a presentation by Daniel Junglas of IBM at the 2019 INFORMS Annual Meeting.

I've written about half a dozen posts about generic callbacks since IBM introduced them (which you can find by typing "generic callback" in the search widget on the blog). A couple of things have been added recently, and I thought I would mention them. The generic callback approach uses a single callback function that can be called from a variety of contexts, including when CPLEX solves a node relaxation ("RELAXATION" context), when if finds a candidate solution ("CANDIDATE" context) and, now, when it is ready to split a node into children ("BRANCHING" context).

The branching context is one of the new features. It brings back most of the functionality of the branch callback in the legacy callback system. Unfortunately, it does not seem to have the ability to attach user information to the child nodes, which was a feature that was occasionally useful in the legacy system. You can get more or less equivalent functionality by creating a data store (array, map, whatever) in your global memory and storing the node information keyed by the unique index number of each child node. The catch is that you are now responsible for memory management (freeing up space when a node is pruned and the associated information is no longer needed), and for dealing with thread synchronization issues.

Another new feature is that you can now inject a heuristic solution (if you have one) from all three of the contexts I mentioned above. CPLEX gives you a variety of options for how it will handle the injected solution: "NoCheck" (CPLEX will trust you that it is feasible); "CheckFeasible" (CPLEX will check feasibility and ignore the solution if it is not feasible); "Propagate" (Daniel's explanation: CPLEX will "propagate fixed variables and accept if feasible"); and "Solve" (CPLEX will solve a MIP problem with fixed variables and accept the result if feasible). I assume the latter two mean that you provide a partial solution, fixing some variables but not others. (Unfortunately I was unable to make it to Daniel's talk, so I'm speculating here.)

I'm not sure if those are the only new features, but they are the ones that are most relevant to me. I invite you to read through Daniel's slides to get a more complete picture, including both the reasons for switching from legacy callbacks to generic callbacks and some of the technical issues in using them.

Tuesday, January 7, 2020

Greedy Methods Can Be Exact

We generally sort optimization algorithms (as opposed to models) into two or three categories, based on how certain we are that solutions will be either optimal or at least "good". An answer by Michael Feldmeier to a question I posted on OR Stack Exchange neatly summarizes the categories:
  • exact methods eventually cough up provably optimal solutions;
  • approximate methods eventually cough up solutions with some (meaningful) guarantee regarding how far from optimal they might be; and
  • heuristics provide no worst-case guarantees (but generally are either easy to implement, fast to execute or both).
I should explain my use of "meaningful" (which is not part of Michael's answer). A common way to estimate the "gap" between a solution and the optimum is to take $|z - \tilde{z}|/|z|$, where $z$ is the objective value of the solution produced by the algorithm and $\tilde{z}$ is some bound (lower bound in a minimization, upper bound in a maximization) of the optimal solution. Now suppose that we are minimizing a function known to be nonnegative. If we set $\tilde{s}=0$, we know that any method, no matter how stupid, will have a gap no worse than 100%. To me, that is not a meaningful guarantee. So I'll leave the definition of "meaningful" to the reader.

What brings all this to mind is a question posted on Mathematics Stack Exchange. The author of the question was trying to solve a nonlinear integer program. He approached it by applying a "greedy algorithm". Greedy algorithms are generally assumed to be heuristics, since it seldom is possible to provide useful guarantees on performance. In his case, though, the greedy algorithm is provably optimal, mainly due to the objective function being concave and separable. I'll state the problem and show a proof of optimality below (changing the original notation a bit). Brace yourself: the proof is a bit long-winded.

You start with $N$ workers to be assigned to $M$ work stations. The output of workstation $m$, as a function of the number of workers $x$ assigned to it, is given by $$f_{m}(x)=a_{m}x+b_{m}-\frac{c_{m}}{x},$$
where $a_{m},b_{m},c_{m}$ are all positive constants. Since $f(0)=-\infty$, we can assume that each work station gets at least one worker (and, consequently, that $N>M$). Since $f_{m}'(x)=a_{m}+c_{m}/x^{2}>0$, each $f_{m}()$ is monotonically increasing. Thus, we can safely assume that all $N$ workers will be assigned somewhere. $f_{m}''(x)=-2c_{m}/x^{3}<0$, so $f_{m}()$ is strictly concave (which we will need later). We also note, for future reference, that the impact of adding one worker to a current staff of $x$ at station $m$ is $$\Delta f_{m}(x)=a_{m}+\frac{c_{m}}{x(x+1)}>0.$$
Similarly, the impact of removing one worker at station $m$ is $$\delta f_{m}(x)=-a_{m}-\frac{c_{m}}{x(x-1)}<0.$$We see that $\delta f_{m}(x)$ is an increasing function of $x$ (i.e., it gets less negative as $x$ gets bigger). We also note that $\Delta f_{m}(x)=-\delta f_{m}(x+1)$.

The IP model is easy to state. Let $x_{m}$ be the number of workers assigned to work station $m$. The model is
subject to
$$\sum_{m=1}^{M}x_{m}\le N$$

The greedy algorithm starts with a single worker at each station ($x=(1,\dots,1)$) and, at each step, adds one worker to the workstation where that worker produces the greatest increase in objective value (breaking ties arbitrarily). It stops when all $N$ workers are assigned. To prove that it actually finds an optimal solution, I'll use proof by contradiction.

Let $x^{(0)},x^{(1)},\dots,x^{(N-M)}$ be the sequence of solutions constructed by the greedy algorithm, with $x^{(0)}=(1,\dots,1)$, and let $x^{(k)}$ be the last solution in the sequence for which an optimal solution $x^{*}$ exists such that $x^{(k)}\le x^{*}$. The significance of the inequality is that if $x\le x^{*}$, it is possible to extend the partial solution $x$ to the optimal solution $x^{*}$ by adding unassigned workers to work stations. We know that $k$ is well defined because $x^{(0)}\le x^{*}$ for any optimal $x^{*}$. Since we are assuming that the greedy algorithm does not find an optimum, it must be that $k<N-M$.

Now identify the work station $j$ to which the greedy algorithm added a worker at step $k$, meaning that $x_{j}^{(k+1)}=x_{j}^{(k)}+1$ and $x_{i}^{(k+1)}=x_{i}^{(k)}$ for $i\neq j$. Since, by assumption, $x^{(k)}\le x^{*}$ but $x^{(k+1)}\not\le x^{*}$, it must be that $x_{j}^{(k)}=x_{j}^{*}$.

Next, since $x^{(k)}\le x^{*}$ and $x^{(k)}\neq x^{*}$ (else $x^{(k)}$ would be optimal), there is some work station $h\neq j$ such that $x_{h}^{(k)}<x_{h}^{*}$. Let $\tilde{x}$ be the solution obtained from $x^{(k)}$ by adding a worker to station $h$: $\tilde{x}_{h}=x_{h}^{(k)}+1$ and $\tilde{x}_{i}=x_{i}^{(k)}$ for $i\neq h$. Observe that $\tilde{x}\le x^{*}$. The greedy algorithm chose work station $j$ over work station $h$ at $x^{(k)}$, so it must be that
$$\Delta f_{j}(x_{j}^{(k)})\ge\Delta f_{h}(x_{h}^{(k)}). \quad (1)$$

Finally, let $\hat{x}$ be the result of starting from optimal solution $x^{*}$ and shifting one worker from station $h$ to station $j$. Since
$$x_{j}^{(k+1)}=x_{j}^{(k)}+1=x_{j}^{*}+1=\hat{x}_{j},$$ $$x_{h}^{(k+1)}=x_{h}^{(k)}<x_{h}^{*}\implies x_{h}^{(k+1)}\le\hat{x}_{h}$$
$$x_{i}^{(k+1)}=x_{i}^{(k)}\le x_{i}^{*}=\hat{x}_{i}\,\forall i\notin\{h,j\},$$
we have $x^{(k+1)}\le\hat{x}$. Under the assumption that $x^{(k)}$ was the last solution in the greedy sequence that could be extended to an optimal solution, it must be that $\hat{x}$ is not optimal. Thus the net change to the objective function at $x^{*}$ when shifting one worker from station $h$ to station $j$ must be negative, i.e.,
$$\Delta f_{j}(x_{j}^{*})+\delta f_{h}(x_{h}^{*})<0.\quad (2)$$

We showed previously that, under our assumptions, $x_{j}^{(k)}=x_{j}^{*}$, from which it follows that
$$\Delta f_{j}(x_{j}^{*})=\Delta f_{j}(x_{j}^{(k)}). \quad (3)$$
We also showed that $\delta f_{h}()$ is an increasing function. Since $\tilde{x}_{h}\le x_{h}^{*}$,
$$\delta f_{h}(x_{h}^{*})\ge\delta f_{h}(\tilde{x}_{h})=-\Delta f_{h}(x_{h}^{(k)}). \quad (4)$$
Combining (4) with (2), we have
$$\Delta f_{j}(x_{j}^{*})-\Delta f_{h}(x_{h}^{(k)})<0,$$
$$\Delta f_{j}(x_{j}^{*})<\Delta f_{h}(x_{h}^{(k)}). \quad (5)$$

Combining (3) with (5) yields
$$\Delta f_{j}(x_{j}^{(k)})<\Delta f_{h}(x_{h}^{(k)})$$
which contradicts (1).