Monday, February 17, 2020

Reversing Differences

Fellow blogger Håkan Kjellerstrand posted an interesting question on OR Stack Exchange recently. Starting from a list of integers, it is trivial to compute the list of all pairwise absolute differences, but what about going in the other direction? Given the pairwise (absolute) differences, with duplicates removed, can you recover the source list (or a source list)?

We can view the source "list" as a vector $x\in\mathbb{Z}^n$ for some dimension $n$ (equal to the length of the list). With duplicates removed, we can view the differences as a set $D\subset \mathbb{Z}_+$. So the question has to do with recovering $x$ from $D$. Our first observation kills any chance of recovering the original list with certainty:
If $x$ produces difference set $D$, then for any $t\in\mathbb{R}$ the vector $x+t\cdot (1,\dots,1)'$ produces the same set $D$ of differences.
Translating all components of $x$ by a constant amount has no effect on the differences. So there will be an infinite number of solutions for a given difference set $D$. A reasonable approach (proposed by Håkan in his question) is to look for the shortest possible list, i.e., minimize $n$.

Next, observe that $0\in D$ if and only if two components of $x$ are identical. If $0\notin D$, we can assume that all components of $x$ are distinct. If $0\in D$, we can solve the problem for $D\backslash\{0\}$ and then duplicate any one component of the resulting vector $x$ to get a minimum dimension solution to the original problem.

Combining the assumption that $0\notin D$ and the observation about adding a constant having no effect, we can assume that the minimum element of $x$ is 1. That in turn implies that the maximum element of $x$ is $1+m$ where $m=\max(D)$.

From there, Håkan went on to solve a test problem using constraint programming (CP). Although I'm inclined to suspect that CP will be more efficient in general than an integer programming (IP) model, I went ahead and solved his test problem via an IP model (coded in Java and solved using CPLEX 12.10). CPLEX's solution pool feature found the same four solutions to Håkan's example that he did, in under 100 ms. How well the IP method scales is an open question, but it certainly works for modest size problems.

The IP model uses binary variables $z_1, \dots, z_{m+1}$ to decide which of the integers $1,\dots,m+1$ are included in the solution $x$. It also uses variables $w_{ij}\in [0,1]$ for all $i,j\in \{1,\dots,m+1\}$ such that $i \lt j$. The intent is that $w_{ij}=1$ if both $i$ and $j$ are included in the solution, and $w_{ij} = 0$ otherwise. We could declare the $w_{ij}$ to be binary, but we do not need to; constraints will force them to be $0$ or $1$.

The full IP model is as follows:

\[ \begin{array}{lrlrc} \min & \sum_{i=1}^{m+1}z_{i} & & & (1)\\ \textrm{s.t.} & w_{i,j} & \le z_{i} & \forall i,j\in\left\{ 1,\dots,m+1\right\} ,i\lt j & (2)\\ & w_{i,j} & \le z_{j} & \forall i,j\in\left\{ 1,\dots,m+1\right\} ,i\lt j & (3)\\ & w_{i,j} & \ge z_{i}+z_{j}-1 & \forall i,j\in\left\{ 1,\dots,m+1\right\} ,i\lt j & (4)\\ & w_{i,j} & =0 & \forall i,j\in\left\{ 1,\dots,m+1\right\} \textrm{ s.t. }(j-i)\notin D & (5)\\ & \sum_{i,j\in\left\{ 1,\dots,m+1\right\} |j-i=d}w_{i,j} & \ge 1 & \forall d\in D & (6)\\ & z_{1} & = 1 & & (7) \end{array} \]

The objective (1) minimizes the number of integers used. Constraints (2) through (4) enforce the rule that $w_{ij}=1$ if and only if both $z_i$ and $z_j$ are $1$ (i.e., if and only if both $i$ and $j$ are included in the solution).  Constraint (5) precludes the inclusion of any pair $i < j$ whose difference $j - i$ is not in $D$, while constraint (6) says that for each difference $d \in D$ we must include at least one pair $i < j$ for that produces that difference ($j - i = d$). Finally, since we assumed that our solution starts with minimum value $1$, constraint (7) ensures that $1$ is in the solution. (This constraint is redundant, but appears to help the solver a little, although I can't be sure given the short run times.)

My Java code is available from my repository (bring your own CPLEX).

Tuesday, February 11, 2020

Collections of CPLEX Variables

Recently, someone asked for help online regarding an optimization model they were building using the CPLEX Java API. The underlying problem had some sort of network structure with $N$ nodes, and a dynamic aspect (something going on in each of $T$ periods, relating to arc flows I think). Forget about solving the problem: the program was running out of memory and dying while building the model.

A major issue was that they allocated two $N\times N\times T$ arrays of variables, and $N$ and $T$ were big enough that $2N^2T$ was, to use a technical term, ginormous. Fortunately, the network was fairly sparse, and possibly not every time period was relevant for every arc. So by creating only the IloNumVar instances they needed (meaning only for arcs that actual exist in time periods that were actually relevant), they were able to get the model to build.

That's the motivation for today's post. We have a tendency to write mathematical models using vectors or matrices of variables. So, for instance, $x_i \, (i=1,\dots,n)$ might be an inventory level at each of $n$ locations, or $y_{i,j} \, (i=1,\dots,m; j=1,\dots,n)$ might be the inventory of item $i$ at location $j$. It's a natural way of expressing things mathematically. Not coincidentally, I think, CPLEX APIs provide structures for storing vectors or matrices of variables and for passing them into or out of functions. That makes it easy to fall into the trap of thinking that variables must be organized into vectors or matrices.

Last year I did a post ("Using Java Collections with CPLEX") about using what Java calls "collections" to manage CPLEX variables. This is not unique to Java. I know that C++ has similar memory structures, and I think they exist in other languages you might use with CPLEX. The solution to the memory issue I mentioned at the start was to create a Java container class for each combination of an arc that actually exists and time epoch for which it would have a variable, and then associate instances of that class with CPLEX variables. So if we call the new class AT (my shorthand for "arc-time"), I suggested the model owner use a Map<AT, IloNumVar> to associate each arc-time combination with the variable representing it and a Map<IloNumVar, AT> to hold the reverse association. The particular type of map is mostly a matter of taste. (I generally use HashMap.) During model building, they would create only the AT instances they actually need, then create a variable for each and pair them up in the first map. When getting a solution from CPLEX, they would get a value for each variable and then use the second map to figure out for which arc and time that value applied. (As a side note, if you use maps and then need the variables in vector form, you can apply the values() method to the first map (or the getKeySet() method to the second one), and then apply the toArray() method to that collection.)

Now you can certainly get a valid model using just arrays of variables, which was all that was available to me back in the Dark Ages when I used FORTRAN, but I think there are some benefits to using collections. Using arrays requires you to develop an indexing scheme for your variables. The indexing scheme tells you that the flow from node 3 to node 7 at time 4 will be occupy slot 17 in the master variable vector. Here are my reasons for avoiding that.
  • Done correctly, the indexing scheme is, in my opinion, a pain in the butt to manage. Finding the index for a particular variable while writing the code is time-consuming and has been known to kill brain cells.
  • It is easy to make mistakes while programming (calculate an index incorrectly).
  • Indexing invites the error of declaring an array or vector with one entry for each combination of component indices (that $N\times N\times T$ matrix above), without regard to whether you need all those slots. Doing so wastes time and space, and the space, as we saw, may be precious.
  • Creating slots that you do not need can lead to execution errors. Suppose that I allocating a vector IloNumVar x = new IloNumVar[20] and use 18 slots, omitting slots 0 and 13. If I solve the model and then call getValues(x), CPLEX will throw an exception, because I am asking for values of two variables (x[0] and x[13]) that do not exist. Even if I create variables for those two slots, the exception will occur, because those two variables will not belong to the model being solved. (There is a way to force CPLEX to include those variables in the model without using them, but it's one more pain in the butt to deal with.) I've lost count of how many times I've seen messages on the CPLEX help forums about exceptions that boiled down to "unextracted variables".
So my advice is to embrace collections when building models where variables do not have an obvious index scheme (with no skips).