Saturday, December 2, 2017

Creating A New MIME Type

I struggled a bit this afternoon creating a new MIME type and associating it with a particular application, so I'm going to archive the solution here for future reference. This was on a Linux Mint system, but I found the key information in a GNOME documentation page, so I suspect it works for Ubuntu and other systems using a GNOME-based desktop (and maybe even more generally?).

The application in question is FreeFileSync, an open-source program (that I definitely recommend) for syncing directories. I use FFS to mirror certain directories in my home partition to an external drive. For each such directory, I have an FFS configuration, created in the FFS GUI and stored in a separate XML file (with extension .ffs_gui).

Unlike applications installed from .deb files (where the installer handles the MIME-type stuff for you), FFS comes as an archive that you simply extract and park somewhere, so it does not create its own MIME-type or associate itself with an existing one. On my PC, Mint (with the MATE desktop environment) associated the extension .ffs_gui with XML files in general, so making FFS the default program for .ffs_gui files would have made it the default for all XML files, which is clearly undesirable. So I decided to create a MIME type for it. I also decided to store the necessary information in my home partition, rather than changing system files, so that a future system upgrade would not clobber the changes.

Google searches produced a variety of (rather complicated) solutions, one or two of which I tried but failed to get to work. Fortunately, the steps on the GNOME page mostly did work. First, I created ~/.local/share/mime/packages/application-x-freefilesync.xml as instructed, naming the new MIME type application/x-freefilesync and setting the glob pattern to *.ffs_gui. Second, I created a new desktop file (~/.local/share/applications/freefilesync.desktop) using the new MIME type and pointing to the FFS executable. Finally, I ran

update-mime-database ~/.local/share/mime

and

update-desktop-database ~/.local/share/applications

(as myself, not as root).
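
For reference, here is a minimal sketch of what the two files might contain. The MIME type name and glob pattern are the ones described above; the Exec path and the descriptive strings are placeholders that you would adjust to match your own installation.

~/.local/share/mime/packages/application-x-freefilesync.xml:

<?xml version="1.0" encoding="UTF-8"?>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
  <mime-type type="application/x-freefilesync">
    <comment>FreeFileSync configuration</comment>
    <glob pattern="*.ffs_gui"/>
  </mime-type>
</mime-info>

~/.local/share/applications/freefilesync.desktop:

[Desktop Entry]
Type=Application
Name=FreeFileSync
Comment=Folder comparison and synchronization
Exec=/path/to/FreeFileSync %f
MimeType=application/x-freefilesync;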

The GNOME page suggests using the gio program to verify the results, but I don't have that installed on my system. What did work was to open a terminal in the directory where the FFS configuration files were stored, pick one (say, foo.ffs_gui) and run

xdg-mime query filetype foo.ffs_gui

to query its MIME-type, confirming that the result was application/x-freefilesync.

After that, I right-clicked foo.ffs_gui in the file manager (Caja, on my system), chose Properties > Open With, added FFS and made it the default for all .ffs_gui files, and that was that.

Monday, November 13, 2017

Benders Decomposition with Generic Callbacks

Brace yourself. This post is a bit long-winded (and arguably geekier than usual, which is saying something). Also, it involves CPLEX 12.8, which will not ship until some time next month.

I have an updated version of an old example, solving a fixed charge transportation problem using Benders decomposition. The example (using Java, naturally) is scattered across several earlier posts, from 2012, 2014 and 2016.
Now here I am breaking the even year pattern by doing yet another version in 2017. If you want to see the MIP model written out in mathematical notation, it's in the 2012 post. If you want to see the code for the most recent version, which contains both a "manual" decomposition (using callbacks) and a few versions of the new (in 12.7) automatic decomposition feature, there's a link to the repository in the 2016 post. At this point in time, the 2014 post is not worth reading.

Before I get into the updated example code, let me outline some key differences between the old ("legacy") callback system and the new ("generic") callback system.

When is a callback called by CPLEX?
  • Legacy: This is dictated by the type of callback. A branch callback is called when CPLEX is ready to split a node into children, a lazy constraint callback is called when CPLEX identifies what it thinks is a new integer-feasible solution, and so on.
  • Generic: When you attach your callback to the model (IloCplex instance), you specify via a bit mask in which of several "contexts" it should be called. As of this writing (things do have a habit of changing), the contexts are Candidate (a new integer-feasible candidate for incumbent solution has been found), GlobalProgress (progress has occurred at the "global" level -- presumably something you might see in the node log), LocalProgress (a thread has made what it thinks is progress, but has not yet passed it back to the controller), Relaxation (CPLEX has found a solution that might not be integer-feasible, such as the solution to a node LP), ThreadDown (CPLEX deactivated a thread) and ThreadUp (CPLEX activated a thread).
If you are fuzzy about the difference between local and global progress, one example is a thread that finds a new incumbent solution, one that is better than the incumbent that existed when the thread was launched, but not as good as the current incumbent (recently found by a different thread). That is local progress (in the first thread) but not global progress (the incumbent found by the first thread will be ignored when it gets around to reporting it to the controller).
Attaching callbacks to models:
  • Legacy: You attach separate callbacks of each desired type (info callback, branch callback, user cut callback, ...).
  • Generic: You attach a single callback that handles all supported operations. Attaching a second callback automatically detaches the first one. When attaching the callback, you specify via a bit mask from which context(s) the callback is to be called.
Callback classes:
  • Legacy: Your class extends one of the existing callbacks. For instance, if you want a lazy constraint callback, you extend the IloCplex.LazyConstraintCallback class and override the abstract main() method.
  • Generic: Your class implements the IloCplex.Callback.Function interface and supplies its invoke() method. (A minimal skeleton appears below, after these comparisons.)
Calling functions inside the callback:
  • Legacy: You call one of the methods for the class you extended. For example, in a branch callback (extending IloCplex.BranchCallback), you would call one of the overloads of makeBranch() to create a new branch.
  • Generic: When CPLEX calls the invoke() method, it passes in as an argument an instance of IloCplex.Callback.Context. This object has a number of methods you can invoke to find out the context, access a candidate solution (if one exists), get information about what is going on (including the index of the thread in which it was invoked, and the total number of threads in use), etc. Basically, this is your key to the castle.
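
To make the mechanics concrete, here is a minimal, hypothetical skeleton of a generic callback in Java. The class name, the comments and the particular contexts chosen are placeholders (this is not code from my example); the Function interface, the Context.Id constants and the use() call are the pieces of the 12.8 API described above.

import ilog.concert.IloException;
import ilog.cplex.IloCplex;

public class SkeletonCallback implements IloCplex.Callback.Function {
  @Override
  public void invoke(final IloCplex.Callback.Context context) throws IloException {
    long id = context.getId();  // which context triggered this call
    if (id == IloCplex.Callback.Context.Id.Candidate) {
      // examine (and possibly reject) the candidate solution here
    } else if (id == IloCplex.Callback.Context.Id.ThreadUp) {
      // per-thread setup
    } else if (id == IloCplex.Callback.Context.Id.ThreadDown) {
      // per-thread cleanup
    }
  }
}

// Attaching it (this replaces any previously attached generic callback):
//   cplex.use(new SkeletonCallback(),
//             IloCplex.Callback.Context.Id.Candidate
//             | IloCplex.Callback.Context.Id.ThreadUp
//             | IloCplex.Callback.Context.Id.ThreadDown);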

Benders decomposition


For Benders decomposition, we will be focused on the Candidate context (CPLEX thinks it has an integer-feasible solution, which we need to test) and the ThreadUp and ThreadDown contexts, for reasons I'll eventually get to.

I wrote about thread safety in my previous post. Let me pin down why this is critical when using Benders decomposition. With Benders, we have a master problem and one or more subproblems. (I'll stick with one subproblem here, for concreteness.) The master problem model itself is never modified. Benders cuts are added to the working copy of the master problem, from the callback. Several threads could be generating Benders cuts more or less simultaneously, but it is the responsibility of CPLEX (not your code) to make sure that multiple threads adding cuts don't trip over each other.

On the other hand, the way we generate cuts is to use the proposed incumbent solution to modify the constraint limits in the subproblem (or the subproblem objective, if you're one of those people who prefers to work with the dual of the actual subproblem). It's our code (specifically, the callback) making those changes, and if changes are being made concurrently by multiple threads, bad things can happen. Just to verify this is not hypothetical, I converted my original Benders code to the new callback system and ran it without doing anything special regarding thread safety. It crashed the JVM in a matter of seconds. Lesson learned.

So what are our options? One is to use a single copy of the subproblem and impose a locking mechanism that prevents more than one thread from accessing the subproblem at a time. That turns out to be the easiest to program in Java (at least in my opinion), but it slows the solver down due to threads being blocked. What seems to be the general wisdom for using locks is that you should only use them on chunks of code that execute quickly. Modifying what might be a large LP model, solving it, and finding a dual ray or Farkas certificate if it is infeasible or the dual solution if it is feasible does not meet the "execute quickly" criterion.

Another option is to generate an independent copy of the subproblem each time a thread needs it. That seems wastefully slow to me. A third, more promising approach is to generate one copy of the subproblem per thread and store it, reusing it each time the callback is called in that thread. This again eats CPU cycles generating the subproblem multiple times, but once per thread is a lot less than once each time CPLEX calls the callback. A downside, however, is that this could consume a prohibitive amount of memory if the subproblem is quite large.

The revised Benders example


The latest version of my Benders example is available at https://gitlab.msu.edu/orobworld/BendersExample3. In addition to my code (and Java), you will need CPLEX 12.8 (which I am told is coming out in December) and the open-source Apache Commons CLI library (for processing command line options).

I've added some command line options to let you adjust a number of things without having to edit or comment out code: which models to run; what random seed to use; how many trials to run (default 1); whether to show CPLEX log output for neither problem, just the master, or both master and subproblem; whether to show the formulas of the cuts generated (or just settle for messages telling you when a cut is added); and whether to print the full solution (which warehouses are used, what the shipment volumes are). Run the program with "-h" or "--help" on the command line to get the full list of options.

The new code has two versions of the "manual" (callback-based) Benders method. The "-m" command line option runs the default "manual" model, while the "-m2" option runs what I inelegantly named the "thread-local manual" model. The difference is in the approach taken to assure thread safety. I'll briefly describe that below, and of course you can pore over the code if you want more details.

Locking


The default manual model avoids collisions between threads by locking parts of the code so that one thread cannot execute certain portions when any other thread is executing certain portions. There are multiple ways to do this in Java, but a simple one (which I use here) is to synchronize methods. To do this, all you have to do is add the keyword synchronized to the method declarations. (In addition to synchronizing methods, you can synchronize smaller chunks of code, but I did not use that feature.)

My ManualBenders class has a private inner class named BendersCallback. The callback is entered only from the Candidate context. Its sole method is the invoke() method mandated by the IloCplex.Callback.Function interface described earlier. By declaring that method

public synchronized void invoke(final IloCplex.Callback.Context context)

I ensure that no thread can invoke the callback while another thread is in the callback. The good news is that it works (prevents threads from interfering with each other) and it's incredibly easy to do. The bad news is that it results in threads blocking other threads. The subproblem in the example is pretty small. If it were slow to solve (or large enough to make updating bounds meaningfully slower), performance would be even worse.

Thread-specific copies of the subproblem


The "thread-local" model avoids thread collisions by giving each thread a separate copy of the subproblem to play with. I created a class (cleverly named Subproblem) to hold instances of the subproblem. The callback (instance of class BendersCallback2) is now entered from three different contexts, ThreadUp, ThreadDown and Candidate. As before, in the Candidate context we do the usual things: use the candidate solution to the master problem to modify constraint limits in the subproblem; solve the subproblem; depending on whether it is feasible, and what its optimal objective value is, either accept the new incumbent or reject it by adding a cut. We use calls in the ThreadUp context to create and store a copy of the subproblem for the newly activated thread, and similarly we use calls in the ThreadDown context to delete the local subproblem copy and recover its memory.

So where do we store subproblem copies so that only the thread invoking the callback can see its designated copy? In their BendersATSP2.java example code, IBM creates a vector, of length equal to the maximum number of threads that will be used, to store subproblems. The context argument to the callback provides a method, context.getIntInfo(IloCplex.Callback.Context.Info.ThreadId), that returns the zero-based index number of the thread invoking the callback, which is used to pluck the correct copy of the subproblem from the vector of copies.

I took a slightly different approach in my code, using a generic Java class named ThreadLocal that exists precisely to hold objects local to a given thread. My BendersCallback2 class contains a private field declared

private final ThreadLocal<Subproblem> subproblem

to hold the copy of the subproblem belonging to a given thread. When the callback is called from the ThreadUp context, subproblem.set() is used to store a newly constructed copy of the subproblem; when called from the Candidate context, subproblem.get() is used to access the copy of the subproblem belonging to the thread. (Since this is all rather new, both to CPLEX and to me, and since I have a suspicious nature, I stuck some extra code in the example to ensure that subproblem is empty when called from ThreadUp and not empty when called from Candidate or ThreadDown. There are comments indicating the superfluous code. Feel free to omit it from your own projects.)

I thought that when a thread was deleted, the subproblem variable would be deleted. Since subproblem is (I think) the only pointer to the Subproblem instance for that thread, I expected the memory used by the Subproblem to be recovered during garbage collection. Apparently I was wrong: my initial attempt leaked memory (enough to shut down the JVM after solving a few problems). So I added an invocation of the callback from the ThreadDown context (thread deactivation), at which point the callback deletes the Subproblem instance. That seems to have cured the memory leak. So, in summary, the callback is called in three contexts:
  • ThreadUp -- create a copy of the subproblem for the thread;
  • Candidate -- test a candidate incumbent and, if necessary, generate a Benders cut; and
  • ThreadDown -- delete the copy of the subproblem belonging to that thread.
I'm comfortable with this approach, but you might prefer the subproblem vector used in the BendersATSP2.java example.
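
Stripped of logging and the defensive checks mentioned above, the heart of the thread-local version looks roughly like the following sketch. The Subproblem methods (update(), solveAndCheck(), makeCut(), end()) are placeholders for the actual code in the repository, and masterVars stands for whatever master problem variables are needed to extract the candidate solution and build cuts.

import ilog.concert.IloException;
import ilog.concert.IloNumVar;
import ilog.cplex.IloCplex;

public class SketchCallback implements IloCplex.Callback.Function {
  private final ThreadLocal<Subproblem> subproblem = new ThreadLocal<>();
  private final IloNumVar[] masterVars;  // master variables appearing in cuts

  public SketchCallback(final IloNumVar[] masterVars) {
    this.masterVars = masterVars;
  }

  @Override
  public void invoke(final IloCplex.Callback.Context context) throws IloException {
    long id = context.getId();
    if (id == IloCplex.Callback.Context.Id.ThreadUp) {
      subproblem.set(new Subproblem());            // build this thread's copy
    } else if (id == IloCplex.Callback.Context.Id.ThreadDown) {
      subproblem.get().end();                      // release the copy's memory
      subproblem.remove();
    } else if (id == IloCplex.Callback.Context.Id.Candidate) {
      double[] x = context.getCandidatePoint(masterVars);  // proposed incumbent
      Subproblem sp = subproblem.get();            // this thread's copy
      sp.update(x);                                // adjust constraint limits
      if (!sp.solveAndCheck(x)) {
        context.rejectCandidate(sp.makeCut());     // cut off the candidate
      }
      // accepting the candidate requires no action
    }
  }
}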

Performance Comparison


I ran five solution methods (Benders with locking, Benders with copies of the subproblem for each thread, and three versions of the automatic Benders decomposition added to CPLEX 12.7) on 10 instances of the fixed charge transportation problem. That's a small sample, based on a single type of problem, on a single machine (four cores, 4 GB of RAM), so I wouldn't read too much into the results. Nonetheless, I thought I would close by comparing solution times. The three automatic methods provided by CPLEX ("annotated", "automatic" and "workers", described in my 2016 post) had virtually indistinguishable times, so to make the plot more reasonable I am including only the "automatic" method (labeled "Auto"), along with my two approaches ("Local" for the method using copies of the subproblem for each thread and "Locked" for the method using synchronization).

The plot below shows, as a function of wall clock time, the fraction of trials that reached proven optimality by the specified time. The automatic approach and my manual approach with thread-local subproblems were competitive (with the automatic approach having a slight advantage), while the manual approach with synchronized class methods lagged rather badly. Although I cannot say with certainty, I am rather confident that is a result of threads engaged in solving a subproblem blocking other threads wanting to do the same.

performance plot (completion rate versus run time, by method)

Finally, just to drive home the realities of multiple threads, here is a sample of the output generated by the synchronized (locked) approach during one run.

>>> Adding optimality cut
>>> Accepting new incumbent with value 3598.1162756564636
>>> Accepting new incumbent with value 3598.1162756564636
>>> Accepting new incumbent with value 4415.750134379851
>>> Accepting new incumbent with value 4415.750134379851
>>> Accepting new incumbent with value 4411.334611014113
>>> Accepting new incumbent with value 4411.334611014113
>>> Accepting new incumbent with value 4700.008131082763
>>> Accepting new incumbent with value 4700.008131082763
>>> Adding optimality cut
>>> Adding optimality cut
>>> Adding optimality cut
Final solver status = Optimal
# of warehouses used = 14
Total cost = 3598.1162756564636
Solution time (seconds) = 143.090

The objective here is minimization. The messages beginning ">>>" are printed by the callback. Notice that after the optimal solution is found and accepted (in two different threads?), provably suboptimal solutions (solutions with inferior objective values) appear to be accepted several times. This also happens with the thread-local approach. It may well be happening in the various automated Benders approaches, but since they do not print progress messages there is no way to tell. This illustrates the difference between "local" and "global" progress. It happens because the various worker threads do not communicate with each other, and communicate only sporadically with the controller thread. So if thread 3 finds the optimal solution and has it accepted, thread 1 does not know this and continues cranking through its portion of the search tree, finding new "incumbents" that are improvements over the best solution thread 1 has seen but not necessarily improvements over the best global solution. The inferior solutions are accepted in the threads in which they were found, but the controller knows not to accept them as incumbents (because it already has a better incumbent). This behavior is not an error; it's just another indication why adding threads does not improve performance proportionally.

Saturday, November 4, 2017

Thread Safety

As I noted in yesterday's post, one of the major changes associated with the new "generic" callback structure in CPLEX is that users now bear the responsibility of making their callbacks thread-safe. As I also noted yesterday, this is pretty new stuff for me. So I'm going to try to share what I know about thread safety, but bear in mind that I don't know all that much (and don't know what I don't know). In a subsequent post, I'll share an updated example of Benders decomposition using the new callback structure.

What is a thread?


I'll refer you to the Wikipedia for a detailed definition. Basically, a thread is a chunk of code that is set up to be run asynchronously. Typically, the point of creating a thread is to allow the code to run in parallel with other parts of the parent program. The operating system swaps threads in and out of processor cores. Part of the reason for creating additional threads is to exploit the presence of multiple processor cores, but that's not the only reason. Consider a program designed to do some computationally intensive operation (such as solving an integer program) and assume the program has a graphical user interface (GUI). Chances are the GUI runs in a different thread from the computational portion of the program, even if it is running on a single-core processor. Otherwise, with everything in one thread, the GUI would become unresponsive for the entire time the program was crunching numbers. Among other things, that would make it impossible to use a GUI control or menu item to abort the computation if, say, you were jonesing to check your Twitter feed.

Before continuing, I think it's worth noting three things here. The first is that, in an era of multi-core computer processors, multithreaded applications are increasingly common, and increasingly attractive. CPLEX defaults to using multiple threads when solving optimization problems, although you can control the number of threads used (including throttling it to a single thread) via a parameter setting. Second, performance improvement due to multithreading is sublinear. If you go from one thread to two, the reduction in execution time is less than 50%. If you go from one thread to four, you will likely not see anywhere near a 75% reduction in processing time. This is partly due to added overhead required to set up and manage threads, and partly to the fact that threads can get in each others' way, jostling for CPU time and blocking progress of their siblings. Finally, making an application multithreaded may increase memory use, because you may need to make separate copies of some data for each thread.
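
(For what it's worth, throttling the thread count in the Java API is a one-liner; the snippet below is a sketch, assuming you already have an IloCplex instance named cplex.)

// limit CPLEX to a single thread
cplex.setParam(IloCplex.Param.Threads, 1);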

What is thread safety?


Again, I'll refer you to a Wikipedia definition. The key concept, I think, is that you want to defend against a couple of dangers. The first is that threads might block each other. Deadlock can occur when one thread is waiting for another thread to do something but simultaneously blocking the second thread from doing it (say, by hogging a resource). That's actually an oversimplification, in that more than two threads can be contributing to a deadlock. Starvation occurs when some thread cannot access the resources it needs because other threads (several different threads, or one thread over and over) keep blocking it.

The second danger, which helps explain how things like starvation or deadlock can come about, is the danger that one thread writes shared data while another thread is reading or using it. For instance, consider Benders decomposition. (This is not a hypothetical example: you'll see it in the next post, if you stick around.) Assume, as is usually the case, a MIP master problem and a single LP subproblem, and assume we are using a callback to test proposed integer feasible solutions coming from the master problem. When the solver thinks it has a new incumbent for the master problem, it calls the callback. The callback uses the proposed incumbent to adjust some constraint limits in the subproblem, solves the subproblem, and based on the result either accepts the incumbent or cuts it off with a Benders cut.

Now suppose that two different threads, A and B, both get (different) proposed incumbents, with A starting to process its candidate a little ahead of B. A modifies the LP subproblem and tries to solve it, but before it gets to the point of calling the solver, B (on a different core) starts modifying the subproblem. So when A calls the LP solver, the subproblem it is solving has a mix of some modifications A made and some modifications B made. At best, A ends up solving the wrong LP (and not knowing it). At worst, B is modifying the LP while A is trying to solve it. With CPLEX, at least, if this happens CPLEX throws an exception and the program likely grinds to a halt.

How do we make code thread-safe?


Good question. I don't know all the methods, but there are two fundamental techniques that I do know. The first is locks. Basically, locks are semaphores (flags, signals, whatever) that tell the system not to let any other thread touch some part of memory until the thread owning the lock is done. You write your code so that the part that will run concurrently locks shared objects, does what it needs with them, and then unlocks them. On the one hand, it's important to lock everything that needs locking. Missing one item is like trying to burglar-proof your home but then leaving the back door wide open. On the other hand, it's important to lock only what has to be locked, and only for as long as it needs to be locked. Hanging on to a lock for too long can block other threads, and hurt performance.

The second technique is to avoid contention for data by giving each thread its own personal copy of the data. Thread-safety varies from language to language. In Java, each thread gets its own stack, containing local variables, and no thread can touch another thread's stack. So, for instance, if a thread starts a loop indexed by local variable i, there is no danger that i is touched by another thread. On the other hand, Java objects are parked in the heap, and are available to any thread that knows the addresses of the objects. So to avoid collisions between threads, you can either copy the original data to the stack for each thread and let the thread mess with its own copy, or (if the data is a Java object) create a clone of the original object for each thread. The clone will live in the heap, but only the thread for which it was created will know its address, so no other thread will screw with it.

My modified Benders example (subject of a future post, after CPLEX 12.8 ships) will demonstrate both approaches.

If you are a Java programmer, and if multithreading is new to you, I recommend you look at Oracle's tutorial on concurrency in Java. It is written well, contains examples of things that can go wrong, and covers much if not all of what you need to know to handle thread safety while working with CPLEX generic callbacks.

Friday, November 3, 2017

CPLEX 12.8: Generic Callbacks

IBM is getting ready to release CPLEX 12.8, and I had the opportunity to attend a presentation about it by Xavier Nodet at the 2017 INFORMS annual meeting. Here are links to two presentations by Xavier: CPLEX Optimization Studio 12.8 - What's New and CPLEX 12.8 - the Generic Callback. As with any new release, there are multiple tweaks, performance improvements and some new features. What I want to focus on in this post and the next post or two (depending on how wordy I get) is a major change to the way CPLEX implements callbacks.

I'm going to assume any reader continuing beyond this point knows what a callback is (and, for that matter, knows what CPLEX is). The callback system used through CPLEX version 12.7.1 is now being referred to as the "legacy" callback system. A new system, beginning in version 12.8, is called the "generic" callback approach. I'll outline what I think are the key take-aways here, but if you are a callback user, I strongly suggest you check out Xavier's slide shows linked above.

The first thing to know is that, for at least the near future, CPLEX will continue to support legacy callbacks. That's important for two reasons: it may take time to adjust to the new generic callbacks (and to port code to the new system); and not everything you could do with legacy callbacks is available with generic callbacks (more on that below). The second thing to know is that you cannot mix legacy and generic callbacks in the same program. You have to use one or the other exclusively (or neither).

The new generic callback approach involves trade-offs. The legacy system involves essentially two types of callbacks: "information" callbacks (asking CPLEX what's going on at various points in the solution process) and "control" callbacks (meddling with how CPLEX solves a problem). Changes between how CPLEX worked back when control callbacks were introduced and how it works now created some adventures for both IBM and users. For instance, certain control callbacks were incompatible with dynamic search, so the presence of such a callback (even if it contained no code and did nothing) would cause CPLEX to turn off dynamic search. There were some other side effects, but I'm neither qualified nor motivated to list them all. Suffice it to say that IBM decided a revamped callback system, designed from the ground up to work fully with the current version of CPLEX, was warranted.

So here come the trade-offs. In exchange for the ability to use callbacks without having to turn off dynamic search, limit CPLEX to a single thread, or whatever, you lose some functionality. Generic callbacks are currently not implemented for continuous problems. (Again, you can still use legacy callbacks on those.) Generic callbacks do not support user-controlled branching or node selection, nor user solution of node problems. A side effect of losing the branch callbacks is that users apparently can no longer attach node data to nodes and query it later (since, in the legacy system, the node data was attached in a branch callback). If you need any of those features, you have to stick to legacy callbacks for now (while you lobby IBM and/or your congressman to get your favorite features added to the generic callback system). Apparently, data IBM has accumulated on which features are or are not widely used suggested that omitting branch and solve callbacks would inconvenience a small portion of users.

Finally, here comes what I think is potentially the biggest issue: it is now the responsibility of the user to ensure that a generic callback is thread-safe. This is not always easy to do, and it requires a depth of programming knowledge that you don't typically get in a first or maybe even second course on programming. To put it another way, I've been writing code for well over 45 years (going back to FORTRAN on a mainframe), and my experiments with the beta version of 12.8 represent the first time I've ever had to handle thread safety. In my next post (or two, depending on how chatty my muse is), I'm going to look specifically at thread safety. Meanwhile, if you're not comfortable with it, my advice is to stick to legacy callbacks (but start learning about writing thread-safe code -- sooner or later you'll need to know how).

Friday, September 1, 2017

Minimizing a Median

\( \def\xorder#1{x_{\left(#1\right)}} \def\xset{\mathbb{X}} \def\xvec{\mathbf{x}} \)A somewhat odd (to me) question was asked on a forum recently. Assume that you have continuous variables $x_{1},\dots,x_{N}$ that are subject to some constraints. For simplicity, I'll just write $\xvec=(x_{1},\dots,x_{N})\in\xset$. I'm going to assume that $\xset$ is compact, and so in particular the $x_{i}$ are bounded. The questioner wanted to know how to model the problem of minimizing the median of the values $x_{1},\dots,x_{N}$. I don't know why that was the goal, but the formulation is mildly interesting and fairly straightforward, with one wrinkle.

The wrinkle has to do with whether $N$ is odd or even. Suppose that we sort the components of some solution $\xvec$, resulting in what are sometimes called the "order statistics": $\xorder{1}\le\xorder{2}\le\dots\le\xorder{N}$. For odd $N$, the median is $$\xorder{\frac{N+1}{2}}.$$For even $N$, it is usually defined as $$\left(\xorder{\frac{N}{2}}+\xorder{\frac{N}{2}+1}\right)/2.$$

The odd case is easier, so we'll start there. Introduce $N$ new binary variables $z_{1},\dots,z_{N}$ and a new continuous variable $y$, which represents the median $x$ value. The objective will be to minimize $y$. In addition to the constraint $\xvec\in\xset$, we use "big-M" constraints to force $y$ to be at least as large as half the sample (rounding "half" up). Those constraints are: \begin{align*} y & \ge x_{i}-M_{i}z_{i},\ i=1,\dots,N\\ \sum_{i=1}^{N}z_{i} & =\frac{N-1}{2} \end{align*} with the $M_{i}$ sufficiently large positive constants. The last constraint forces $z_{i}=0$ for exactly $\frac{N+1}{2}$ of the indices $i$, which in turn forces $y\ge x_{i}$ for $\frac{N+1}{2}$ of the $x_{i}$. Since the objective minimizes $y$, the solver will set $z_{i}=0$ for the $\frac{N+1}{2}$ smallest of the $x_{i}$ and drive $y$ down to the largest of those. In other words, we are guaranteed that in the optimal solution $y=\xorder{\frac{N+1}{2}}$, i.e., it is the median of the optimal $\xvec$.

If $N$ is even and if we are going to use the standard definition of median, we need twice as many added variables (or at least that's the formulation that comes to mind). In addition to the $z_{i}$, let $w_{1},\dots,w_{N}$ also be binary variables, and replace $y$ with a pair of continuous variables $y_{1}$, $y_{2}$. The objective becomes minimization of their average $(y_{1}+y_{2})/2$ subject to the constraints \begin{align*} y_{1} & \ge x_{i}-M_{i}z_{i}\ \forall i\\ y_{2} & \ge x_{i}-M_{i}w_{i}\ \forall i\\ \sum_{i=1}^{N}z_{i} & =\frac{N}{2}-1\\ \sum_{i=1}^{N}w_{i} & =\frac{N}{2} \end{align*} where the $M_{i}$ are as before. The constraints force $y_{1}$ to be at least as large as $\frac{N}{2}+1$ of the $x_{i}$ and $y_{2}$ to be at least as large as $\frac{N}{2}$ of them. The minimum objective value will occur when $y_{1}=\xorder{\frac{N}{2}+1}$ and $y_{2}=\xorder{\frac{N}{2}}$.

Monday, August 21, 2017

Updated Stepwise Regression Function

Back in 2011, when I was still teaching, I cobbled together some R code to demonstrate stepwise regression using F-tests for variable significance. It was a bit unrefined, not intended for production work, and a few recent comments on that post raised some issues with it. So I've worked up a new and (slightly) improved version of it.

The new version is provided in an R notebook that contains both the stepwise function itself and some demonstration code using it. It does not require any R libraries besides the "base" and "stats" packages. There is at least one butt-ugly hack in it that would keep me from being hired in any sort of programming job, but so far it has passed all the tests I've thrown at it. If you run into issues with it, feel free to use the comment section below to let me know. I'm no longer teaching, though, so be warned that maintenance on this is not my highest priority.

The updated function has a few new features:
  • it returns the final model (as an lm object), which I didn't bother to do in the earlier version;
  • you can specify the initial and full models as either formulas (y~x+z) or strings ("y~x+z"), i.e., quotes are strictly optional; and
  • as with the lm function, it has an optional data = ... argument that allows you to specify a data frame.
There are also a few bug fixes:
  • if you set the alpha-to-enter greater than the alpha-to-leave, which could throw the function into an indefinite loop, the function will now crab at you and return NA;
  • if you try to fit a model with more parameters than you have observations, the function will now crab at you and return NA; and
  • the function no longer gets confused (I think) if you pick variable/column names that happen to clash with variable names used inside the function.
As always, the code is provided with a Creative Commons license, as-is, no warranty express or implied, your mileage may vary.

Update (11/09/18): I've tweaked the code to add a few features. See here for a post about the updates.

Monday, August 7, 2017

Rolling Horizons

I keep seeing questions posted by people looking for help as they struggle to optimize linear programs (or, worse, integer linear programs) with tens of millions of variables. In my conscious mind, I know that commercial optimizers such as CPLEX allow models that large (at least if you have enough memory) and can often solve them (at least if you have enough computing resources to throw at the problems). My lizard brain, however, was conditioned by the state of the art in the late '70s to send me running (typically while screaming) at the sight of a model with more than, oh, 50 or so variables. Wrapping my head around tens of millions of variables, let alone thinking of a strategy for getting an optimal solution "quickly", is quite a struggle.

A former acquaintance from my student days once articulated his strategy for essay exams to me: if you don't know the answer, argue the premise of the question. In that vein, I'm inclined to ask why so many variables are necessary. One possibility is that the model captures decisions for a sequence of time periods. If decisions in one time period had no impact on subsequent periods, the problem would decompose naturally into a bunch of smaller problems; so if we are talking about a multiperiod model, it's safe to assume that the periods are connected.

That brings me to a strategy I learned back in primitive times, the "rolling horizon" approach. Let me stipulate up front that this is a heuristic method. It does not provide a provably optimal solution for the entire time horizon. Still, given the up-tick in humongous models, I'm starting to wonder if rolling horizons are no longer taught (or are looked down on).

The basic concept is simple. The devil is in the details. Let's say we want to solve a planning model over a horizon of $H$ time periods, and that one omnibus model for the entire horizon is proving intractable. The rolling horizon approach is as follows.
  1. Pick a shorter horizon $K < H$ that is tractable, and a number of periods $F \le K$ to freeze.
  2. Set "boundary conditions" (more on this below).
  3. Solve a model for periods $1,\dots,K$, incorporating the boundary conditions.
  4. Freeze the decisions for periods $1,\dots,F$.
  5. Solve a model for periods $F+1,\dots, \min(F+K, H)$.
  6. Repeat ad nauseam.
Note that if $F<K$, some decisions made in each solution but not frozen will be subject to revision in the next model.
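
If it helps, here is a tiny, runnable sketch (in Java, with made-up horizon values) of the index bookkeeping; the print statement stands in for building, solving and freezing each subhorizon model.

public class RollingHorizonDemo {
  public static void main(String[] args) {
    int H = 12;       // full planning horizon
    int K = 4;        // subhorizon length (tractable model size)
    int F = 2;        // periods frozen after each solve
    int first = 1;    // first period not yet frozen
    while (first <= H) {
      int last = Math.min(first + K - 1, H);
      int freezeThrough = Math.min(first + F - 1, H);
      // Solve periods first..last with initial conditions from the frozen periods
      // and boundary conditions at 'last', then freeze periods first..freezeThrough.
      System.out.printf("solve %d..%d, freeze %d..%d%n", first, last, first, freezeThrough);
      first = freezeThrough + 1;
    }
  }
}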

The initial conditions (starting inventory, locations of vehicles, pending orders, ...) for each model are dictated by the state of the system after the last frozen period. The boundary conditions are limits on how much of a mess you can leave at the end of the reduced planning horizon (period $K$ in the first model, $K+F$ in the second model, etc.). More precisely, they limit the terminal state of things that will be initial conditions in the next model.

As a concrete example, consider a manufacturing scheduling model. You start with inventories of components and finished products, available capacity of various kinds, and unfilled orders, and you end with the same kinds of things. Without boundary conditions, your solution for the first model might end period $K$ with no finished goods inventory. Why make stuff if it doesn't count toward your demand within the time frame considered by the model? That might make the problem for periods $K+1$ onward infeasible, though: you have orders that must be filled, they exceed your immediate production capacity, and the cupboard was left bare by the previous solution.

So you want to add constraints of the form "don't leave me with less than this much inventory on hand" or "don't leave me with more than this many unfilled orders". Picking values for those limits is a bit of an art form. Make the limits too draconian and you will get a very suboptimal solution (say, piling up way more inventory than you'll really need) or possibly even make the early subproblems infeasible. Make the limits too laissez faire, and you may force thoroughly suboptimal solutions to later subproblems (piling on lots of expensive overtime, blowing off soon-to-be-former customers) or even make the later problems infeasible. Still, it's usually possible to come up with pretty reasonable boundary conditions, perhaps with some trial and error, and it's a way to get a solution that, if not globally optimal, is at least good (and preferable to staring bleary-eyed at your screen in the 30th hour of a run, wondering if the beast will ever converge).

The term "rolling horizon" was coined in reference to problems that are decomposed based on time, but the same concept may extent to problems that can be decomposed based on position. I used it not long ago to get a heuristic for a problem placing nodes in a physical network. In essence, we took the original geographic region and chopped it into pieces, picked a piece to start from, and then worked our way outward from that piece until all the pieces had been solved, each based on the (frozen) solution to the more inward pieces.

Friday, July 14, 2017

Memory Minimization

As I grow older, I'm starting to forget things (such as all the math I ever learned) ... but that's not the reason for the title of this post.

A somewhat interesting question popped up on Mathematics StackExchange. It combines a basic sequencing problem (ordering the processing of computational tasks) with a single resource constraint (memory) and a min-max criterion (minimize the largest amount of memory consumed anywhere in the sequence). What differentiates the problem from other resource-constrained scheduling problems I've seen is that the output of each task eats a specified amount of memory and must stay there until every successor task that needs it as input completes, after which the memory can be recycled. In particular, the output of the last task using the result of task $n$ will coexist momentarily with the output of $n$, before $n$'s output can be sent to the bit-recycler. Also, there is no time dimension here (we neither know nor care how much CPU time tasks need), just sequencing.

The data for the problem is a precedence graph (directed acyclic graph, one node for each task) plus an integer weight $w_n$ for each task $n$, representing the memory consumption of the value computed by task $n$. There are several ways to attack a problem like this, including (mixed) integer programming (MIP), constraint programming (CP), and all sorts of metaheuristics. MIP and CP guarantee optimal solutions given enough resources (time, memory, computing power). Metaheuristics do not guarantee optimality, but also do not require commercial software to implement and may scale better than MIP and possibly CP.

I thought I'd publish my MIP model for the problem, scaling be damned.

Index sets and functions


Let $N$ be the number of tasks (nodes).
  • I will use $V=\left\{ 1,\dots,N\right\}$ as the set of tasks and $S=\left\{ 1,\dots,N\right\}$ as the set of schedule slots. (Yes, I know they are the same set, but the dual notation may help distinguish things in context.)
  • $A\subset V\times V$ will represent the set of arcs, where $(u,v)\in A$ means that task $v$ takes the output of task $u$ as an input.
  • $\pi:V\rightarrow2^{V}$ and $\sigma:V\rightarrow2^{V}$ will map each task to its immediate predecessors (the inputs to that task) and immediate successors (tasks that need its value as inputs) respectively.

Parameters


The only parameter is $w:V\rightarrow\mathbb{N}$, which maps tasks to their memory footprints.

Variables


There are a few ways to express assignments in MIP models, the two most common of which are "does this critter go in this slot" and "does this critter immediately follow this other critter". I'll use the former.
  • $x_{ij}\in\left\{ 0,1\right\} \ \forall i\in V,\forall j\in S$ will designate assignments, taking the value 1 if task $i$ is scheduled into slot $j$ in the sequence and 0 otherwise.
  • $r_{ij}\in\left[0,1\right]\ \forall i\in V,\forall j\in S$ will take the value 1 if the output of task $i$ is resident in memory immediately prior to performing the computational task in slot $j$ and 0 otherwise.
  • $d_{ij}\in\left[0,1\right]\ \forall i\in V,\forall j\in S$ indicates whether all tasks requiring the output of task $i$ have been completed prior to the execution of the task in slot $j$ (1) or not (0).
  • $z>0$ is the maximum memory usage anywhere in the schedule.
We can actually winnow a few of these variables. If task $i$ has any predecessors, then clearly $x_{i1}=0$, since there is no opportunity before the first slot to execute a predecessor task. Similarly, $d_{i1}=0\,\forall i\in V$, since nothing has been completed prior to the first slot.

You may think at this point that you have spotted a typo in the second and third bullet points (brackets rather than braces). You haven't. Read on.

Objective


The easiest part of the model is the objective function: minimize $z$.

Constraints


The assignment constraints are standard. Every task is assigned to exactly one slot, and every slot is filled with exactly one task. $$\sum_{j\in S}x_{ij}=1\quad\forall i\in V$$ $$\sum_{i\in V}x_{ij}=1\quad\forall j\in S$$ Precedence constraints are also fairly straightforward. No task can be scheduled before each of its predecessors has completed.$$x_{ij}\le\sum_{k <j}x_{hk}\quad\forall i\in V,\forall j\in S,\forall h\in\pi(i)$$ Next, we define memory consumption. Immediately after executing the task in any slot $j$, the memory footprint will be the combined consumption of the output of that task and the output of any earlier task still in memory when slot $j$ executes.$$z\ge\sum_{i\in V}w(i)\left(r_{ij}+x_{ij}\right)\quad\forall j\in S$$ Defining whether the output of some task $i$ is no longer needed prior to executing the task in some slot $j>1$ is fairly straightforward.$$d_{ij}\le\sum_{k<j}x_{hk}\quad\forall i\in V,\forall 1<j\in S,\forall h\in\sigma(i)$$As noted above, there is no "before slot 1", so we start this set of constraints with $j=2$.

Finally, the result of task $i$ remains in memory at the point where the task in slot $j$ executes if task $i$ executed before slot $j$ and the memory has not yet been freed.$$ r_{ij}\ge\sum_{k<j}x_{ik}-d_{ij}\quad\forall i\in V,\forall 1<j\in S$$Again, this is nonsensical for $j=1$, so we start with $j=2$.

Comments


I'll end with a few observations about the model.
  • It has $O(N^2)$ binary variables, so, as I noted above, it will not scale all that gracefully.
  • As mentioned earlier, the relaxation of $d_{ij}$ and $r_{ij}$ from binary to continuous in the interval $[0,1]$ was deliberate, not accidental. Why is this justified? $d_{ij}$ is bounded above by an integer quantity, and the solver "wants" it to be $1$: the sooner something can be removed from memory, the better for the solution. Similarly, $r_{ij}$ is bounded below by an integer expression, and the solver "wants" it to be $0$: the less cruft hanging around in memory, the better. So the solution will be binary even if the variables are not.
  • The solution might be "lazy" about clearing memory. From the model perspective, there is no rush to remove unneeded results as long as they are not contributing to the maximum memory footprint. The maximum memory load $z$ may occur at multiple points in the schedule. At least one would contain nothing that could be removed (else the solution would not be optimal). At points $j$ where memory use is below maximum, and possibly at points where memory use equals the maximum, there may be tasks $i$ for which $d_{ij}=0$ in the solution but could be set to $1$, without however reducing $z$. This can be avoided by adding constraints that force $d_{ij}=1$ wherever possible, but those constraints would enlarge the model (quite possibly slowing the solver) without improving the final solution.

Friday, June 23, 2017

Premature Obituaries

[T]he report of my death was an exaggeration. (Mark Twain, 1897)
In a recent blog post, "Data Science Is Not Dead", Jean-Francois Puget discussed and dissented with a post by Jeroen ter Heerdt titled "Data Science is dead." Barring the possibility that Schroedinger shoved data science into a box and sealed it, both assertions cannot simultaneously be true. The central thesis of ter Heerdt's post is that data scientists have developed powerful and easy to use tools, now deployed on cloud services, that let lay people do the analyses that previously required said data scientists, in effect putting themselves out of business. Puget responds that "there is a very long tail of use cases where developing a packaged service isn't worth it", and draws parallels to operations research. Common and important problems such as routing, machine scheduling, crew scheduling and so on have led to the development of problem-specific commercial software, but "many companies still have an OR department (under that name, or as part of their data science department) because they have operations problems that cannot be optimized with off the shelf software of services".

I'm siding with Puget on this, having experienced the "demise" of management science (if not all of OR) some decades ago. When I started on the faculty at Michigan State University, management science (whether under that name, "operations research" or something else) was a common and important element of business school programs. We had mandatory core courses in M.S. at both the undergraduate and masters levels, as well as higher level "elective" courses that were de facto requirements for some doctoral concentrations. We also participated in an interdisciplinary OR program at the masters level.

Gradually (but not gradually enough for me), MS evaporated at MSU (a bit ironic given the respective acronyms). Some of the more applied subject matter was moved into "functional area" courses (production planning, marketing, finance); most of the more conceptual subject matter just went away. As previously noted, canned software began to be available to solve many of the problems. The perceived need shifted from someone who understood algorithms to someone who could model the problem well enough to generate the inputs for the software.

As Puget notes, there is still demand for OR/MS professionals because there are new problems to be recognized and modeled, and models to be solved that do not fit neatly into the canned software. I believe there is also another reason not to start shoveling dirt on the grave of OR/MS. Users who learned a few basic incantations in a functional area class, without learning how the magic works (or does not work), may misapply techniques or produce incorrect models. Those who learn OR/MS (or analytics) as incantations may also tend to be a bit too literal-minded.

A possible analogy is the difference between a chef and someone like me who equates "cook" with "microwave on high". A chef understands a recipe as a framework, to be adjusted as needed. You couldn't get the cut of meat called for? Switch to this meat, then adjust this spice and cook a bit longer. On the (exceedingly rare) occasions I actually prepare a dish, I follow the instructions slavishly and try not to freelance at all.

Along those lines, I attended a thesis proposal defense (for a student not trained in OR/MS) where the work involved delivery routing and included a variant of a traveling salesman model. Both the candidate and his committee took it as axiomatic that a vehicle could not pass through the same node in the routing graph twice because, well, that's part of the definition of the TSP. So I posed the following simple question. You have a warehouse W and two customers A and B, all on the same street, with W between A and B. A is in a cul de sac, so the network diagram looks like

A -- W -- B -- whatever

with any number of links to B but only the A-W edge incident to A (and only A-W and B-W incident to W). Trivial exercise: prove that, under the strict definition of a TSP, A and B cannot be on the same route, no matter how close they are to W (and each other).

My point here was not to bust the student's chops. An OR "chef" building a network model for truck deliveries would (hopefully) recognize that the arcs should represent the best (shortest, fastest, cheapest ...) way to travel between any pair of nodes, and not just physical roads. So, in the above example, there should be arcs between A and B that represent going "by way of" W. It's fine to say that you will stop at any destination exactly once, but I know of only two reasons why one would route a vehicle with a requirement that it pass through any location at most once: it's either laying land mines or dribbling radioactive waste behind it. Hopefully neither applied in the student's case.

The student, on the other hand, could not see past "arc = physical roadway" ... nor was he the only one. After the defense, the student contacted two companies that produce routing software for the trucking industry. According to him (and I'll have to take his word for this), neither of them had given any thought to the possibility that passing through the same node twice might be optimal, or to creating "meta-arcs" that represent best available routes rather than just direct physical links. If true, it serves as a reminder that canned software is not always correct software. Caveat emptor.

The flip side of being too "cookie cutter" in modeling is being too unstructured. As an instructor, I cringed at students (and authors) who thought a linear regression model could be applied anywhere, or that it was appropriate to pick any variable, even a categorical variable, as the dependent variable for a linear regression. To me, at least, neural networks and software services sold as "machine learning" are more opaque than regression models. Having someone with modest if any data science training cramming whatever data is at hand into one end of a statistical black box and accepting whatever comes out the other end does not bode well.

So, in addition to Puget's "tail problems", I think there will always be value in having some people trained at the "chef" level, whether it be in OR/MS or data science, working alongside the "line cooks".

Update: I just came across an article published by CIO, "The hidden risk of blind trust in AI’s ‘black box’", that I think supports the contention that companies will need data scientists despite the availability of machine learning tools. The gist is that not understanding how the AI tool arrives at its conclusions or predictions can create some hazard, particularly in regulated industries. For instance, if there is some question about whether a company discriminates illegally in providing mortgages or loans, "the machine said" may not be an adequate defense.

Monday, May 29, 2017

If This and That Then Whatever

I was asked a question that reduced to the following: if $x$, $y$ and $z$ are all binary variables, how do we handle (in an integer programming model) the requirement "if $x=1$ and $y=1$ then $z=1$"? In the absence of any constraints on $z$ when the antecedent is not true, this is very easy: add the constraint $$z \ge x + y - 1.$$Verification (by substituting all possible combinations of 0 and 1 for the variables) is left to the reader as an exercise.

I thought I had covered this in a previous post, but looking back it appears that it never came up (at least in this form). This might be my shortest post ever. :-)

Sunday, May 28, 2017

Update Error: Wrong Architecture

Yesterday I ran into one of those mystifying glitches that, while infrequent, serve as a reminder that Linux is not for the faint of heart. When I booted my desktop system (Linux Mint 18.1 Serena), the system tray icon for the software updater was displaying a red "X" that indicates it tried and failed to update its cache. This is not uncommon, and the solution is usually simple: open the application and click the update button.

This time, however, the update produced a slew of error messages. For what appeared to be each repository it checked, it printed an error message saying
Skipping acquire of configured file <something with "i686" in it> as repository <whatever> doesn't support architecture 'i686'
After all the architecture error messages, I also got one about a problem with a GPG encryption key on a CRAN mirror. That message had been popping up, harmlessly, on each update for quite a while, so I did not worry about it.

The PC is a 64-bit machine, with an AMD processor, and I'm used to architecture references that contain either "amd64" or "x86_64" rather than "i686". So my suspicion was that something had messed up a setting somewhere on the machine that identified the architecture to the updater. Running uname -m in a terminal got me "x86_64", confirming that the machine itself was not undergoing any sort of identity crisis.

After a fruitless search for a malformed configuration file, I went to a Mint help forum, where I received a few answers. One said it was a server error, and I should just wait for them to fix it. That didn't hold water for a couple of reasons. First, it would require simultaneous problems on a whole bunch of unrelated repository servers, which would be a bit too coincidental. Second, my laptop (also using an AMD 64-bit chip, but running Mint 18.0 Sarah and a slightly older version of the updater) had no problem updating. Another suggestion, reinstalling the updater, also bore no fruit.

Along the way, I discovered that in fact two repositories were being updated correctly. Those two, not coincidentally, specified "[arch=amd64]" in their source listings. Adding that string to all the other PPAs and extra repositories worked, albeit at some cost (the updater started to label the repositories as "arch=amd64" rather than by their correct names). Unfortunately, the updater would not let me use that trick on the two main repositories (one for Mint, one for Ubuntu), which left me still looking for a fix.
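(For anyone who wants to try the same trick: the edited entries in the files under /etc/apt/sources.list.d/ ended up looking something like the line below, where the PPA path is just a placeholder and "xenial" is the Ubuntu release on which Mint 18.x is based.)

deb [arch=amd64] http://ppa.launchpad.net/some-user/some-ppa/ubuntu xenial main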

Someone on the help forum suggested disabling all the PPAs on my list of sources and trying again. I did that, and (crucially) also disabled sources in the "Additional repositories" list, leaving just the Mint and Ubuntu repositories. Lo and behold, the update went through. So I started restoring the other repositories in a binary search, and found the culprit. Remember the CRAN mirror that had a problem with its encryption key? If I disabled just that source, which was in the "Additional repositories" list, things worked. So I switched mirrors, reloaded the public key, and things are back to normal.

I have no idea why the encryption key error, which I'd been getting for weeks if not months, suddenly caused things to go splat across the board -- and even less intuition as to why a bad encryption key would cause the updater to start using the wrong architecture code. For now, though, I'll settle for having updates working again.

Update 06/28/17:  It happened again, this time with the Revolution Analytics CRAN mirror yielding the message "server certificate verification failed" and something about a missing file (which I assume was their certificate). So I switched to yet another mirror, and we're back in business ... for the moment. I don't like the idea of downloading software over insecure/unencrypted connections (for reasons that I would hope are obvious), but this is getting tedious. Note to server admins: please keep on top of your certificate deployment.

Update 07/25/17: Here we go again, only this time with a new wrinkle. I added a PPA, did an update to the package list, and got the 'i686' architecture message for every repository. This time, though, there were no certificate errors, and disabling every source I could (including the newly added PPA) did not help. So something else was at fault. After searching the web for help, I tried the following commands in a terminal:

dpkg --print-architecture
dpkg --print-foreign-architectures

The first one listed 'amd64' (which is correct); the second listed 'i386' and 'i686'. I have one or more packages installed that actually do require the 'i386' architecture -- packages that only exist in 32 bit form (don't ask me which, I've forgotten) -- but none that I know of requiring 'i686'. So, again in a terminal, I ran

sudo dpkg --remove-architecture i686

and (after restoring all my repos) was able to update successfully. I don't know if adding the new PPA caused the 'i686' architecture to return, zombie-like, for another go around, or whether the package manager just wants to reset to that whenever I'm not looking. In any event, I think I have a fix now, but we'll see if it's permanent or temporary. 
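One caution for anyone else trying this: before removing an architecture, it is worth checking that nothing you have installed actually uses it. Running something like

dpkg -l | grep ':i686'

should come back empty. (My understanding is that dpkg will refuse to remove an architecture that still has installed packages, but it is nicer to find out before rather than after.)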
 

Thursday, April 20, 2017

Sharing "Implicit" Test Problems

The topic of reproducible research is garnering a lot of attention these days. I'm not sure there is a 100% agreed upon, specific, detailed definition of it, and I do think it's likely to be somewhat dependent on the type of research, but for purposes of this post the Wikipedia definition (previous link) works for me. In particular, I want to call out one part of the Wikipedia entry:
The term reproducible research refers to the idea that the ultimate product of academic research is the paper along with the laboratory notebooks and full computational environment used to produce the results in the paper such as the code, data, etc. that can be used to reproduce the results and create new work based on the research.
I've seen a lot of press related to reproducibility in the biological sciences, although I assume it's an issue in other sciences as well. (Good luck with that, physicists!) It's also increasingly an issue in the social sciences, where one study asserted that over half of the published results they tested failed to replicate. (That study itself was criticized as allegedly failing to replicate.) All of this has been termed by some a "replication crisis". In short, reproducibility is a big deal. There's at least one blog devoted to it, and a considerable amount of work in the statistics community is going into tools to facilitate reproducibility (such as R and Python "notebooks"). R users in particular should have a look at the rOpenSci Reproducibility Guide.

My research tends almost exclusively to fall into the category of "stupid math programming tricks". Either I'm trying to find some clever formulation of a (deterministic) problem, I'm trying to find an efficient way to generate an exact solution, I'm trying to find a decent heuristic for getting "good" solutions "quickly" (preferably without stretching analogies to nature too far: there's no way I'm beating slime mold optimization in the naming contest), or some combination of the above. Since I mostly avoid statistics (other than the occasional comparison of run times with some selected benchmark alternative), I've been largely unaffected by the debate on reproducibility ... until now.

Recently, the journals published by INFORMS have moved in the direction of reproducible research, and I suspect others are doing so (or will do so in the near future) as well. As an example relevant to my interests, the INFORMS Journal on Computing (JOC) has introduced policies on the sharing of both data and code. Personally, I think this is a good idea. I've shared data and code for one particular paper (machine scheduling) on my web site since the paper came out (20+ years ago), and I share data for a more recent paper as well (having released the code as an open source project).

At the same time, I recognize that there are various difficulties (licensing/copyright, for instance) in doing so, and also there are costs for the researcher. I'd be willing to share the code for one or two other projects, but it would take a major effort on my part to untangle the spaghetti and figure out which libraries are/are not required, and which code should be included or should be excluded as irrelevant to the ultimate experiments. I'm reluctant to commit that time until someone actually requests it.

There's another twist that I would not have anticipated until quite recently. I'm working on a project that involves "implicit hitting set" (IHS) problems. In an IHS problem, you have a master problem that formulates as a pure integer (in fact, 0-1) program. Candidate solutions to that model are fed to one or more "oracles", which either bless the solutions as feasible or generate additional constraints for the master problem that the candidates violate. Note that I have not said anything about the nature of the oracles. Oracles might be solving linear or integer programming models, which would be relatively easy to share as "data", but they might also be doing something decidedly funky that is encapsulated in their coding. In the latter case, the test "data" is actually code: I would have to provide the code for the oracles in order for someone to reproduce my results.

Well, okay, if sharing code is on the table now, isn't that just more code? Not quite. Let's say that some unfortunate doctoral student has been tasked by her advisor to code their shiny new IHS algorithm and test it against my published problems. The doctoral student used Python or Julia (or some other trendy language), whereas I used stodgy old Java, so there's a good chance the luckless doctoral student will have to recode my stuff (which, among other things, requires making sense of it). Moreover, I will have created an API to my oracles that may or may not fit with what they are doing ... and that's if I used an API at all. If I directly integrated program structures external to the oracle functions into the oracle (passed in variables, constraints or other elements of my master problems, for instance), our doctoral student will need to figure out how to isolate those elements and replace them with corresponding elements from her code ... if there is even a correspondence.

There's at least one implication in this for me. If I actually subscribe to the push for reproducibility (or if I take pity on other people's doctoral students), I need to code my oracles not as an integral part of my master code but as a separate software module or library with a well-documented and reasonably flexible interface to the outside world (API). <sigh>More work for me.</sigh>

Tuesday, March 28, 2017

Adventures with TeX Live

Since I use the Linux Mint operating system, the obvious (if not only) choice for a LaTeX distribution is TeX Live. (If you are not familiar with, or are not interested in, the LaTeX typesetting system, you have already read too far in this post.) On Mint, Ubuntu and other Debian-type operating systems, you typically install TeX Live by downloading and installing a number of fairly large files from a repository (in my case, mainly repositories for Ubuntu). Sometimes, though, you just want to install one or two small LaTeX packages without bothering with one of the large TeX Live files (which will contain a zillion other LaTeX packages you do not want or need). For this purpose, TeX Live includes a package manager (tlmgr).

Today, I found myself trying unsuccessfully to install something using tlmgr. The first problem was a missing compression/decompression program that was easy to install. After that, I continued to get cryptic error messages. Eventually, I decided to update tlmgr to see if that would help.

The Canonical (Ubuntu) repositories are fairly far behind the curve when it comes to TeX Live; they still provide the 2015 version, which is incompatible with newer versions (including the newer version of tlmgr). So updating tlmgr alone was not an option. (I had what appeared to be the most recent, and presumably final, incarnation of the 2015 version of tlmgr.) Following instructions I found on the Tips on Ubuntu site, I added a PPA containing the latest (I think) version of TeX Live, running the following in a shell:

sudo add-apt-repository ppa:jonathonf/texlive
sudo apt update 

After that, I used the software updater to update my TeX Live packages. (The "Tips on Ubuntu" instructions have you install TeX Live from the PPA, which presumes you do not already have a version installed, as I did.)

Before using tlmgr to do much of anything, you need to initialize the user tree by running

tlmgr init-usertree

in a shell. I had already done that, but I repeated it, confirming that the user tree was still intact. I then ran

tlmgr --gui

to open a graphical user interface, told it to load the default repository (there's a button in the "Repository" panel of the display), and tried -- unsuccessfully -- to install a LaTeX package.

So what was wrong now?  The error message contained the following snippet:

Tk::Error: /usr/bin/tlmgr: open(/usr/share/texlive/tlpkg/texlive.tlpdb)
failed: No such file or directory ...

I checked /usr/share/texlive/tlpkg, and sure enough there was no file texlive.tlpdb ... but there was a file named texlive.tlpdb.9ed73e8174b03a21aa0d40ebbcaac97f. It's a text file (around 12 MB) with configuration information and what looks like a directory structure. Why it had the rather funky hexadecimal extension, I have no idea. Working in that directory, I symlinked it by running

sudo ln -s texlive.tlpdb.9ed73e8174b03a21aa0d40ebbcaac97f texlive.tlpdb

and tlmgr started behaving itself.

I later deleted the file with the scary looking hex extension, reran tlmgr and had it load the default repository again. It reproduced the file, but with a different (equally scary) hex extension. I again symlinked the file to texlive.tlpdb.

I have no idea why loading a repository produces what appears to be the correct file with the wrong extension and does not symlink it, but at least this fix seems to work.
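In case anyone got this far and is wondering what the original point of the exercise was: once tlmgr is cooperating, installing an individual package into the user tree should be a one-liner, along the lines of

tlmgr --usermode install hyperref

(with hyperref here standing in for whatever package you actually need).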

Monday, February 27, 2017

MSU Students Do Microfinance

Several years ago, I found out about Kiva.org, an online "microfinance" site where individuals can make small loans ($25 is the standard increment at Kiva) to entrepreneurs in low income settings. The entrepreneurs actually apply for larger amounts, which they typically receive from third-party "field partners". The entrepreneurs repay the loans with interest to the field partners, who in turn repay the principal (no interest) to the original donors. A modest deposit by a donor can be turned over repeatedly into loans, enjoying a version of what in my long-ago undergrad econ class was termed the "multiplier effect". I've made a fair number of loans over the years, nearly all of which have been fully repaid. It's a bit of fun as well as personally satisfying.

So I was geeked to learn from one of my colleagues at the Broad College of Business, Professor Paulette Stenzel, that we have a student organization on campus dedicated to microfinance. The Spartan Global Development Fund is a 501(c)(3) nonprofit organization, created and run by students, that makes microloans both through their own field partners and through Kiva. It's a registered student organization, meaning that any Michigan State University student can join. Students and nonstudents alike can donate through PayPal.

Next time I'm on the Kiva site, I plan to join their Kiva "team", and I encourage anyone interested in microfinance to check out their site.

Thursday, February 23, 2017

Another Absolute Value Question

Someone asked whether it is possible to eliminate the absolute value from the constraint$$Lx\le |y| \le Ux$$(where $L\ge 0$ and $U>0$ are constants, $x$ is a binary variable, and $y$ is a continuous variable). The answer is yes, but what it takes depends on whether $L=0$ or $L>0$.

The easy case is when $L=0$. There are two possibilities for the domain of $y$: either $y\in [0,0]$ (if $x=0$) or $y\in [-U,U]$ (if $x=1$). One binary variable is enough to capture two choices, so we don't need any new variables. We just need to rewrite the constraint as$$-Ux\le y \le Ux.$$If $L>0$, then there are three possibilities for the domain of $y$: $[-U, -L] \cup [0,0] \cup [L, U]$. That means we'll need at least two binary variables to cover all choices. Rather than change the interpretation of $x$ (which may be used elsewhere in the questioner's model), I'll introduce two new binary variables ($z_1$ for the left interval and $z_2$ for the right interval) and link them to $x$ via $x=z_1+z_2$. That leads to the following constraints:$$-Uz_1 \le y \le -Lz_1 + U(1-z_1)$$and$$Lz_2 - U(1-z_2) \le y \le Uz_2.$$ If $x=0$, both $z_1$ and $z_2$ must be 0, and the new constraints force $y=0$. If $x=1$, then either $z_1=1$ or $z_2=1$ (but not both). Left to the reader as an exercise: verify that $z_1=1\implies -U\le y \le -L$ while $z_2=1 \implies L\le y\le U$.
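(Partial spoiler for that last exercise: with $z_1=1$ and hence $z_2=0$, the two constraints reduce to$$-U \le y \le -L\quad\text{and}\quad -U \le y \le 0,$$whose intersection is $[-U,-L]$ since $L>0$. The $z_2=1$ case works the same way.)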

Saturday, January 21, 2017

Fischetti on Benders Decomposition

I just came across slides for a presentation that Matteo Fischetti (University of Padova) gave at the Lunteren Conference on the Mathematics of Operations Research a few days ago. Matteo is both an expert on and, dare I say, an advocate of Benders decomposition. I use Benders decomposition (or variants of it) rather extensively in my research, so it ends up being a frequent theme in my posts. Those posts tend to generate more comments than posts on other topics, so apparently Matteo and I are not the only BD users out there.

I don't know that I would recommend Matteo's presentation as the starting point for someone who has heard of BD but never used it, but I certainly recommend having a look at the slides for anyone who has any familiarity with BD. Matteo provides several interesting perspectives as well as a tip or two for potentially improving performance. I learned a few new things.

In a sad coincidence, Professor Jacques Benders, the originator of Benders decomposition, passed away at age 92 just eight days before Matteo's presentation.

Tuesday, January 10, 2017

Pro Bono Analytics Is Growing Social

Pro Bono Analytics is a program by INFORMS (the Institute for Operations Research and the Management Sciences, for the acronym-averse), "the largest society in the world for professionals in the field of operations research (O.R.), management science, and analytics". PBA "connects our members and other analytics professionals with nonprofit organizations working in underserved and developing communities". In other words, we hook up charitable organizations needing analytics or OR help with volunteers willing to provide it without compensation. Our volunteers are a mix of industry practitioners, academics, students and the occasional geezer retiree. I think the majority of the volunteer pool comes from the U. S., and a majority are INFORMS members, but we have volunteers from as far away as Australia, and a significant portion are non-members.

I'd encourage anyone with OR/MS/analytics skills to consider volunteering, and anyone who knows a charitable organization (particularly those of limited financial means) to let them know we're out there willing to extend a helping hand. Both potential clients and potential volunteers can find out more, and signify their interest, at the PBA home page (repeating the link above).

Our new (and apparently energetic) staff liaison, Tia Carrai, has expanded PBA's social media footprint. You can find us now on the following:
I'd also like to give a shout-out to Pro Bono O.R., a like-minded initiative by our sister institution in the U. K., the Operational Research Society.

Friday, January 6, 2017

Mapping Trackball Buttons

For years, I've used a Logitech M570 wireless trackball with my Linux Mint PC. I generally prefer trackballs to mice -- no need to lift and reposition after a bunch of movement -- and I find that using my thumb, rather than my index finger (or, if I'm in a bad mood, my middle finger) to move the cursor is less fatiguing for my hand and wrist. The M570 works fine with Linux (at least Mint and Ubuntu) with no need for additional drivers.

In addition to the usual (at least for non-Mac users) three main buttons (left, right and a combination scroll wheel/third button), the M570 has a couple of secondary buttons, which Logitech describes as "large, easy-to-reach Back/Forward buttons". I'm not sure I'd agree about "large", but I agree that they are easy to reach, and up until my most recent operating system upgrade I would have agreed they acted as back and forward buttons. Prior to the upgrade (and with no special configuration on my part, at least none that I can remember), the extra buttons acted like page up and page down in every web browser or document reader that I used. After the upgrade (and the switch from the Cinnamon desktop to the MATE desktop), their behavior changed. In the Firefox web browser, they would switch among tabs rather than vertically scrolling the current tab. In Xreader and Acrobat Reader, they likewise did not move forward or backward among pages. (I don't recall trying to read multiple documents at once to see if they would switch among open documents.)

I find the forward/backward action rather handy with multipage documents and long web pages, so I wanted the previous behavior back. It turned out (after some research) not to be hard to do.

The first step is to install the xbindkeys and xautomation packages, both available from the Canonical repositories. This can be done via Synaptic or by running
     sudo apt-get install xbindkeys
     sudo apt-get install xautomation
in a terminal. (I also installed xbindkeys-config, which xbindkeys "suggests", but I don't think it's really necessary.) The xautomation package provides a command xte that can fake a key press.

The second step is to determine what button numbers are assigned to the forward and backward buttons. Run
     xev | grep -i button
in a terminal. This will open a small window with a target. Position the cursor over that window and click each of the buttons. You will see two messages in the terminal for each button, one for the press and one for the release. Look for the button numbers. For me, they were button 8 for forward and button 9 for backward, but your mileage may vary.

The third step is to configure xbindkeys to translate the extra trackball buttons appropriately. The configuration for xbindkeys is kept in a plain text file named .xbindkeysrc in your home directory (~/.xbindkeysrc). You can either create a new one (if you don't have one yet, or if there is nothing in it you want to keep) containing the lines below, or append those lines to the existing file, using your choice of text editor.
# Key bindings for Logitech M570 trackball
"xte 'key Page_Up'"
b:9
"xte 'key Page_Down'"
b:8
If your button numbers differ from mine (8 and 9), edit the lines accordingly. Note carefully the single and double quotation marks; xte seems a bit finicky about them.
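Incidentally, you can sanity-check the xte half of this without involving xbindkeys at all. Run something like
     sleep 3; xte 'key Page_Down'
in a terminal, then click into a browser or document window during the three-second pause; the page should scroll down when the fake key press fires.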

Finally, you'll want to test this. You need to restart xbindkeys to get it to read the modified configuration file. Theoretically you can do this by running
     killall -HUP xbindkeys
in a terminal, but I found it necessary to restart the X server (after closing any applications using it, such as my web browser and email client). Typically Control+Alt+Backspace will do the trick. After that, try the trackball keys and hopefully they'll behave as expected.