OR in an OB World: November 2017

Monday, November 13, 2017

Benders Decomposition with Generic Callbacks

Brace yourself. This post is a bit long-winded (and arguably geekier than usual, which is saying something). Also, it involves CPLEX 12.8, which will not ship until some time next month.

I have an updated version of an old example, solving a fixed charge transportation problem using Benders decomposition. The example (using Java, naturally) is scattered across several posts, going back a ways:

the original example was posted in 2012;
an updated version was posted in 2014; and
when IBM added automated Benders decomposition to CPLEX 12.7 in 2016, I revised the example yet again.

Now here I am breaking the even year pattern by doing yet another version in 2017. If you want to see the MIP model written out in mathematical notation, it's in the 2012 post. If you want to see the code for the most recent version, which contains both a "manual" decomposition (using callbacks) and a few versions of the new (in 12.7) automatic decomposition feature, there's a link to the repository in the 2016 post. At this point in time, the 2014 post is not worth reading.

Before I get into the updated example code, let me outline some key differences between the old ("legacy") callback system and the new ("generic") callback system.

When is a callback called by CPLEX?

Legacy: This is dictated by the type of callback. A branch callback is called when CPLEX is ready to split a node into children, a lazy constraint callback is called when CPLEX identifies what it thinks is a new integer-feasible solution, and so on.
Generic: When you attach your callback to the model (IloCplex instance), you specify via a bit mask in which of several "contexts" it should be called. As of this writing (things do have a habit of changing), the contexts are Candidate (a new integer-feasible candidate for incumbent solution has been found), GlobalProgress (progress has occurred at the "global" level -- presumably something you might see in the node log), LocalProgress (a thread has made what it thinks is progress, but has not yet passed it back to the controller), Relaxation (CPLEX has found a solution that might not be integer-feasible, such as the solution to a node LP), ThreadDown (CPLEX deactivated a thread) and ThreadUp (CPLEX activated a thread).

If you are fuzzy about the difference between local and global progress, one example is a thread that finds a new incumbent solution, one that is better than the incumbent that existed when the thread was launched, but not as good as the current incumbent (recently found by a different thread). That is local progress (in the first thread) but not global progress (the incumbent found by the first thread will be ignored when it gets around to reporting it to the controller).

Attaching callbacks to models:

Legacy: You attach separate callbacks of each desired type (info callback, branch callback, user cut callback, ...).
Generic: You attach a single callback that handles all supported operations. Attaching a second callback automatically detaches the first one. When attaching the callback, you specify via a bitmap from which context(s) the callback is to be called.

Callback classes:

Legacy: Your class extends one of the existing callbacks. For instance, if you want a lazy constraint callback, you extend the IloCplex.LazyConstraintCallback class and override the abstract main() method.
Generic: Your class implements the IloCplex.Callback.Function interface and overrides the abstract invoke() method.

Calling functions inside the callback:

Legacy: You call one of the methods for the class you extended. For example, in a branch callback (extending IloCplex.BranchCallback), you would call one of the overloads of makeBranch() to create a new branch.
Generic: When CPLEX calls the invoke() method, it passes in as an argument an instance of IloCplex.Callback.Context. This object has a number of methods you can invoke to find out the context, access a candidate solution (if one exists), get information about what is going on (including the index of the thread in which it was invoked, and the total number of threads in use), etc. Basically, this is your key to the castle.

Benders decomposition

For Benders decomposition, we will be focused on the Candidate context (CPLEX thinks it has an integer-feasible solution, which we need to test) and the ThreadUp and ThreadDown contexts, for reasons I'll eventually get to.

I wrote about thread safety in my previous post. Let me pin down why this is critical when using Benders decomposition. With Benders, we have a master problem and one or more subproblems. (I'll stick with one subproblem here, for concreteness.) The master problem model itself is never modified. Benders cuts are added to the working copy of the master problem, from the callback. Several threads could be generating Benders cuts more or less simultaneously, but it is the responsibility of CPLEX (not your code) to make sure that multiple threads adding cuts don't trip over each other.

On the other hand, the way we generate cuts is to use the proposed incumbent solution to modify the constraint limits in the subproblem (or the subproblem objective, if you're one of those people who prefers to work with the dual of the actual subproblem). It's our code (specifically, the callback) making those changes, and if changes are being made concurrently by multiple threads, bad things can happen. Just to verify this is not hypothetical, I converted my original Benders code to the new callback system and ran it without doing anything special regarding thread safety. It crashed the JVM in a matter of seconds. Lesson learned.

So what are our options? One is to use a single copy of the subproblem and impose a locking mechanism that prevents more than one thread from accessing the subproblem at a time. That turns out to be the easiest to program in Java (at least in my opinion), but it slows the solver down due to threads being blocked. What seems to be the general wisdom for using locks is that you should only use them on chunks of code that execute quickly. Modifying what might be a large LP model, solving it, and finding a dual ray or Farkas certificate if it is infeasible or the dual solution if it is feasible does not meet the "execute quickly" criterion.

Another option is to generate an independent copy of the subproblem each time a thread needs it. That seems wastefully slow to me. A third, more promising approach is to generate one copy of the subproblem per thread and store it, reusing it each time the callback is called in that thread. This again eats CPU cycles generating the subproblem multiple times, but once per thread is a lot less than once each time CPLEX calls the callback. A downside, however, is that this could consume a prohibitive amount of memory if the subproblem is quite large.

The revised Benders example

The latest version of my Benders example is available at https://gitlab.msu.edu/orobworld/BendersExample3. In addition to my code (and Java), you will need CPLEX 12.8 (which I am told is coming out in December) and the open-source Apache Commons CLI library (for processing command line options).

I've added some command line options to let you alter a number of things without having to alter or comment out code: which models to run; what random seed to use; how many trials to run (default 1); whether to show CPLEX log output for neither problem, just the master, or both master and subproblem; whether to show the formulas of the cuts generated (or just settle for messages telling you when a cut is added); and whether to print the full solution (which warehouses are used, what the shipment volumes are). Run the program with "-h" or "--help" on the command line to get the full list of options.

The new code has two versions of the "manual" (callback-based) Benders method. The "-m" command line option runs the default "manual" model, while the "-m2" option runs what I inelegantly named the "thread-local manual" model. The difference is in the approach taken to assure thread safety. I'll briefly describe that below, and of course you can pore over the code if you want more details.

Locking

The default manual model avoids collisions between threads by locking parts of the code so that one thread cannot execute certain portions when any other thread is executing certain portions. There are multiple ways to do this in Java, but a simple one (which I use here) is to synchronize methods. To do this, all you have to do is add the keyword synchronized to the method declarations. (In addition to synchronizing methods, you can synchronize smaller chunks of code, but I did not use that feature.)

My ManualBenders class has a private inner class named BendersCallback. The callback is entered only from the Candidate context. Its sole method is the invoke() method mandated by the IloCplex.Callback.Function interface described earlier. By declaring that method

public synchronized void invoke(final IloCplex.Callback.Context context)

I ensure that no thread can invoke the callback while another thread is in the callback. The good news is that it works (prevents threads from interfering with each other) and it's incredibly easy to do. The bad news is that it results in threads blocking other threads. The subproblem in the example is pretty small. If it were slow to solve (or large enough to make updating bounds meaningfully slower), performance would be even worse.

Thread-specific copies of the subproblem

The "thread-local" model avoids thread collisions by giving each thread a separate copy of the subproblem to play with. I created a class (cleverly named Subproblem) to hold instances of the subproblem. The callback (instance of class BendersCallback2) is now entered from three different contexts, ThreadUp, ThreadDown and Candidate. As before, in the Candidate context we do the usual things: use the candidate solution to the master problem to modify constraint limits in the subproblem; solve the subproblem; depending on whether it is feasible, and what its optimal objective value is, either accept the new incumbent or reject it by adding a cut. We use calls in the ThreadUp context to create and store a copy of the subproblem for the newly activated thread, and similarly we use calls in the ThreadDown context to delete the local subproblem copy and recover its memory.

So where do we store subproblem copies so that only the thread invoking the callback can see its designated copy? In their BendersATSP2.java example code, IBM creates a vector, of length equal to the maximum number of threads that will be used, to store subproblems. The context argument to the callback provides a method, context.getIntInfo(IloCplex.Callback.Context.Info.ThreadId), that returns the zero-based index number of thread invoking the callback, which is used to pluck the correct copy of the subproblem from the vector of copies.

I took a slightly different approach in my code, using a generic Java class named ThreadLocal that exists precisely to hold objects local to a given thread. My BendersCallback2 class contains a private field declared

private final ThreadLocal<Subproblem> subproblem

to hold the copy of the subproblem belonging to a given thread. When the callback is called from the ThreadUp context, subproblem.set() is used to store a newly constructed copy of the subproblem; when called from the Candidate context, subproblem.get() is used to access the copy of the subproblem belonging to the thread. (Since this is all rather new, both to CPLEX and to me, and since I have a suspicious nature, I stuck some extra code in the example to ensure that subproblem is empty when called from ThreadUp and not empty when called from Candidate or ThreadDown. There are comments indicating the superfluous code. Feel free to omit it from your own projects.)

I thought that when a thread was deleted, the subproblem variable would be deleted. Since subproblem is (I think) the only pointer to the Subproblem instance for that thread, I expected the memory used by the Subproblem to be recovered during garbage collection. Apparently I was wrong: my initial attempt leaked memory (enough to shut down the JVM after solving a few problems). So I added an invocation of the callback from the ThreadDown context (thread deactivation), at which point the callback deletes the Subproblem instance. That seems to have cured the memory leak. So, in summary, the callback is called in three contexts:

ThreadUp -- create a copy of the subproblem for the thread;
Candidate -- test a candidate incumbent and, if necessary, generate a Benders cut; and
ThreadDown -- delete the copy of the subproblem belonging to that thread.

I'm comfortable with this approach, but you might prefer the subproblem vector used in the BendersATSP2.java example.

Performance Comparison

I ran five solution methods (Benders with locking, Benders with copies of the subproblem for each thread, and three versions of the automatic Benders decomposition added to CPLEX 12.7) on 10 instances of the fixed charge transportation problem. That's a small sample, based on a single type of problem, on a single machine (four cores, 4 GB of RAM), so I wouldn't read too much into the results. Nonetheless, I thought I would close by comparing solution times. The three automatic methods provided by CPLEX ("annotated", "automatic" and "workers", described in my 2016 post) had virtually indistinguishable times, so to make the plot more reasonable I am including only the "automatic" method (labeled "Auto"), along with my two approaches ("Local" for the method using copies of the subproblem for each thread and "Locked" for the method using synchronization).

The plot below shows, as a function of wall clock time, the fraction of trials that reached proven optimality by the specified time. The automatic approach and my manual approach with thread-local subproblems were competitive (with the automatic approach having a slight advantage), while the manual approach with synchronized class methods lagged rather badly. Although I cannot say with certainty, I am rather confident that is a result of threads engaged in solving a subproblem blocking other threads wanting to do the same.

performance plot (completion rate versus run time, by method)

Finally, just to drive home the realities of multiple threads, here is a sample of the output generated by the synchronized (locked) approach during one run.

>>> Adding optimality cut
>>> Accepting new incumbent with value 3598.1162756564636
>>> Accepting new incumbent with value 3598.1162756564636
>>> Accepting new incumbent with value 4415.750134379851
>>> Accepting new incumbent with value 4415.750134379851
>>> Accepting new incumbent with value 4411.334611014113
>>> Accepting new incumbent with value 4411.334611014113
>>> Accepting new incumbent with value 4700.008131082763
>>> Accepting new incumbent with value 4700.008131082763
>>> Adding optimality cut
>>> Adding optimality cut
>>> Adding optimality cut
Final solver status = Optimal
# of warehouses used = 14
Total cost = 3598.1162756564636
Solution time (seconds) = 143.090

The objective here is minimization. The messages beginning ">>>" are printed by the callback. Notice that after the optimal solution is found and accepted (in two different threads?), provably suboptimal solutions (solutions with inferior objective values) appear to be accepted several times. This also happens with the thread-local approach. It may well be happening in the various automated Benders approaches, but since they do not print progress messages there is no way to tell. This illustrates the difference between "local" and "global" progress. It happens because the various worker threads do not communicate with each other, and communicate only sporadically with the controller thread. So if thread 3 finds the optimal solution and has it accepted, thread 1 does not know this and continues cranking through its portion of the search tree, finding new "incumbents" that are improvements over the best solution thread 1 has seen but not necessarily improvements over the best global solution. The inferior solutions are accepted in the threads in which they were found, but the controller knows not to accept them as incumbents (because it already has a better incumbent). This behavior is not an error; it's just another indication why adding threads does not improve performance proportionally.

Saturday, November 4, 2017

Thread Safety

As I noted in yesterday's post, one of the major changes associated with the new "generic" callback structure in CPLEX is that users now bear the responsibility of making their callbacks thread-safe. As I also noted yesterday, this is pretty new stuff for me. So I'm going to try to share what I know about thread safety, but bear in mind that I don't know all that much (and don't know what I don't know). In a subsequent post, I'll share an updated example of Benders decomposition using the new callback structure.

What is a thread?

I'll refer you to the Wikipedia for a detailed definition. Basically, a thread is a chunk of code that is set up to be run asynchronously. Typically, the point of creating a thread is to allow the code to run in parallel with other parts of the parent program. The operating system swaps threads in and out of processor cores. Part of the reason for creating additional threads is to exploit the presence of multiple processor cores, but that's not the only reason. Consider a program designed to do some computationally intensive operation (such as solving an integer program) and assume the program has a graphical user interface (GUI). Chances are the GUI runs in a different thread from the computational portion of the program, even if it is running on a single-core processor. Otherwise, with everything in one thread, the GUI would become unresponsive for the entire time the program was crunching numbers. Among other things, that would make it impossible to use a GUI control or menu item to abort the computation if, say, you were jonesing to check your Twitter feed.

Before continuing, I think it's worth noting three things here. The first is that, in an era of multi-core computer processors, multithreaded applications are increasingly common, and increasingly attractive. CPLEX defaults to using multiple threads when solving optimization problems, although you can control the number threads used (including throttling it to a single thread) via a parameter setting. Second, performance improvement due to multithreading is sublinear. If you go from one thread to two, the reduction in execution time is less than 50%. If you go from one thread to four, you will likely not see anywhere near a 75% reduction in processing time. This is partly due to added overhead required to set up and manage threads, and partly to the fact that threads can get in each others' way, jostling for CPU time and blocking progress of their siblings. Finally, making an application multithreaded may increase memory use, because you may need to make separate copies of some data for each thread.

What is thread safety?

Again, I'll refer you to a Wikipedia definition. The key concepts, I think, is that you want to defend against a couple of dangers. The first is that threads might block each other. Deadlock can occur when one thread is waiting for another thread to do something but simultaneously blocking the second thread from doing it (say, by hogging a resource). That's actually an oversimplification, in that more than two threads can be contributing to a deadlock. Starvation occurs when some thread cannot access the resources it needs because other threads (several different threads, or one thread over and over) keep blocking it.

The second danger, which helps explain how things like starvation or deadlock can come about, is the danger that one thread writes shared data while another thread is reading or using it. For instance, consider Benders decomposition. (This is not a hypothetical example: you'll see it in the next post, if you stick around.) Assume, as is usually the case, a MIP master problem and a single LP subproblem, and assume we are using a callback to test proposed integer feasible solutions coming from the master problem. When the solver thinks it has a new incumbent for the master problem, it calls the callback. The callback uses the proposed incumbent to adjust some constraint limits in the subproblem, solves the subproblem, and based on the result either accepts the incumbent or cuts it off with a Benders cut.

Now suppose that two different threads, A and B, both get (different) proposed incumbents, with A beginning processing a little ahead of B. A modifies the LP subproblem and tries to solve it, but before it gets to the point of calling the solver B (on a different core) starts modifying the subproblem. So when A calls the LP solver, the subproblem it is solving has a mix of some modifications A made and some modifications B made. At best, A ends up solving the wrong LP (and not knowing it). At worst, B is modifying the LP while A is trying to solve it. With CPLEX, at least, if this happens CPLEX throws an exception and the program likely grinds to a halt.

How do we make code thread-safe?

Good question. I don't know all the methods, but there are two fundamental techniques that I do know. The first is locks. Basically, locks are semaphores (flags, signals, whatever) that tell the system not to let any other thread touch some part of memory until the thread owning the lock is done. You write your code so that the part that will run concurrently locks shared objects, does what it needs with them, and then unlocks them. On the one hand, it's important to lock everything that needs locking. Missing one item is like trying to burglar-proof your home but then leaving the back door wide open. On the other hand, it's important to lock only what has to be locked, and only for as long as it needs to be locked. Hanging on to a lock for too long can block other threads, and hurt performance.

The second technique is to avoid contention for data by giving each thread its own personal copy of the data. Thread-safety varies from language to language. In Java, each thread gets its own stack, containing local variables, and no thread can touch another thread's stack. So, for instance, if a thread starts a loop indexed by local variable i, there is no danger that i is touched by another thread. On the other hand, Java objects are parked in the heap, and are available to any thread that knows the addresses of the objects. So to avoid collisions between threads, you can either copy the original data to the stack for each thread and let the thread mess with its own copy, or (if the data is a Java object) create a clone of the original object for each thread. The clone will live in the heap, but only the thread for which it was created will know its address, so no other thread will screw with it.

My modified Benders example (subject of a future post, after CPLEX 12.8 ships) will demonstrate both approaches.

If you are a Java programmer, and if multithreading is new to you, I recommend you look at Oracle's tutorial on concurrency in Java. It is written well, contains examples of things that can go wrong, and covers much if not all of what you need to know to handle thread safety while working with CPLEX generic callbacks.

Friday, November 3, 2017

CPLEX 12.8: Generic Callbacks

IBM is getting ready to release CPLEX 12.8, and I had the opportunity to attend a presentation about by Xavier Nodet at the 2017 INFORMS annual meeting. Here are links to two presentations by Xavier: CPLEX Optimization Studio 12.8 - What's New and CPLEX 12.8 - the Generic Callback. As with any new release, there are multiple tweaks, performance improvements and some new features. What I want to focus on in this post and the next post or two (depending on how wordy I get) is a major change to the way CPLEX implements callbacks.

I'm going to assume any reader continuing beyond this point knows what a callback is (and, for that matter, knows what CPLEX is). The callback system used through CPLEX version 12.7.1 is now being referred to as the "legacy" callback system. A new system, beginning in version 12.8, is called the "generic" callback approach. I'll outline what I think are the key take-aways here, but if you are a callback user, I strongly suggest you check out Xavier's slide show linked above.

The first thing to know is that, for at least the near future, CPLEX will continue to support legacy callbacks. That's important for two reasons: it may take time to adjust to the new generic callbacks (and to port code to the new system); and not everything you could do with legacy callbacks is available with generic callbacks (more on that below). The second thing to know is that you cannot mix legacy and generic callbacks in the same program. You have to use one or the other exclusively (or neither).

The new generic callback approach involves trade-offs. The legacy system involves essentially two types of callbacks: "information" callbacks (asking CPLEX what's going on at various points in the solution process) and "control" callbacks (meddling with how CPLEX solves a problem). Changes between how CPLEX worked back when control callbacks were introduced and how it works now created some adventures for both IBM and users. For instance, certain control callbacks were incompatible with dynamic search, so the presence of such a callback (even if it contained no code and did nothing) would cause CPLEX to turn off dynamic search. There were some other side effects, but I'm neither qualified nor motivated to list them all. Suffice it to say that IBM decided a revamped callback system, designed from the ground up to work fully with the current version of CPLEX, was warranted.

So here come the trade-offs. In exchange for the ability to use callbacks without having to turn off dynamic search, limit CPLEX to a single thread, or whatever, you lose some functionality. Generic callbacks are currently not implemented for continuous problems. (Again, you can still use legacy callbacks on those.) Generic callbacks do not support user-controlled branching or node selection, nor user solution of node problems. A side effect of losing the branch callbacks is that users apparently can no longer attach node data to nodes and query it later (since, in the legacy system, the node data was attached in a branch callback). If you need any of those features, you have to stick to legacy callbacks for now (while you lobby IBM and/or your congressman to get your favorite features added to the generic callback system). Apparently, data IBM has accumulated on which features are or are not widely used suggested that omitting branch and solve callbacks would inconvenience a small portion of users.

Finally, here comes what I think is potentially the biggest issue: it is now the responsibility of the user to ensure that a generic callback is thread-safe. This is not always easy to do, and it requires a depth of programming knowledge that you don't typically get in a first or maybe even second course on programming. To put it another way, I've been writing code for well over 45 years (going back to FORTRAN on a mainframe), and my experiments with the beta version of 12.8 represent the first time I've ever had to handle thread safety. In my next post (or two, depending on how chatty my muse is), I'm going to look specifically at thread safety. Meanwhile, if you're not comfortable with it, my advice is to stick to legacy callbacks (but start learning about writing thread-safe code -- soon or later you'll need to know how).