I have an updated version of an old example, solving a fixed charge transportation problem using Benders decomposition. The example (using Java, naturally) is scattered across several posts, going back a ways:
- the original example was posted in 2012;
- an updated version was posted in 2014; and
- when IBM added automated Benders decomposition to CPLEX 12.7 in 2016, I revised the example yet again.
Before I get into the updated example code, let me outline some key differences between the old ("legacy") callback system and the new ("generic") callback system.
When is a callback called by CPLEX?
- Legacy: This is dictated by the type of callback. A branch callback is called when CPLEX is ready to split a node into children, a lazy constraint callback is called when CPLEX identifies what it thinks is a new integer-feasible solution, and so on.
- Generic: When you attach your callback to the model (IloCplex instance), you specify via a bit mask in which of several "contexts" it should be called. As of this writing (things do have a habit of changing), the contexts are Candidate (a new integer-feasible candidate for incumbent solution has been found), GlobalProgress (progress has occurred at the "global" level -- presumably something you might see in the node log), LocalProgress (a thread has made what it thinks is progress, but has not yet passed it back to the controller), Relaxation (CPLEX has found a solution that might not be integer-feasible, such as the solution to a node LP), ThreadDown (CPLEX deactivated a thread) and ThreadUp (CPLEX activated a thread).
How do you attach callbacks?
- Legacy: You attach separate callbacks of each desired type (info callback, branch callback, user cut callback, ...).
- Generic: You attach a single callback that handles all supported operations. Attaching a second callback automatically detaches the first one. When attaching the callback, you specify via a bit mask from which context(s) it is to be called.
How do you write the callback class?
- Legacy: Your class extends one of the existing callback classes. For instance, if you want a lazy constraint callback, you extend the IloCplex.LazyConstraintCallback class and override the abstract main() method.
- Generic: Your class implements the IloCplex.Callback.Function interface and overrides the abstract invoke() method.
How does your callback interact with CPLEX?
- Legacy: You call one of the methods defined for the class you extended. For example, in a branch callback (extending IloCplex.BranchCallback), you would call one of the overloads of makeBranch() to create a new branch.
- Generic: When CPLEX calls the invoke() method, it passes in as an argument an instance of IloCplex.Callback.Context. This object has a number of methods you can invoke to find out the context, access a candidate solution (if one exists), get information about what is going on (including the index of the thread in which it was invoked, and the total number of threads in use), etc. Basically, this is your key to the castle.
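To make this concrete, here is a minimal sketch (mine, not code from the CPLEX distribution) of a generic callback class and how it is attached. The class name MyCallback is arbitrary, and the bodies are just placeholders.

import ilog.concert.IloException;
import ilog.cplex.IloCplex;

// Minimal generic callback: one class, one invoke() method, dispatching
// on the context in which CPLEX called it.
class MyCallback implements IloCplex.Callback.Function {
  @Override
  public void invoke(final IloCplex.Callback.Context context) throws IloException {
    long id = context.getId();
    if (id == IloCplex.Callback.Context.Id.Candidate) {
      // examine (and possibly reject) the proposed incumbent here
    } else if (id == IloCplex.Callback.Context.Id.ThreadUp) {
      // CPLEX just activated a thread
    } else if (id == IloCplex.Callback.Context.Id.ThreadDown) {
      // CPLEX just deactivated a thread
    }
  }
}

// Somewhere after building the model (cplex is the IloCplex instance):
// the second argument to use() is the bit mask of contexts from which
// the callback should be called.
long mask = IloCplex.Callback.Context.Id.Candidate
            | IloCplex.Callback.Context.Id.ThreadUp
            | IloCplex.Callback.Context.Id.ThreadDown;
cplex.use(new MyCallback(), mask);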
Benders decomposition
For Benders decomposition, we will focus on the Candidate context (CPLEX thinks it has an integer-feasible solution, which we need to test) and the ThreadUp and ThreadDown contexts, for reasons I'll eventually get to.
I wrote about thread safety in my previous post. Let me pin down why this is critical when using Benders decomposition. With Benders, we have a master problem and one or more subproblems. (I'll stick with one subproblem here, for concreteness.) The master problem model itself is never modified. Benders cuts are added to the working copy of the master problem, from the callback. Several threads could be generating Benders cuts more or less simultaneously, but it is the responsibility of CPLEX (not your code) to make sure that multiple threads adding cuts don't trip over each other.
On the other hand, the way we generate cuts is to use the proposed incumbent solution to modify the constraint limits in the subproblem (or the subproblem objective, if you're one of those people who prefers to work with the dual of the actual subproblem). It's our code (specifically, the callback) making those changes, and if changes are being made concurrently by multiple threads, bad things can happen. Just to verify this is not hypothetical, I converted my original Benders code to the new callback system and ran it without doing anything special regarding thread safety. It crashed the JVM in a matter of seconds. Lesson learned.
So what are our options? One is to use a single copy of the subproblem and impose a locking mechanism that prevents more than one thread from accessing the subproblem at a time. That turns out to be the easiest to program in Java (at least in my opinion), but it slows the solver down due to threads being blocked. The general wisdom about locks seems to be that you should only use them around chunks of code that execute quickly. Modifying what might be a large LP model, solving it, and then extracting either a dual ray (Farkas certificate) if it is infeasible or the dual solution if it is feasible does not meet the "execute quickly" criterion.
Another option is to generate an independent copy of the subproblem each time a thread needs it. That seems wastefully slow to me. A third, more promising approach is to generate one copy of the subproblem per thread and store it, reusing it each time the callback is called in that thread. This again eats CPU cycles generating the subproblem multiple times, but once per thread is a lot less than once each time CPLEX calls the callback. A downside, however, is that this could consume a prohibitive amount of memory if the subproblem is quite large.
The revised Benders example
The latest version of my Benders example is available at https://gitlab.msu.edu/orobworld/BendersExample3. In addition to my code (and Java), you will need CPLEX 12.8 (which I am told is coming out in December) and the open-source Apache Commons CLI library (for processing command line options).
I've added some command line options to let you alter a number of things without having to alter or comment out code:
- which models to run;
- what random seed to use;
- how many trials to run (default 1);
- whether to show CPLEX log output for neither problem, just the master, or both master and subproblem;
- whether to show the formulas of the cuts generated (or just settle for messages telling you when a cut is added); and
- whether to print the full solution (which warehouses are used, what the shipment volumes are).
Run the program with "-h" or "--help" on the command line to get the full list of options.
The new code has two versions of the "manual" (callback-based) Benders method. The "-m" command line option runs the default "manual" model, while the "-m2" option runs what I inelegantly named the "thread-local manual" model. The difference is in the approach taken to assure thread safety. I'll briefly describe that below, and of course you can pore over the code if you want more details.
Locking
The default manual model avoids collisions between threads by locking parts of the code so that one thread cannot execute certain portions when any other thread is executing certain portions. There are multiple ways to do this in Java, but a simple one (which I use here) is to synchronize methods. To do this, all you have to do is add the keyword synchronized to the method declarations. (In addition to synchronizing methods, you can synchronize smaller chunks of code, but I did not use that feature.)
My ManualBenders class has a private inner class named BendersCallback. The callback is entered only from the Candidate context. Its sole method is the invoke() method mandated by the IloCplex.Callback.Function interface described earlier. By declaring that method
public synchronized void invoke(final IloCplex.Callback.Context context)
I ensure that no thread can invoke the callback while another thread is in the callback. The good news is that it works (prevents threads from interfering with each other) and it's incredibly easy to do. The bad news is that it results in threads blocking other threads. The subproblem in the example is pretty small. If it were slow to solve (or large enough to make updating bounds meaningfully slower), performance would be even worse.
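In outline, the locked version looks roughly like the following sketch. This is a sketch, not the actual repository code: masterVars, updateSubproblem() and solveAndGetCut() are stand-ins for the real master variables and subproblem logic.

private class BendersCallback implements IloCplex.Callback.Function {
  // synchronized: only one thread at a time can be inside invoke(),
  // so the single shared subproblem cannot be modified concurrently
  @Override
  public synchronized void invoke(final IloCplex.Callback.Context context)
      throws IloException {
    if (context.getId() == IloCplex.Callback.Context.Id.Candidate) {
      // get the proposed incumbent values of the master variables
      double[] candidate = context.getCandidatePoint(masterVars);
      updateSubproblem(candidate);       // adjust subproblem constraint limits
      IloRange cut = solveAndGetCut();   // solve; return a Benders cut, or null to accept
      if (cut != null) {
        context.rejectCandidate(cut);    // reject the candidate, adding the cut
      }
      // doing nothing here means the candidate is accepted
    }
  }
}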
Thread-specific copies of the subproblem
The "thread-local" model avoids thread collisions by giving each thread a separate copy of the subproblem to play with. I created a class (cleverly named Subproblem) to hold instances of the subproblem. The callback (instance of class BendersCallback2) is now entered from three different contexts, ThreadUp, ThreadDown and Candidate. As before, in the Candidate context we do the usual things: use the candidate solution to the master problem to modify constraint limits in the subproblem; solve the subproblem; depending on whether it is feasible, and what its optimal objective value is, either accept the new incumbent or reject it by adding a cut. We use calls in the ThreadUp context to create and store a copy of the subproblem for the newly activated thread, and similarly we use calls in the ThreadDown context to delete the local subproblem copy and recover its memory.
So where do we store subproblem copies so that only the thread invoking the callback can see its designated copy? In its BendersATSP2.java example code, IBM creates a vector, of length equal to the maximum number of threads that will be used, to store subproblems. The context argument to the callback provides a method, context.getIntInfo(IloCplex.Callback.Context.Info.ThreadId), that returns the zero-based index of the thread invoking the callback, which is used to pluck the correct copy of the subproblem from the vector of copies.
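A sketch of that indexing scheme (my paraphrase, not a verbatim excerpt from the IBM example; the variable names are mine):

// Before solving: size an array by an upper bound on the number of threads.
int maxThreads = cplex.getNumCores();          // one way to bound the thread count
Subproblem[] subproblems = new Subproblem[maxThreads];

// ... later, inside invoke(): locate this thread's copy via its zero-based index.
int t = context.getIntInfo(IloCplex.Callback.Context.Info.ThreadId);
Subproblem mine = subproblems[t];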
I took a slightly different approach in my code, using a generic Java class named ThreadLocal that exists precisely to hold objects local to a given thread. My BendersCallback2 class contains a private field declared
private final ThreadLocal<Subproblem> subproblem
to hold the copy of the subproblem belonging to a given thread. When the callback is called from the ThreadUp context, subproblem.set() is used to store a newly constructed copy of the subproblem; when called from the Candidate context, subproblem.get() is used to access the copy of the subproblem belonging to the thread. (Since this is all rather new, both to CPLEX and to me, and since I have a suspicious nature, I stuck some extra code in the example to ensure that subproblem is empty when called from ThreadUp and not empty when called from Candidate or ThreadDown. There are comments indicating the superfluous code. Feel free to omit it from your own projects.)
I thought that when a thread was deleted, the subproblem variable would be deleted. Since subproblem is (I think) the only pointer to the Subproblem instance for that thread, I expected the memory used by the Subproblem to be recovered during garbage collection. Apparently I was wrong: my initial attempt leaked memory (enough to shut down the JVM after solving a few problems). So I added an invocation of the callback from the ThreadDown context (thread deactivation), at which point the callback deletes the Subproblem instance. That seems to have cured the memory leak. So, in summary, the callback is called in three contexts:
- ThreadUp -- create a copy of the subproblem for the thread;
- Candidate -- test a candidate incumbent and, if necessary, generate a Benders cut; and
- ThreadDown -- delete the copy of the subproblem belonging to that thread.
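Here is a sketch of how those three contexts fit together in the callback. Again, this is a sketch rather than the repository code: buildSubproblem() and processCandidate() stand in for the real construction and cut-generation logic, and the end() call represents whatever cleanup releases the CPLEX objects held by a Subproblem.

private class BendersCallback2 implements IloCplex.Callback.Function {
  // each thread sees only its own Subproblem instance
  private final ThreadLocal<Subproblem> subproblem = new ThreadLocal<>();

  @Override
  public void invoke(final IloCplex.Callback.Context context) throws IloException {
    long id = context.getId();
    if (id == IloCplex.Callback.Context.Id.ThreadUp) {
      subproblem.set(buildSubproblem());   // fresh copy for the newly activated thread
    } else if (id == IloCplex.Callback.Context.Id.ThreadDown) {
      Subproblem copy = subproblem.get();
      if (copy != null) {
        copy.end();                        // free the CPLEX memory the copy holds
      }
      subproblem.remove();                 // drop this thread's reference
    } else if (id == IloCplex.Callback.Context.Id.Candidate) {
      // test the candidate against this thread's copy;
      // accept it, or reject it with a Benders cut
      processCandidate(context, subproblem.get());
    }
  }
}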
Performance Comparison
I ran five solution methods (Benders with locking, Benders with copies of the subproblem for each thread, and three versions of the automatic Benders decomposition added to CPLEX 12.7) on 10 instances of the fixed charge transportation problem. That's a small sample, based on a single type of problem, on a single machine (four cores, 4 GB of RAM), so I wouldn't read too much into the results. Nonetheless, I thought I would close by comparing solution times. The three automatic methods provided by CPLEX ("annotated", "automatic" and "workers", described in my 2016 post) had virtually indistinguishable times, so to make the plot more reasonable I am including only the "automatic" method (labeled "Auto"), along with my two approaches ("Local" for the method using copies of the subproblem for each thread and "Locked" for the method using synchronization).
The plot below shows, as a function of wall clock time, the fraction of trials that reached proven optimality by the specified time. The automatic approach and my manual approach with thread-local subproblems were competitive (with the automatic approach having a slight advantage), while the manual approach with synchronized class methods lagged rather badly. Although I cannot say with certainty, I am rather confident that this is a result of threads engaged in solving a subproblem blocking other threads wanting to do the same.
Finally, just to drive home the realities of multiple threads, here is a sample of the output generated by the synchronized (locked) approach during one run.
>>> Adding optimality cut
>>> Accepting new incumbent with value 3598.1162756564636
>>> Accepting new incumbent with value 3598.1162756564636
>>> Accepting new incumbent with value 4415.750134379851
>>> Accepting new incumbent with value 4415.750134379851
>>> Accepting new incumbent with value 4411.334611014113
>>> Accepting new incumbent with value 4411.334611014113
>>> Accepting new incumbent with value 4700.008131082763
>>> Accepting new incumbent with value 4700.008131082763
>>> Adding optimality cut
>>> Adding optimality cut
>>> Adding optimality cut
Final solver status = Optimal
# of warehouses used = 14
Total cost = 3598.1162756564636
Solution time (seconds) = 143.090
The objective here is minimization. The messages beginning ">>>" are printed by the callback. Notice that after the optimal solution is found and accepted (in two different threads?), provably suboptimal solutions (solutions with inferior objective values) appear to be accepted several times. This also happens with the thread-local approach. It may well be happening in the various automated Benders approaches as well, but since they do not print progress messages there is no way to tell.
This illustrates the difference between "local" and "global" progress. It happens because the worker threads do not communicate directly with each other, and communicate only sporadically with the controller thread. So if thread 3 finds the optimal solution and has it accepted, thread 1 does not know this and continues cranking through its portion of the search tree, finding new "incumbents" that are improvements over the best solution thread 1 has seen but not necessarily improvements over the best global solution. The inferior solutions are accepted in the threads in which they were found, but the controller knows not to accept them as incumbents (because it already has a better incumbent). This behavior is not an error; it is just another indication of why adding threads does not improve performance proportionally.