Monday, November 13, 2017

Benders Decomposition with Generic Callbacks

Brace yourself. This post is a bit long-winded (and arguably geekier than usual, which is saying something). Also, it involves CPLEX 12.8, which will not ship until sometime next month.

I have an updated version of an old example, solving a fixed charge transportation problem using Benders decomposition. The example (using Java, naturally) is scattered across several earlier posts, from 2012, 2014 and 2016.
Now here I am, breaking the even-year pattern by doing yet another version in 2017. If you want to see the MIP model written out in mathematical notation, it's in the 2012 post. If you want to see the code for the most recent version, which contains both a "manual" decomposition (using callbacks) and a few versions of the new (in 12.7) automatic decomposition feature, there's a link to the repository in the 2016 post. At this point, the 2014 post is not worth reading.

Before I get into the updated example code, let me outline some key differences between the old ("legacy") callback system and the new ("generic") callback system.

When is a callback called by CPLEX?
  • Legacy: This is dictated by the type of callback. A branch callback is called when CPLEX is ready to split a node into children, a lazy constraint callback is called when CPLEX identifies what it thinks is a new integer-feasible solution, and so on.
  • Generic: When you attach your callback to the model (IloCplex instance), you specify, via a bit mask, in which of several "contexts" it should be called. As of this writing (things do have a habit of changing), the contexts are Candidate (a new integer-feasible candidate for incumbent solution has been found), GlobalProgress (progress has occurred at the "global" level -- presumably something you might see in the node log), LocalProgress (a thread has made what it thinks is progress, but has not yet passed it back to the controller), Relaxation (CPLEX has found a solution that might not be integer-feasible, such as the solution to a node LP), ThreadDown (CPLEX deactivated a thread) and ThreadUp (CPLEX activated a thread).
If you are fuzzy about the difference between local and global progress, one example is a thread that finds a new incumbent solution, one that is better than the incumbent that existed when the thread was launched, but not as good as the current incumbent (recently found by a different thread). That is local progress (in the first thread) but not global progress (the incumbent found by the first thread will be ignored when it gets around to reporting it to the controller).
Attaching callbacks to models:
  • Legacy: You attach separate callbacks of each desired type (info callback, branch callback, user cut callback, ...).
  • Generic: You attach a single callback that handles all supported operations. Attaching a second callback automatically detaches the first one. When attaching the callback, you specify via a bit mask from which context(s) the callback is to be called.
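
For example, registering a callback from the Candidate, ThreadUp and ThreadDown contexts (the combination used later in this post) looks something like the following sketch, where MyCallback stands in for your own callback class:

long mask = IloCplex.Callback.Context.Id.Candidate
    | IloCplex.Callback.Context.Id.ThreadUp
    | IloCplex.Callback.Context.Id.ThreadDown;
cplex.use(new MyCallback(), mask); // replaces any previously attached callback
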
Callback classes:
  • Legacy: Your class extends one of the existing callbacks. For instance, if you want a lazy constraint callback, you extend the IloCplex.LazyConstraintCallback class and override the abstract main() method.
  • Generic: Your class implements the IloCplex.Callback.Function interface and overrides the abstract invoke() method.
Calling functions inside the callback:
  • Legacy: You call one of the methods for the class you extended. For example, in a branch callback (extending IloCplex.BranchCallback), you would call one of the overloads of makeBranch() to create a new branch.
  • Generic: When CPLEX calls the invoke() method, it passes in as an argument an instance of IloCplex.Callback.Context. This object has a number of methods you can invoke to find out the context, access a candidate solution (if one exists), get information about what is going on (including the index of the thread in which it was invoked, and the total number of threads in use), etc. Basically, this is your key to the castle.
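
Pulling the last two items together, here is a minimal skeleton of a generic callback (the class name and comments are mine; everything else is the CPLEX API):

import ilog.concert.IloException;
import ilog.cplex.IloCplex;

public class MyCallback implements IloCplex.Callback.Function {
    @Override
    public void invoke(final IloCplex.Callback.Context context)
            throws IloException {
        long id = context.getId();  // which context triggered this call?
        if (id == IloCplex.Callback.Context.Id.Candidate) {
            // a candidate incumbent is available for inspection here
        } else if (id == IloCplex.Callback.Context.Id.ThreadUp) {
            // zero-based index of the thread that was just activated
            int t = context.getIntInfo(IloCplex.Callback.Context.Info.ThreadId);
        }
        // ... handle other contexts as needed ...
    }
}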

Benders decomposition


For Benders decomposition, we will focus on the Candidate context (CPLEX thinks it has an integer-feasible solution, which we need to test) and on the ThreadUp and ThreadDown contexts, for reasons I'll eventually get to.

I wrote about thread safety in my previous post. Let me pin down why this is critical when using Benders decomposition. With Benders, we have a master problem and one or more subproblems. (I'll stick with one subproblem here, for concreteness.) The master problem model itself is never modified. Benders cuts are added to the working copy of the master problem, from the callback. Several threads could be generating Benders cuts more or less simultaneously, but it is the responsibility of CPLEX (not your code) to make sure that multiple threads adding cuts don't trip over each other.

On the other hand, the way we generate cuts is to use the proposed incumbent solution to modify the constraint limits in the subproblem (or the subproblem objective, if you're one of those people who prefers to work with the dual of the actual subproblem). It's our code (specifically, the callback) making those changes, and if changes are being made concurrently by multiple threads, bad things can happen. Just to verify this is not hypothetical, I converted my original Benders code to the new callback system and ran it without doing anything special regarding thread safety. It crashed the JVM in a matter of seconds. Lesson learned.
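
The dangerous code looks utterly innocuous. Here is a sketch (with made-up names: use for the master's binary variables, capacity and cap for the subproblem's supply constraints and their limits, subSolver for the IloCplex instance holding the subproblem):

// Candidate values of the master's binary variables.
double[] y = context.getCandidatePoint(use);
// These calls mutate the shared subproblem -- unsafe if another
// thread is doing the same thing at the same time.
for (int i = 0; i < capacity.length; i++) {
    capacity[i].setUB(cap[i] * Math.round(y[i]));
}
subSolver.solve();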

So what are our options? One is to use a single copy of the subproblem and impose a locking mechanism that prevents more than one thread from accessing the subproblem at a time. That turns out to be the easiest to program in Java (at least in my opinion), but it slows the solver down due to threads being blocked. What seems to be the general wisdom for using locks is that you should only use them on chunks of code that execute quickly. Modifying what might be a large LP model, solving it, and finding a dual ray or Farkas certificate if it is infeasible or the dual solution if it is feasible does not meet the "execute quickly" criterion.

Another option is to generate an independent copy of the subproblem each time a thread needs it. That seems wastefully slow to me. A third, more promising approach is to generate one copy of the subproblem per thread and store it, reusing it each time the callback is called in that thread. This again eats CPU cycles generating the subproblem multiple times, but once per thread is a lot less than once each time CPLEX calls the callback. A downside, however, is that this could consume a prohibitive amount of memory if the subproblem is quite large.

The revised Benders example


The latest version of my Benders example is available at https://gitlab.msu.edu/orobworld/BendersExample3. In addition to my code (and Java), you will need CPLEX 12.8 (which I am told is coming out in December) and the open-source Apache Commons CLI library (for processing command line options).

I've added some command line options that let you change a number of things without having to edit or comment out code: which models to run; what random seed to use; how many trials to run (default 1); whether to show CPLEX log output for neither problem, just the master, or both master and subproblem; whether to show the formulas of the cuts generated (or just settle for messages telling you when a cut is added); and whether to print the full solution (which warehouses are used, what the shipment volumes are). Run the program with "-h" or "--help" on the command line to get the full list of options.
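
If you have not used Commons CLI before, the pattern is pretty painless. A minimal sketch (the option names here are illustrative, not necessarily the ones the program actually defines):

import org.apache.commons.cli.*;

Options options = new Options();
options.addOption("h", "help", false, "print this message");
options.addOption("s", "seed", true, "random seed");
// parse() throws ParseException; error handling omitted here
CommandLine cmd = new DefaultParser().parse(options, args);
if (cmd.hasOption("h")) {
    new HelpFormatter().printHelp("benders", options);
}
long seed = Long.parseLong(cmd.getOptionValue("s", "12345"));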

The new code has two versions of the "manual" (callback-based) Benders method. The "-m" command line option runs the default "manual" model, while the "-m2" option runs what I inelegantly named the "thread-local manual" model. The difference is in the approach taken to assure thread safety. I'll briefly describe that below, and of course you can pore over the code if you want more details.

Locking


The default manual model avoids collisions between threads by locking parts of the code, so that one thread cannot execute those portions while another thread is executing them. There are multiple ways to do this in Java, but a simple one (which I use here) is to synchronize methods: all you have to do is add the keyword synchronized to the method declaration. (In addition to synchronizing methods, you can synchronize smaller chunks of code, but I did not use that feature.)

My ManualBenders class has a private inner class named BendersCallback. The callback is entered only from the Candidate context. Its sole method is the invoke() method mandated by the IloCplex.Callback.Function interface described earlier. By declaring that method

public synchronized void invoke(final IloCplex.Callback.Context context)

I ensure that no thread can invoke the callback while another thread is in the callback. The good news is that it works (prevents threads from interfering with each other) and it's incredibly easy to do. The bad news is that it results in threads blocking other threads. The subproblem in the example is pretty small. If it were slow to solve (or large enough to make updating bounds meaningfully slower), performance would be even worse.
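
Fleshing that declaration out a bit, the structure of the method is roughly the following, where use (the master variables), testCandidate() and makeCut() are hypothetical stand-ins for the real subproblem logic:

public synchronized void invoke(final IloCplex.Callback.Context context)
        throws IloException {
    if (context.getId() != IloCplex.Callback.Context.Id.Candidate) {
        return;
    }
    // Candidate values of the master variables.
    double[] y = context.getCandidatePoint(use);
    // Update the single, shared subproblem from y and solve it.
    if (!testCandidate(y)) {  // hypothetical: false if a cut is needed
        // Reject the candidate, supplying a Benders cut (an IloRange).
        context.rejectCandidate(makeCut());
    }
    // Returning without rejecting lets CPLEX accept the candidate.
}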

Thread-specific copies of the subproblem


The "thread-local" model avoids thread collisions by giving each thread a separate copy of the subproblem to play with. I created a class (cleverly named Subproblem) to hold instances of the subproblem. The callback (an instance of class BendersCallback2) is now entered from three different contexts: ThreadUp, ThreadDown and Candidate. As before, in the Candidate context we do the usual things: use the candidate solution to the master problem to modify constraint limits in the subproblem; solve the subproblem; and, depending on whether it is feasible and what its optimal objective value is, either accept the new incumbent or reject it by adding a cut. We use calls in the ThreadUp context to create and store a copy of the subproblem for the newly activated thread, and calls in the ThreadDown context to delete that thread's copy and recover its memory.

So where do we store subproblem copies so that only the thread invoking the callback can see its designated copy? In their BendersATSP2.java example code, IBM creates a vector, with length equal to the maximum number of threads that will be used, to store subproblems. The context argument to the callback provides a method, context.getIntInfo(IloCplex.Callback.Context.Info.ThreadId), that returns the zero-based index of the thread invoking the callback, which is used to pluck the correct copy of the subproblem from the vector of copies.
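
In outline (Subproblem here is a stand-in for whatever class holds a subproblem copy; sizing the array by cplex.getNumCores() is one plausible upper bound on the thread count):

// One slot per potential thread, filled in the ThreadUp context.
Subproblem[] subproblems = new Subproblem[cplex.getNumCores()];

// Later, inside invoke(), fetch the copy belonging to this thread:
int tid = context.getIntInfo(IloCplex.Callback.Context.Info.ThreadId);
Subproblem sub = subproblems[tid];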

I took a slightly different approach in my code, using a generic Java class named ThreadLocal that exists precisely to hold objects local to a given thread. My BendersCallback2 class contains a private field declared

private final ThreadLocal<Subproblem> subproblem

to hold the copy of the subproblem belonging to a given thread. When the callback is called from the ThreadUp context, subproblem.set() is used to store a newly constructed copy of the subproblem; when called from the Candidate context, subproblem.get() is used to access the copy of the subproblem belonging to the thread. (Since this is all rather new, both to CPLEX and to me, and since I have a suspicious nature, I stuck some extra code in the example to ensure that subproblem is empty when called from ThreadUp and not empty when called from Candidate or ThreadDown. There are comments indicating the superfluous code. Feel free to omit it from your own projects.)

I thought that when a thread was deleted, the subproblem variable would be deleted. Since subproblem is (I think) the only pointer to the Subproblem instance for that thread, I expected the memory used by the Subproblem to be recovered during garbage collection. Apparently I was wrong: my initial attempt leaked memory (enough to shut down the JVM after solving a few problems). So I added an invocation of the callback from the ThreadDown context (thread deactivation), at which point the callback deletes the Subproblem instance. That seems to have cured the memory leak. So, in summary, the callback is called in three contexts:
  • ThreadUp -- create a copy of the subproblem for the thread;
  • Candidate -- test a candidate incumbent and, if necessary, generate a Benders cut; and
  • ThreadDown -- delete the copy of the subproblem belonging to that thread.
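
In skeleton form (Subproblem is my class; its end() method, which releases the CPLEX objects it owns, is assumed), that logic looks like this:

private final ThreadLocal<Subproblem> subproblem = new ThreadLocal<>();

@Override
public void invoke(final IloCplex.Callback.Context context)
        throws IloException {
    long id = context.getId();
    if (id == IloCplex.Callback.Context.Id.ThreadUp) {
        // New thread: build and store its private copy of the subproblem.
        subproblem.set(new Subproblem());
    } else if (id == IloCplex.Callback.Context.Id.ThreadDown) {
        // Thread retiring: release the copy so memory can be reclaimed.
        subproblem.get().end();
        subproblem.remove();
    } else if (id == IloCplex.Callback.Context.Id.Candidate) {
        // Only this thread can see this copy -- no locking needed.
        Subproblem sub = subproblem.get();
        // ... update sub from the candidate point, solve it, and either
        // let the candidate stand or call context.rejectCandidate() ...
    }
}
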
I'm comfortable with this approach, but you might prefer the subproblem vector used in the BendersATSP2.java example.

Performance Comparison


I ran five solution methods (Benders with locking, Benders with copies of the subproblem for each thread, and three versions of the automatic Benders decomposition added to CPLEX 12.7) on 10 instances of the fixed charge transportation problem. That's a small sample, based on a single type of problem, on a single machine (four cores, 4 GB of RAM), so I wouldn't read too much into the results. Nonetheless, I thought I would close by comparing solution times. The three automatic methods provided by CPLEX ("annotated", "automatic" and "workers", described in my 2016 post) had virtually indistinguishable times, so to make the plot more reasonable I am including only the "automatic" method (labeled "Auto"), along with my two approaches ("Local" for the method using copies of the subproblem for each thread and "Locked" for the method using synchronization).

The plot below shows, as a function of wall clock time, the fraction of trials that reached proven optimality by the specified time. The automatic approach and my manual approach with thread-local subproblems were competitive (with the automatic approach having a slight advantage), while the manual approach with synchronized class methods lagged rather badly. Although I cannot say with certainty, I am rather confident that is a result of threads engaged in solving a subproblem blocking other threads wanting to do the same.

performance plot (completion rate versus run time, by method)

Finally, just to drive home the realities of multiple threads, here is a sample of the output generated by the synchronized (locked) approach during one run.

>>> Adding optimality cut
>>> Accepting new incumbent with value 3598.1162756564636
>>> Accepting new incumbent with value 3598.1162756564636
>>> Accepting new incumbent with value 4415.750134379851
>>> Accepting new incumbent with value 4415.750134379851
>>> Accepting new incumbent with value 4411.334611014113
>>> Accepting new incumbent with value 4411.334611014113
>>> Accepting new incumbent with value 4700.008131082763
>>> Accepting new incumbent with value 4700.008131082763
>>> Adding optimality cut
>>> Adding optimality cut
>>> Adding optimality cut
Final solver status = Optimal
# of warehouses used = 14
Total cost = 3598.1162756564636
Solution time (seconds) = 143.090

The objective here is minimization. The messages beginning ">>>" are printed by the callback. Notice that after the optimal solution is found and accepted (in two different threads?), provably suboptimal solutions (solutions with inferior objective values) appear to be accepted several times. This also happens with the thread-local approach. It may well be happening in the various automated Benders approaches, but since they do not print progress messages there is no way to tell. This illustrates the difference between "local" and "global" progress. It happens because the worker threads do not communicate with each other and communicate only sporadically with the controller thread. So if thread 3 finds the optimal solution and has it accepted, thread 1 does not know this and continues cranking through its portion of the search tree, finding new "incumbents" that are improvements over the best solution thread 1 has seen, but not necessarily improvements over the best global solution. The inferior solutions are accepted in the threads in which they were found, but the controller knows not to accept them as incumbents (because it already has a better incumbent). This behavior is not an error; it's just another indication of why adding threads does not improve performance proportionally.

14 comments:

  1. Dear Professor,
    Thanks a lot for your helpful blogpost. It is a very clear introduction to the new setup of CPLEX callbacks.
    I also liked the last comment you made, that different threads can find suboptimal feasible solutions, because of a lack of communication between the threads (which in my view is fine because this would also slow down the solving process). However, in some of the projects that I've done, this issue was present as well and consumed a lot of time. I've asked IBM and there is an easy way around this, if you don't care about deterministic results, which might be interesting: https://www.ibm.com/developerworks/community/forums/html/topic?id=e72f622a-b899-4e96-a789-254587a39c7d&ps=25

    Best,
    Gert-Jaap

    1. Thanks, Gert-Jaap. That discussion is interesting, and it suggests another possibility to me. Again, we start with a global variable storing the objective value of the best integer-feasible solution so far. I'll also assume that the objective function contains a single variable, say minimize z. (This is to reduce dual degeneracy caused by the trick I'm going to suggest.) Each time the global objective value changes, set a boolean flag for each thread. Each time the lazy constraint callback is called, it checks the flag corresponding to its thread. If it sees the flag is set, it clears the flag and adds a lazy constraint of the form z <= latest incumbent value. That will not only prevent repetitive generation of cuts at a node that should be pruned, it may also allow faster pruning of other nodes.
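
      A rough sketch of the bookkeeping (all names here are hypothetical, and the sketch shows only the callback side):

      // Shared state: the incumbent value, plus one "stale" flag per
      // thread, all set whenever the global incumbent improves.
      volatile double bestObj;
      java.util.concurrent.atomic.AtomicBoolean[] stale;

      // Inside the lazy constraint callback, for thread index t:
      if (stale[t].getAndSet(false)) {
          add(cplex.le(z, bestObj));  // lazy cut: z <= incumbent value
      }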

      I thought I would also mention that the business about cloning in that discussion only applies to legacy callbacks. Generic callbacks are not cloned by CPLEX, which is one reason that they are more temperamental about thread safety.

  2. Professor Rubin,

    First I used the original callback version of the code for my problem and it works fine. Then, I also coded the generic callback version. But it has some problems.

    For some reason the optimality cuts are not being added (master problem obj. function does not increase from zero).

    It starts with: "$$$ Thread up occurred (thread 0/8) with no subproblem in place."

    Then every time master problem is solved: "$$$ A candidate was supplied (thread 0/8) with a subproblem already in place."
    The optimality cut is reported but seems not to be added. Master problem's obj func value stays at zero. This goes on for a few iterations.

    Finally it ends with:
    $$$ Thread up occurred (thread 3/8) with no subproblem in place.
    $$$ Thread up occurred (thread 2/8) with no subproblem in place.
    $$$ Thread up occurred (thread 1/8) with no subproblem in place.
    $$$ Thread up occurred (thread 7/8) with no subproblem in place.
    $$$ Thread up occurred (thread 6/8) with no subproblem in place.
    $$$ Thread up occurred (thread 5/8) with no subproblem in place.
    $$$ Thread up occurred (thread 4/8) with no subproblem in place.

    And an error window is opened that says: "Java(TM) SE Binary Platform has stopped working"

    Do you have any suggestions as to what the problem could be?

    Thank you.

    1. I suspect it's a bug in the way you initialize your callback. Unfortunately, I can't say what that would be. The comment section of a blog is not conducive to debugging software.

  3. Hi again Professor Rubin. I am trying to get some global info from the problem such as the iteration number, best incumbent objective function value (upper bound for a minimization problem), best bound (lower bound) and hence relative gap, total number of iterations. I am trying to use context to retrieve that information, but it does not work. For example to get the iteration number I used context.getIntInfo(IloCplex.Callback.Context.Info.IterationCount), but I guess this is the simplex iteration number (I don't know why that would be needed). I would be happy if you could help me on that. Thank you.

    Best regards,

    Vedat

    1. This is consistent with the log output, where the iteration count is simplex iterations. If what you mean by "iteration number" is the number of nodes processed so far, that is given by IloCplex.Callback.Context.Info.NodeCount.
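
      For example (I believe the node count is a long-valued info item, so you would use the long-valued getter; check the documentation to be sure):

      long nodes = context.getLongInfo(IloCplex.Callback.Context.Info.NodeCount);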

  4. Dear Prof Rubin

    Thanks for your valuable blog. I have a couple of questions about thread considerations when I use the callbacks. I coded the single-search-tree version of Benders decomposition in CPLEX 12.6 and I used a lazy constraint callback to add Benders cuts. When I use CPLEX 12.6, the code crashes without the following line, which forces CPLEX to use one thread to solve the master problem:
    cplex_master.setParam(IloCplex::Threads, 1).

    But when I use the same code in CPLEX 12.9 and remove that line, the code works properly to optimality. 1) What is the meaning of this? Does this mean that CPLEX 12.9 handles the thread safety issue by itself to some extent? Or does my type of problem not need the generic callback thread safety considerations mentioned in this page? 2) Essentially, how can I find out whether the code is running single- or multi-threaded? Note that I use C++ and I have multiple linear Benders subproblems.
    Also, when I am using CPLEX 12.9, the runtime with and without the mentioned line is almost the same. 3) Is this meaningful?

    4) I have another question as well: is the annotated Benders decomposition (built-in Benders) run in multi-thread mode or single-thread mode?

    Thank you so much in advance.
    AmiZi

    1. If you are running the same code with CPLEX 12.6 and 12.9, then you must be using legacy callbacks. Generic callbacks do not exist in CPLEX 12.6. So you may have bumped into a bug in 12.6 that was fixed by 12.9, or your code may have a thread-related bug that triggers with 12.6 but for some reason does not trigger (at least in the problem you are solving) with 12.9.

      To answer your second question (if I understand it correctly), the CPLEX log will tell you how many threads it is using. For your third question, if 12.9 is using multiple threads without that line (check the log) and takes about the same amount of time, then multi-threading is not very useful for that example. Using multiple threads incurs some overhead, and apparently you are not getting enough benefit from parallelism to offset that cost. This can be problem-specific, so the next time you use that code (with a different problem) things might change. As far as annotated Benders goes, I think that the number of threads used is based on your setting of the thread count parameter (just like any other CPLEX run).

  5. Thank you so much for your comprehensive response. There is still one thing unclear to me. As mentioned in Xavier Nodet's slides regarding legacy callbacks, they default to single-threaded mode, but even if I manually set them to run with more than one thread, I still get correct results without any crash, although the runtime is not that much different. 1) Is CPLEX running in parallel mode, or does it ignore my parameter setting since its default is single-threaded?

    So, my question is: 2) since I am able to run a legacy callback in parallel mode, is there still any point in using the generic one? Are their performances the same in this mode (for my problem)?

    I would be thankful if you could clarify my understanding.

    Regards

    1. As I said before, the CPLEX log will tell you how many threads it is using. That is true regardless of whether or not you are using generic callbacks. As to your not getting any crashes, as long as your callback does not alter any storage outside the callback itself (i.e., change values of variables that live outside the callback), thread safety should not be an issue. It's when you alter global storage that you have to be more careful with generic callbacks than with their legacy counterparts.

      As far as relative performance, legacy callbacks disable dynamic search while generic callbacks allow CPLEX to use dynamic search. If you disable dynamic search (via a parameter), that does not matter. Dynamic search tends to improve solution time (not always guaranteed, but definitely a tendency), so the fact that generic callbacks allow CPLEX to continue to use it tends to make them somewhat faster (in some, maybe most, cases, but not in all cases).

  6. Dear Professor Rubin,
    I learned a lot from your blogs. In my problem, the subproblems are time-consuming. I have S independent scenario subproblems. I run the model with the classic BD algorithm (iteratively solving MP and SP) and with branch-and-Benders-cut (B&BC). B&BC needs more CPU time because it calls the subproblems more often in the lazy and user cut callbacks. I was thinking about using parallel computing to solve the S subproblems using OpenMP. I am wondering if generic callbacks perform the same as parallel computing or not (can each thread solve one subproblem?). Which one is easier to learn in your opinion: OpenMP or generic callbacks?
    Thank you so much in advance.

    1. To start, I have no experience at all with OpenMP, so I cannot offer any opinions there.

      I'm not sure why subproblems would be solved in the user cut callback. You only want to generate Benders cuts in the lazy constraint callback (or in the candidate context in a generic callback).

      Generic callbacks will not automatically solve subproblems in parallel. Depending on what programming language you are using, you can probably create parallel worker threads (in either a lazy constraint callback or generic callback) and have each thread solve a subproblem.

      I'm not sure you necessarily want to solve every subproblem every time the master spits out a candidate solution. You might try solving subproblems sequentially inside the callback. The first time one finds a cut that eliminates the candidate, add that cut and go back to the master (skipping the remaining subproblems). I might be tempted to solve the subproblems in a loop, starting with the subproblem after the one that generated the last cut, or maybe using a randomized order each time. That may or may not increase the number of master problem nodes visited, but it might reduce overall compute time.
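
      A sketch of that loop (everything here is hypothetical: nSubs subproblems, lastCut remembering which one produced the most recent cut, and solveSubproblem() returning a violated cut or null):

      // Start with the subproblem after the one that produced the last cut.
      for (int k = 0; k < nSubs; k++) {
          int s = (lastCut + 1 + k) % nSubs;
          IloRange cut = solveSubproblem(s, y);  // null if no violated cut
          if (cut != null) {
              context.rejectCandidate(cut);
              lastCut = s;
              break;  // skip the remaining subproblems this time around
          }
      }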

    2. Thank you so much for your tips. I add the user cut callback when the solution is fractional to accelerate the convergence of the algorithm. I tried your suggestion to add the first violated cut and skip the remaining subproblems, but the objective function and decision variables are different.

    3. I don't know what you mean by "are different" -- different than what? If you are saying that the final solution has the wrong objective value, that should not happen so long as you always add at least one violated cut when any subproblem has one.


Due to intermittent spamming, comments are being moderated. If this is your first time commenting on the blog, please read the Ground Rules for Comments. In particular, if you want to ask an operations research-related question not relevant to this post, consider asking it on Operations Research Stack Exchange.