Friday, June 29, 2018

Does Computer Science Help with OR?

Fair warning: tl;dr.

After reading a blog post yesterday by John D. Cook, "Does computer science help you program?", I decided to throw in my two cents (convert to euros at your own risk) on a related topic: does computer science (which I will extend to include programming) help you as an OR/IE/management science/analytics professional? What follows is a mix of opinion and experience, and hence cannot be tested rigorously. (In other words, it's a blog post.)

Programming


The obvious starting point is programming. Do you need to know how to program to work in OR (and related fields -- to cut down on typos, I'm going to use "OR" as an umbrella acronym)? In many (most?) cases, yes! Some people with boring jobs may only have to shove data into a statistics program with a graphical user interface and push buttons, or write optimization models in a high-level language (technically a form of coding, but I'll discount that) and summon the optimizer, but many will have to do tasks that are too complicated for high-level development environments, too quirky, or just not suited to them. Unless you magically find yourself endowed with limitless programming support, sooner or later (most likely sooner) you will need to do some coding. (Note to faculty: even if you have hot and cold running students, don't assume that means adequate programming support. I taught simulation coding to doctoral students in a previous life. One of them wrote a program that gave "spaghetti code" a whole new meaning. It looked like a street map of downtown Rome after an earthquake.)

I've done a ton of coding in my time, so perhaps I'm a bit biased. Still, I cannot recall the last OR project I did (research, application, teaching tool or whatever) that did not involve significant coding.

Languages


John D. Cook once did a blog post (I don't have the link at hand) about how many languages an analyst or applied mathematician needed to know. I forget the details, but the gist was "one for every type of programming task you do". So here's my toolkit:
  • one scripting language, for quick or one-off tasks (R for me; others may prefer Python or whatever);
  • one language suited for data manipulation/analysis/graphics (R again for me);
  • one language for "substantial" computational tasks (Java for me, C++ for a lot of people, Julia for some recent adopters, MATLAB or equivalent for some people, ...); and
  • one language for dealing with databases (SQL for me, although "SQL" is like saying "Indian" ... there are a lot of "dialects").
In case you're wondering how the first two bullets differ (since I use R for both), here's a recent example of the first bullet. A coauthor-to-be and I received a data set from the author of an earlier paper. One file was a MATLAB script with data embedded, and another looked like the output of a MATLAB script. We needed to extract the parts relevant to our work from both, smush them together in a reasonable way, and reformat the useful parts into the file format our program expects. That does not exactly call for an object-oriented program, and using a script allowed me to test individual bits and see if they did what I expected (which was not always the case).

Parallel computation


I went a long time knowing hardly anything about this, because I was using earlier computing devices where parallelism was out of the question. Today, though, it is pretty easy to find yourself tackling problems where parallel threads or parallel processes are hard to avoid. This includes, but is not limited to, writing programs with a graphical user interface, where doing heavy number crunching in the main thread will freeze the GUI and seriously frustrate the user. I just finished (knock on virtual wood) a Java program for a research project. During the late stages, while tweaking some code related to a heuristic that runs parallel threads and also uses CPLEX, I managed to (a) fill up main memory once (resulting in disk thrashing and essentially paralyzing not only the program interface but the operating system interface ... meaning I couldn't even kill the program) and (b) crash the Java virtual machine (three times, actually). So, trust me, understanding parallel processing can really be important.
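For the GUI case, the usual fix is to push the number crunching onto a worker thread so the event dispatch thread stays responsive. Here's a minimal Java/Swing sketch; the crunchNumbers method is just a stand-in for whatever the real work (solving a model, running a heuristic) would be.

```java
import javax.swing.JButton;
import javax.swing.JLabel;
import javax.swing.SwingWorker;

// Hypothetical "solve" button handler: the heavy lifting runs in a background
// thread, so the Swing event dispatch thread (and hence the GUI) stays alive.
public class SolveAction {
    public static void wire(JButton solveButton, JLabel statusLabel) {
        solveButton.addActionListener(e -> {
            solveButton.setEnabled(false);
            statusLabel.setText("Solving...");
            new SwingWorker<Double, Void>() {
                @Override
                protected Double doInBackground() {
                    return crunchNumbers();  // long-running work, off the GUI thread
                }

                @Override
                protected void done() {      // runs back on the GUI thread
                    try {
                        statusLabel.setText("Objective = " + get());
                    } catch (Exception ex) {
                        statusLabel.setText("Solve failed: " + ex.getMessage());
                    } finally {
                        solveButton.setEnabled(true);
                    }
                }
            }.execute();
        });
    }

    // Stand-in for the actual model-solving code.
    private static double crunchNumbers() {
        double sum = 0;
        for (long i = 1; i <= 100_000_000L; i++) sum += 1.0 / i;
        return sum;
    }
}
```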


"Theoretical" Computer Science


This is what John Cook was after in his informal poll. Here, my answer is that some parts are very useful, others pretty much not useful at all, and maybe some stuff in between.

Data structures


As a graduate student (in math, on the quarter system), I took a three-course sequence required of junior-year computer science majors. One course concentrated on how to write operating systems. It's perhaps useful to have a basic knowledge of what an operating system does, but I'm pretty sure an OR person can get that from reading computer magazines or something, without taking a course in it. I've totally forgotten what another of the courses covered, which suggests that maybe it was not crucial to my professional life.

The third course focused on data structures, and while much of what I learned has probably become obsolete (does anyone still use circular buffers?), it's useful both to know the basics and to understand some of the concerns related to different data structures. More than once, while hacking Java code, I've had to give some thought to whether I would be doing a lot of "give me element 7" (which favors array-like structures) versus "give me the smallest element" (favoring tree-like structures) versus "the same thing is going to get added a zillion times, and you only need to keep one copy" (set interfaces, built over who knows what kind of structure).
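For what it's worth, here is a toy Java illustration of those three access patterns and the collection types that favor each (the data are made up, obviously):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

public class StructureChoices {
    public static void main(String[] args) {
        // "Give me element 7": an array-backed list gives O(1) random access.
        List<Double> costs = new ArrayList<>(List.of(3.0, 1.5, 7.2, 0.4, 9.9, 2.2, 5.5, 8.8));
        double seventh = costs.get(7);

        // "Give me the smallest element": a heap (priority queue) gives
        // O(log n) insertion and cheap access to the minimum.
        PriorityQueue<Double> pending = new PriorityQueue<>(costs);
        double smallest = pending.peek();

        // "The same thing gets added a zillion times, keep one copy":
        // a set silently ignores duplicate insertions.
        Set<Double> distinct = new HashSet<>();
        for (double c : costs) {
            distinct.add(c);
            distinct.add(c);  // the repeat is a no-op
        }

        System.out.println(seventh + " " + smallest + " " + distinct.size());
    }
}
```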

Complexity


Another computer science topic you need to know is computational complexity, for a variety of reasons. The first is that, at least if you work in optimization, you will be bombarded by statements about how this problem or that is NP-ridiculous, or this algorithm or that is NP-annoying. (The latter is total b.s. -- NP-hardness is a property of the problem, not the algorithm. Nonetheless, I occasionally see statements like that.) It's important to know both what NP-hardness is (a sign that medium to large problem instances might get pretty annoying) and what it is not (a sign of the apocalypse, or an excuse for skipping the attempt to get an exact solution and immediately dragging out your favorite metaheuristic). You should also understand what a proof of NP-hardness entails, which is more than just saying "it's an integer program".

Beyond NP silliness, though, you need to understand what $O(\dots)$ complexity means, and in particular the fact that an $O(n^2)$ algorithm is slower than an $O(n \log(n))$ algorithm for large $n$, but possibly faster for small $n$. This can help in choosing alternative ways to do computational tasks in your own code.
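A standard example of exploiting that crossover is a hybrid sort that falls back to an $O(n^2)$ insertion sort on small subarrays, which is roughly what library sorts do. The sketch below is illustrative only; the cutoff value is a guess you would tune empirically.

```java
import java.util.Arrays;

// Merge sort is O(n log n) and insertion sort is O(n^2), but insertion sort's
// small constant factor makes it faster on tiny ranges, so we switch below a
// (hypothetical) cutoff.
public class HybridSort {
    private static final int CUTOFF = 16;  // assumed threshold; tune it

    public static void sort(double[] a) {
        mergeSort(a, 0, a.length - 1, new double[a.length]);
    }

    private static void mergeSort(double[] a, int lo, int hi, double[] buf) {
        if (hi - lo < CUTOFF) {            // small range: the quadratic algorithm wins
            insertionSort(a, lo, hi);
            return;
        }
        int mid = (lo + hi) >>> 1;
        mergeSort(a, lo, mid, buf);
        mergeSort(a, mid + 1, hi, buf);
        merge(a, lo, mid, hi, buf);
    }

    private static void insertionSort(double[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            double key = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > key) a[j + 1] = a[j--];
            a[j + 1] = key;
        }
    }

    private static void merge(double[] a, int lo, int mid, int hi, double[] buf) {
        System.arraycopy(a, lo, buf, lo, hi - lo + 1);
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if (i > mid) a[k] = buf[j++];
            else if (j > hi) a[k] = buf[i++];
            else a[k] = (buf[j] < buf[i]) ? buf[j++] : buf[i++];
        }
    }

    public static void main(String[] args) {
        double[] x = {9.1, 3.2, 7.7, 0.5, 4.4, 8.8, 1.1};
        sort(x);
        System.out.println(Arrays.toString(x));
    }
}
```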

Finite precision


This may turn up in a numerical analysis class in a mathematics program, or in a computer science course, but either way you really need to understand what rounding and truncation error are, how accurate double-precision floating point arithmetic is, etc. First, this will help you avoid the embarrassment of asking questions like "Why does the solver say my 0-1 variable is 0.999999999999, which is clearly an error?". Second, it will give you an appreciation for why doing things with ill-conditioned ("stiff") matrices can create remarkably goofy results from very straightforward-looking problems. That, in turn, will help you understand why "big M" methods are both a boon and a curse in optimization.
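If you've never been bitten by this, a few lines of Java make the point. (The $10^{-6}$ tolerance is just an illustrative choice, similar in spirit to a solver's integrality tolerance.)

```java
public class FloatingPointDemo {
    public static void main(String[] args) {
        // Rounding error: 0.1 has no exact binary representation.
        System.out.println(0.1 + 0.2);                // 0.30000000000000004
        System.out.println(0.1 + 0.2 == 0.3);         // false

        // Why a solver may report a 0-1 variable as 0.999999999999:
        // it only promises integrality to within a tolerance.
        double x = 0.999999999999;
        boolean isOne = Math.abs(x - 1.0) <= 1e-6;    // compare with a tolerance
        System.out.println("treated as 1? " + isOne); // true

        // Absorption: adding a small number to a huge one can have no effect.
        double big = 1.0e16;
        System.out.println((big + 1.0) - big);        // 0.0 -- the 1.0 was lost
    }
}
```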

I may have more to say on this at a later date, but right now my computer is running out of electrons, so I'll quit here.

Tuesday, June 19, 2018

Callback Cuts That Repeat

The following post is specific to the CPLEX integer programming solver. I have no idea whether it applies to other solvers, or even which other solvers have cut callbacks.

Every so often, a user will discover that a callback routine they wrote has "rediscovered" a cut it previously generated. This can be a bit concerning at first, for a few reasons: it seems implausible, and therefore raises concerns about a bug someplace; it represents repeated work, and thus wasted effort; and it suggests at least the possibility of getting stuck in a loop, the programmatic equivalent of "Groundhog Day". (Sorry, couldn't resist.) As it happens, though, repeating the same cut can legitimately happen (though hopefully not too often).

First, I should probably define what I mean by callbacks here (although I'm tempted to assume that if you don't know what I mean, you've already stopped reading). If you want to get your geek on, feel free to wade through the Wikipedia explanation of a callback function. In the context of CPLEX, what I'm referring to is a user-written function that CPLEX will call at certain points in the search process. I will focus on functions called when CPLEX is generating cuts to tighten the bound at a given node of the search tree (a "user cut callback" in their terminology) or when CPLEX thinks it has identified an integer-feasible solution better than the current incumbent (a "lazy constraint callback"). That terminology pertains to versions of CPLEX prior to 12.8, when those were two separate (now "legacy") callbacks. As of CPLEX version 12.8, there is a single ("generic") type of callback, but generic callbacks continue to be called from multiple "contexts", including those two (solving a node LP, checking a new candidate solution).

The purpose here is to let the user either generate a bound-tightening constraint ("user cut") using particular knowledge of the problem structure, or vet a candidate solution and, if it is unsuitable, issue a new constraint ("lazy constraint") that cuts off that solution, again based on problem information not part of the original IP model. The latter is particularly common with decomposition techniques such as Benders decomposition.
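To make that concrete, here is a rough sketch of a generic callback in the Java API that vets candidate solutions and rejects bad ones with a no-good cut. The separation logic (violatesSideConditions) is a placeholder for whatever problem knowledge you have, the cut is just the generic no-good cut, and the method names reflect my reading of the 12.8+ API, so check the documentation rather than taking my word for it. (In a heavily threaded run you may also want to guard use of the shared modeling object.)

```java
import ilog.concert.IloException;
import ilog.concert.IloLinearNumExpr;
import ilog.concert.IloNumVar;
import ilog.cplex.IloCplex;

// Sketch of a generic callback (CPLEX >= 12.8) that checks each candidate
// incumbent and, if it violates some side condition, rejects it with a
// lazy constraint.
public class MyCallback implements IloCplex.Callback.Function {
    private final IloCplex cplex;
    private final IloNumVar[] x;   // binary variables from the master model

    public MyCallback(IloCplex cplex, IloNumVar[] x) {
        this.cplex = cplex;
        this.x = x;
    }

    @Override
    public void invoke(IloCplex.Callback.Context context) throws IloException {
        if (context.inCandidate()) {
            // CPLEX thinks it has a new incumbent; check it.
            double[] xVal = new double[x.length];
            for (int i = 0; i < x.length; i++) {
                xVal[i] = context.getCandidatePoint(x[i]);
            }
            if (violatesSideConditions(xVal)) {
                // No-good cut: sum_{i: x_i=1} (1 - x_i) + sum_{i: x_i=0} x_i >= 1,
                // rearranged so the constant term sits on the right-hand side.
                IloLinearNumExpr lhs = cplex.linearNumExpr();
                for (int i = 0; i < x.length; i++) {
                    if (xVal[i] > 0.5) lhs.addTerm(-1.0, x[i]); else lhs.addTerm(1.0, x[i]);
                }
                double rhs = 1.0 - Math.round(sum(xVal));  // 1 - |{i: x_i = 1}|
                context.rejectCandidate(cplex.ge(lhs, rhs));
            }
        }
    }

    // Placeholder for problem-specific logic (e.g., a Benders subproblem).
    private boolean violatesSideConditions(double[] xVal) { return false; }

    private static double sum(double[] v) {
        double s = 0; for (double d : v) s += d; return s;
    }
}
// Attach with: cplex.use(new MyCallback(cplex, x), IloCplex.Callback.Context.Id.Candidate);
```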

So why would the same user cut or lazy constraint be generated more than once (other than due to a bug)? There are at least two, and possibly three, explanations.

Local versus Global


A user cut can be either local or global. The difference is whether the cut is valid in the original model ("global") or whether it is contingent on branching decisions that led to the current node ("local"). The user can specify in the callback whether a new cut is global or local. If the cut is specified as local, it will be enforced only at descendants of the current node.

That said, it is possible that a local cut might be valid at more than one node of the search tree, in which case it might be rediscovered when those other nodes are visited. In the worst case, if the cut is actually globally valid but for some reason added with the local flag set, it may be rediscovered quite a few times.
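In the generic callback API, the local/global choice is just an argument supplied when the cut is added. A tiny illustrative sketch follows; the cut $x + y \le 1$ is made up, and the signature again reflects my recollection of the Java API, so verify it against the documentation.

```java
import ilog.concert.IloException;
import ilog.concert.IloNumVar;
import ilog.concert.IloRange;
import ilog.cplex.IloCplex;

public class CutFlags {
    // Illustrative only: add a user cut, flagged as global or local.  A cut
    // flagged local is enforced only in the subtree below the current node,
    // so it may be "rediscovered" elsewhere in the tree.
    static void addCut(IloCplex cplex, IloCplex.Callback.Context context,
                       IloNumVar x, IloNumVar y, boolean globallyValid) throws IloException {
        IloRange cut = cplex.le(cplex.sum(x, y), 1.0);   // hypothetical cut x + y <= 1
        context.addUserCut(cut,
                IloCplex.CutManagement.UseCutPurge,      // let CPLEX purge it if useless
                !globallyValid);                         // final argument: the "local" flag
    }
}
```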

Parallel Threading


On a system with multiple cores (or multiple processors), using parallel threads can speed CPLEX up. Parallel threads, though, are probably the most common cause for cuts and lazy constraints repeating.

The issue is that threads operate somewhat independently and only synchronize periodically. So suppose that thread A triggers a callback that generates a globally valid user cut or lazy constraint, which is immediately added to the problem A is solving. Thread B, which is working on a somewhat different problem (from a different part of the search tree), is unaware of the new cut/constraint until it reaches a synchronization point (and finds out what A and any other sibling threads have been up to). Prior to that, B might stumble on the same cut/constraint. Since A and B are calling the same user callback function, if the user is tracking what has been generated inside that function, the function will register a repetition. This is normal (and harmless).
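One practical implication: if your callback keeps its own record of the cuts it has generated, that record is shared by every thread calling the callback, so it should be thread-safe, and a repeat should be treated as a non-event rather than an error. A minimal sketch (using a string signature as the key is just one possible choice):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Thread-safe record of cuts the callback has already produced.
public class CutRegistry {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true if the cut is new, false if some thread already produced it. */
    public boolean record(String cutSignature) {
        return seen.add(cutSignature);   // add() is atomic and returns false on repeats
    }
}
```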

Cut Tables


This last explanation is one I am not entirely sure about. When cuts and lazy constraints are added, CPLEX stores them internally in some sort of table. I believe that it is possible in some circumstances for a callback function to be called before the table is (fully) checked, in which case the callback might generate a cut or constraint that is already in the table. Since this deals with the internal workings of CPLEX (the secret sauce), I don't know first-hand if this is true or not ... but if it is, it is again slightly wasteful of time but generally harmless.

User Error


Of course, none of this precludes the possibility of a bug in the user's code. If, for example, the user reacts to a candidate solution with a lazy constraint that is intended to cut off that solution but does not (due to incorrect formulation or construction), CPLEX will register the user constraint, notice that the solution is still apparently valid, and give the callback another crack at it. (At least that is how it worked with legacy callbacks, and I think that is how it works with generics.) Seeing the same solution, the user callback might generate the same (incorrect) lazy constraint, and off the code goes, chasing its own tail.
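A cheap defense is to check, before handing the constraint to CPLEX, that the candidate solution actually violates it. Something along these lines, assuming a constraint of the form $a'x \le b$ and a made-up tolerance:

```java
// Defensive check (illustrative): evaluate the proposed lazy constraint
// a'x <= b at the candidate point.  If the candidate does NOT violate it,
// the constraint cannot cut that solution off, and the callback will just
// see the same candidate again -- the tail-chasing described above.
public class CutSanityCheck {
    public static void assertCutsOff(double[] a, double b, double[] candidate) {
        double lhs = 0.0;
        for (int i = 0; i < a.length; i++) lhs += a[i] * candidate[i];
        if (lhs <= b + 1e-6) {   // tolerance for floating point slack
            throw new IllegalStateException(
                "Lazy constraint does not cut off the candidate (lhs=" + lhs + ", rhs=" + b + ")");
        }
    }
}
```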