Friday, June 29, 2018

Does Computer Science Help with OR?

Fair warning: tl/dr.

After reading a blog post yesterday by John D. Cook, "Does computer science help you program?", I decided to throw in my two cents (convert to euros at your own risk) on a related topic: does computer science (which I will extend to include programming) help you as an OR/IE/management science/analytics professional? What follows is a mix of opinion and experience, and hence cannot be tested rigorously. (In other words, it's a blog post.)

Programming


The obvious starting point is programming. Do you need to know how to program to work in OR (and related fields -- to cut down on typos, I'm going to use "OR" as an umbrella acronym)? In many (most?) cases, yes! Some people with boring jobs may only have to shove data into a statistics program with a graphical user interface and push buttons, or write optimization models in a high-level language (technically a form of coding, but I'll discount that) and summon the optimizer, but many will have to do tasks that are too complicated for high-level development environments, too quirky, or just not well suited to them. Unless you magically find yourself endowed with limitless programming support, sooner or later (most likely sooner) you will need to do some coding. (Note to faculty: even if you have hot and cold running students, don't assume that means adequate programming support. I taught simulation coding to doctoral students in a previous life. One of them wrote a program that gave "spaghetti code" a whole new meaning. It looked like a street map of downtown Rome after an earthquake.)

I've done a ton of coding in my time, so perhaps I'm a bit biased. Still, I cannot recall the last OR project I did (research, application, teaching tool or whatever) that did not involve significant coding.

Languages


John D. Cook once did a blog post (I don't have the link at hand) about how many languages an analyst or applied mathematician needed to know. I forget the details, but the gist was "one for every type of programming task you do". So here's my toolkit:
  • one scripting language, for quick or one-off tasks (R for me; others may prefer Python or whatever);
  • one language suited for data manipulation/analysis/graphics (R again for me);
  • one language for "substantial" computational tasks (Java for me, C++ for a lot of people, Julia for some recent adopters, MATLAB or equivalent for some people, ...); and
  • one language for dealing with databases (SQL for me, although "SQL" is like saying "Indian" ... there are a lot of "dialects").
In case you're wondering how the first two bullets differ (since I use R for both), here's a recent example of the first bullet. A coauthor-to-be and I received a data set from the author of an earlier paper. One file was a MATLAB script with data embedded, and another looked like the output of a MATLAB script. We needed to extract the parts relevant to our work from both, smush them together in a reasonable way, and reformat the useful parts into the file format our program expects. That does not exactly call for an object-oriented program, and using a script allowed me to test individual bits and see if they did what I expected (which was not always the case).

Parallel computation


I went a long time knowing hardly anything about this, because I was using earlier computing devices where parallelism was out of the question. Today, though, it is pretty easy to find yourself tackling problems where parallel threads or parallel processes are hard to avoid. This includes, but is not limited to, writing programs with a graphical user interface, where doing heavy number crunching in the main thread will freeze the GUI and seriously frustrate the user. I just finished (knock on virtual wood) a Java program for a research project. During the late stages, while tweaking some code related to a heuristic that runs parallel threads and also uses CPLEX, I managed to (a) fill up main memory once (resulting in disk thrashing and essentially paralyzing not only the program interface but the operating system interface ... meaning I couldn't even kill the program) and (b) crash the Java virtual machine (three times, actually). So, trust me, understanding parallel processing can really be important.
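For what it's worth, here is a minimal sketch (mine, not code from the project described above; the class and method names are made up) of the basic idea: push the heavy lifting onto a bounded pool of worker threads so that the main (GUI) thread stays responsive and the number of simultaneous workers is capped.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class HeuristicRunner {
        public static void main(String[] args) throws Exception {
            // Bound the pool at one worker per core, so a multithreaded heuristic
            // cannot spawn enough workers to choke memory or freeze the interface.
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            List<Future<Double>> runs = new ArrayList<>();
            for (int seed = 0; seed < 8; seed++) {
                final int s = seed;
                // Each submitted task is one (hypothetical) run of the heuristic.
                Callable<Double> task = () -> runHeuristic(s);
                runs.add(pool.submit(task));
            }

            double best = Double.POSITIVE_INFINITY;
            for (Future<Double> f : runs) {
                best = Math.min(best, f.get()); // blocks until that run finishes
            }
            pool.shutdown();
            System.out.println("best objective value found: " + best);
        }

        // Stand-in for a heavy computation; a real program would do actual work here.
        private static double runHeuristic(int seed) {
            double x = seed + 1.0;
            for (int i = 0; i < 1_000_000; i++) {
                x = Math.sqrt(x) + 1.0;
            }
            return x;
        }
    }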


"Theoretical" Computer Science


This is what John Cook was after in his informal poll. Here, my answer is that some parts are very useful, others pretty much useless, and some fall somewhere in between.

Data structures


As a graduate student (in math, on the quarter system), I took a three-course sequence required for junior-year computer science majors. One course concentrated on how to write operating systems. It's perhaps useful to have a basic knowledge of what an operating system does, but I'm pretty sure an OR person can get that from reading computer magazines or something, without taking a course in it. I've totally forgotten what another of the courses covered, which suggests that maybe it was not crucial to my professional life.

The third course focused on data structures, and while much of what I learned has probably become obsolete (does anyone still use circular buffers?), it's useful both to know the basics and to understand some of the concerns related to different data structures. More than once, while hacking Java code, I've had to give some thought to whether I would be doing a lot of "give me element 7" (which favors array-like structures) versus "give me the smallest element" (favoring tree-like structures) versus "the same thing is going to get added a zillion times, and you only need to keep one copy" (set interfaces, built over who knows what kind of structure).
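By way of illustration (this snippet is mine, not from any project mentioned here), the three access patterns above map naturally onto three standard Java collections:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.PriorityQueue;
    import java.util.Set;

    public class StructureChoices {
        public static void main(String[] args) {
            // "Give me element 7": an array-backed list has O(1) access by index.
            List<Double> byIndex =
                new ArrayList<>(List.of(3.2, 1.5, 9.9, 0.4, 7.7, 2.2, 5.1, 8.8));
            System.out.println("element 7: " + byIndex.get(7));

            // "Give me the smallest element": a heap (PriorityQueue) or a TreeSet
            // keeps the minimum handy at O(log n) per insertion or removal.
            PriorityQueue<Double> byValue = new PriorityQueue<>(byIndex);
            System.out.println("smallest: " + byValue.peek());

            // "The same thing gets added a zillion times, keep one copy":
            // a set silently ignores duplicates.
            Set<Double> unique = new HashSet<>();
            unique.add(1.0);
            unique.add(1.0);
            unique.add(1.0);
            System.out.println("copies kept: " + unique.size()); // 1
        }
    }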

Complexity


Another computer science topic you need to know is computational complexity, for a variety of reasons. The first is that, at least if you work in optimization, you will be bombarded by statements about how this problem or that is NP-ridiculous, or this algorithm or that is NP-annoying. (The latter is total b.s. -- NP-hardness is a property of the problem, not the algorithm. Nonetheless, I occasionally see statements like that.) It's important to know both what NP-hardness is (a sign that medium to large problem instances might get pretty annoying) and what it is not (a sign of the apocalypse, or an excuse for skipping the attempt to get an exact solution and immediately dragging out your favorite metaheuristic). You should also understand what a proof of NP-hardness entails, which is more than just saying "it's an integer program".

Beyond NP silliness, though, you need to understand what $O(\dots)$ complexity means, and in particular the fact that an $O(n^2)$ algorithm is slower than an $O(n \log(n))$ algorithm for large $n$, but possibly faster for small $n$. This can help in choosing alternative ways to do computational tasks in your own code.
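As a toy example (mine, purely for illustration), here are two ways to check a vector for duplicate entries: a brute-force $O(n^2)$ pairwise comparison and an $O(n \log n)$ sort-then-scan. For a handful of elements the quadratic version may well be faster (no copy, no sorting overhead); for large $n$ the sorted version wins decisively.

    import java.util.Arrays;

    public class DuplicateCheck {
        // O(n^2): compare every pair. Trivial to write, fine for small n.
        static boolean hasDuplicateQuadratic(double[] a) {
            for (int i = 0; i < a.length; i++) {
                for (int j = i + 1; j < a.length; j++) {
                    if (a[i] == a[j]) return true;
                }
            }
            return false;
        }

        // O(n log n): sort a copy, then scan adjacent entries once.
        static boolean hasDuplicateSorted(double[] a) {
            double[] copy = a.clone();
            Arrays.sort(copy);
            for (int i = 1; i < copy.length; i++) {
                if (copy[i] == copy[i - 1]) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            double[] small = {3, 1, 4, 1, 5};
            System.out.println(hasDuplicateQuadratic(small)); // true
            System.out.println(hasDuplicateSorted(small));    // true
        }
    }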

Finite precision


This may turn up in a numerical analysis class in a mathematics program, or in a computer science course, but either way you really need to understand what rounding and truncation error are, how accurate double-precision floating point arithmetic is, etc. First, this will spare you the embarrassment of asking questions like "Why does the solver say my 0-1 variable is 0.999999999999, which is clearly an error?". Second, it will give you an appreciation for why doing things with ill-conditioned ("stiff") matrices can produce remarkably goofy results from very straightforward looking problems. That, in turn, will help you understand why "big M" methods are both a boon and a curse in optimization.
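A few lines of Java (again mine, with made-up numbers and a hypothetical tolerance) illustrate all three points: rounding in double arithmetic, why a "0-1" variable legitimately comes back as 0.999999999999, and how a big M swamps small quantities.

    public class FinitePrecision {
        public static void main(String[] args) {
            // Double arithmetic rounds: 0.1 + 0.2 is not exactly 0.3.
            double sum = 0.1 + 0.2;
            System.out.println(sum);        // 0.30000000000000004
            System.out.println(sum == 0.3); // false

            // A solver reporting a binary variable as 0.999999999999 is the same
            // phenomenon, so test integrality with a tolerance, not with equality.
            double x = 0.999999999999; // hypothetical solver output
            double intTol = 1e-6;      // assumed integrality tolerance
            System.out.println("treat as 1? " + (Math.abs(x - 1.0) <= intTol)); // true

            // Mixing hugely different magnitudes ("big M") loses low-order digits:
            // the 1.0 is smaller than the spacing between representable doubles
            // near 1e20, so it vanishes entirely.
            double bigM = 1e20;
            System.out.println((bigM + 1.0) - bigM); // 0.0
        }
    }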

I may have more to say on this at a later date, but right now my computer is running out of electrons, so I'll quit here.
