Yesterday I had a rather rude reminder (actually, two) of something I've known for a while. I was running a Java program that uses CPLEX to solve an integer programming model. The symptoms were as follows: shortly after the IP solver run started, I ran out of RAM, the operating system started paging memory to the hard drive, and the resulting hard drive thrashing made the system extremely sluggish (to put it charitably). Long after I regained enough control to kill the program, there was a significant amount of disk activity (and concomitant noise) as the system gradually came back to its senses.
How did this happen? My system has four cores, which means CPLEX defaults to running four parallel threads when solving models. What's not always obvious is that each thread gets a separate copy of the model. (I'm not sure if it is the entire model after presolving or just most of it, but it's definitely a large chunk.) In my particular case, the model begins with an unpredictably large number of constraints, and when that number got big, the model got big -- not big enough to be a problem if I used a single thread, but too big to get away with four copies of it ... or, as it turned out, three copies. (The second thrashing event was when I tried to run it with three threads.)
Parallel threading is great, but there are two caveats. First, performance is a sublinear function of the number of threads: doubling the number of threads will not cut the run time in half, tripling it will not cut the run time to a third of the single-thread time, and so on. Second, if you are dealing with a largish model, you might want to run it for a little while with a single thread to see how much memory it eats, and then decide how many copies your system can handle comfortably. That number is an upper bound on how many threads you should use.
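In case it helps, here is a minimal sketch (in Java, since that is what I was using) of capping the thread count through the CPLEX parameter interface. The model file name is just a placeholder, and the commented-out working-memory setting is an optional knob I have not tested here, so treat both as assumptions rather than a recipe.

import ilog.concert.IloException;
import ilog.cplex.IloCplex;

public class ThreadCapDemo {
    public static void main(String[] args) throws IloException {
        IloCplex cplex = new IloCplex();
        // "model.lp" is a placeholder; build or import your model however you normally do.
        cplex.importModel("model.lp");
        // Start with a single thread and watch memory use in the OS monitor.
        // (The default value 0 lets CPLEX choose the thread count itself.)
        cplex.setParam(IloCplex.Param.Threads, 1);
        // Optionally cap working memory (in MB) so the node tree spills to disk
        // files rather than RAM; check the documentation for your CPLEX version.
        // cplex.setParam(IloCplex.Param.WorkMem, 2048);
        cplex.solve();
        cplex.end();
    }
}

Once you know roughly how much RAM one thread chews up, set the Threads parameter to the largest count your machine can comfortably hold.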
Friday, January 25, 2019
2 comments:
Thanks for sharing this; it's really interesting. Computational tests seem to confirm that, with the current software setup, there is definitely an upper bound on the number of threads that give a speedup (it seems to be 10-20 at the very most).
My question, though, is whether you think we could fundamentally change this by changing the architecture of the algorithms. I.e., do all the nodes need to have a local copy of the model? Or is this just the way it is going to be?
I've tended to maintain blissful ignorance of the algorithms used since they got creative and started factoring matrices rather than doing Gaussian pivoting. (This means I pretty much tuned out before you were born.) That said, there's always the option to keep one tree and just use the extra threads for various vector operations, such as refactoring the basis matrix or multiplying the RHS of some node problem by a known dual solution to see what you get. In fact, I don't know whether other solvers make copies of some or all of the problem matrix for each thread.
Apparently the code wonks at IBM decided that this was the route that gave the best "average" performance, and I bow to their superior knowledge about this stuff. I do my work on a single PC, so I have no experience with clusters, but my guess is that when the various "cores" you are using are actually on different machines or blades, with their own memory, the matrix copy thing may be a bit of a non-issue. On a single desktop PC, it clearly can bite you.
Lastly, regarding speed, as you divvy up the work into more and more threads, you encounter more and more effort to coordinate them, to do the division of load (and reintegration of the results), and to handle requisite communication. So there is what I think is an unavoidable law of diminishing returns, and at some point the amount of back and forth exceeds the amount of actual progress. (In other words, you've reinvented Congress.) Futzing with the code base might change the number of threads where gain turns into loss, but there will always be a limit.
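Just to illustrate that last point, here is a toy calculation (not a model of how CPLEX actually schedules work): take an Amdahl-style speedup formula and tack on a per-thread coordination cost. The serial fraction and overhead values are invented purely for illustration.

public class SpeedupToy {
    public static void main(String[] args) {
        double serial = 0.05;   // invented: fraction of the work that cannot be parallelized
        double overhead = 0.02; // invented: coordination cost added per extra thread
        for (int n = 1; n <= 16; n++) {
            double time = serial + (1.0 - serial) / n + overhead * (n - 1);
            System.out.printf("threads=%2d  relative time=%.3f  speedup=%.2f%n",
                    n, time, 1.0 / time);
        }
    }
}

With these particular numbers, the relative run time bottoms out somewhere around seven threads and then starts climbing again, which is the point where gain turns into loss.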