Friday, June 23, 2017

Premature Obituaries

[T]he report of my death was an exaggeration. (Mark Twain, 1897)
In a recent blog post, "Data Science Is Not Dead", Jean-Francois Puget discussed, and dissented from, a post by Jeroen ter Heerdt titled "Data Science is dead." Barring the possibility that Schroedinger shoved data science into a box and sealed it, both assertions cannot simultaneously be true. The central thesis of ter Heerdt's post is that data scientists have developed powerful and easy-to-use tools, now deployed on cloud services, that let lay people do the analyses that previously required said data scientists, in effect putting themselves out of business. Puget responds that "there is a very long tail of use cases where developing a packaged service isn't worth it", and draws parallels to operations research. Common and important problems such as routing, machine scheduling, crew scheduling and so on have led to the development of problem-specific commercial software, but "many companies still have an OR department (under that name, or as part of their data science department) because they have operations problems that cannot be optimized with off the shelf software or services".

I'm siding with Puget on this, having experienced the "demise" of management science (if not all of OR) some decades ago. When I started on the faculty at Michigan State University, management science (whether under that name, "operations research" or something else) was a common and important element of business school programs. We had mandatory core courses in MS at both the undergraduate and master's levels, as well as higher-level "elective" courses that were de facto requirements for some doctoral concentrations. We also participated in an interdisciplinary OR program at the master's level.

Gradually (but not gradually enough for me), MS evaporated at MSU (a bit ironic given the respective acronyms). Some of the more applied subject matter was moved into "functional area" courses (production planning, marketing, finance); most of the more conceptual subject matter just went away. As previously noted, canned software began to be available to solve many of the problems. The perceived need shifted from someone who understood algorithms to someone who could model the problem well enough to generate the inputs for the software.

As Puget notes, there is still demand for OR/MS professionals because there are new problems to be recognized and modeled, and models to be solved that do not fit neatly into the canned software. I believe there is also another reason not to start shoveling dirt on the grave of OR/MS. Users who learned a few basic incantations in a functional area class, without learning how the magic works (or does not work), may misapply techniques or produce incorrect models. Those who learn OR/MS (or analytics) as incantations may also tend to be a bit too literal-minded.

A possible analogy is the difference between a chef and someone like me who equates "cook" with "microwave on high". A chef understands a recipe as a framework, to be adjusted as needed. You couldn't get the cut of meat called for? Switch to this meat, then adjust this spice and cook a bit longer. On the (exceedingly rare) occasions I actually prepare a dish, I follow the instructions slavishly and try not to freelance at all.

Along those lines, I attended a thesis proposal defense (for a student not trained in OR/MS) where the work involved delivery routing and included a variant of a traveling salesman model. Both the candidate and his committee took it as axiomatic that a vehicle could not pass through the same node in the routing graph twice because, well, that's part of the definition of the TSP. So I posed the following simple scenario. You have a warehouse W and two customers A and B, all on the same street, with W between A and B. A is in a cul-de-sac, so the network diagram looks like

A -- W -- B -- whatever

with any number of links to B but only the A-W edge incident to A (and only A-W and W-B incident to W). Trivial exercise: prove that, under the strict definition of a TSP, A and B cannot be on the same route, no matter how close they are to W (and each other). (Hint: any route that visits both A and B must pass through W at least twice.)
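For the skeptical, here is a quick brute-force check (my illustration, not part of the defense; the node C stands in for the "whatever" beyond B). It confirms that no strict TSP tour exists on this little network at all, let alone one covering both A and B:

from itertools import permutations

# Toy road network from the diagram above: A -- W -- B -- C.
edges = {("A", "W"), ("W", "B"), ("B", "C")}
nodes = ["A", "W", "B", "C"]

def connected(u, v):
    return (u, v) in edges or (v, u) in edges

# A strict TSP tour is a cycle visiting every node exactly once.
tours = [perm for perm in permutations(nodes)
         if all(connected(perm[i], perm[(i + 1) % len(nodes)])
                for i in range(len(nodes)))]
print(tours)  # [] -- A has only one incident edge, so no tour can exist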

My point here was not to bust the student's chops. An OR "chef" building a network model for truck deliveries would (hopefully) recognize that the arcs should represent the best (shortest, fastest, cheapest ...) way to travel between any pair of nodes, and not just physical roads. So, in the above example, there should be arcs between A and B that represent going "by way of" W. It's fine to say that you will stop at any destination exactly once, but I know of only two reasons why one would route a vehicle with a requirement that it pass through any location at most once: it's either laying land mines or dribbling radioactive waste behind it. Hopefully neither applied in the student's case.
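A minimal sketch of that fix, with made-up distances (mine, purely illustrative): compute all-pairs shortest paths, here via Floyd-Warshall, and use the results as "meta-arc" lengths, so that an A-B arc legitimately means "A to B by way of W".

INF = float("inf")
nodes = ["A", "W", "B", "C"]

# Physical road distances; only actual road segments are finite.
dist = {(u, v): (0 if u == v else INF) for u in nodes for v in nodes}
for u, v, d in [("A", "W", 2), ("W", "B", 3), ("B", "C", 4)]:
    dist[(u, v)] = dist[(v, u)] = d

# Floyd-Warshall: a meta-arc is the cheapest path between two stops,
# passing through any intermediate nodes as often as needed.
for k in nodes:
    for i in nodes:
        for j in nodes:
            if dist[(i, k)] + dist[(k, j)] < dist[(i, j)]:
                dist[(i, j)] = dist[(i, k)] + dist[(k, j)]

print(dist[("A", "B")])  # 5 = 2 + 3: "A to B by way of W", now a usable arc

A TSP over these meta-arcs stops at each customer exactly once while letting the truck roll past W as often as the roads require.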

The student, on the other hand, could not see past "arc = physical roadway" ... nor was he the only one. After the defense, the student contacted two companies that produce routing software for the trucking industry. According to him (and I'll have to take his word for this), neither of them had given any thought to the possibility that passing through the same node twice might be optimal, or to creating "meta-arcs" that represent best available routes rather than just direct physical links. If true, it serves as a reminder that canned software is not always correct software. Caveat emptor.

The flip side of being too "cookie cutter" in modeling is being too unstructured. As an instructor, I cringed at students (and authors) who thought a linear regression model could be applied anywhere, or that it was appropriate to pick any variable, even a categorical variable, as the dependent variable for a linear regression. To me, at least, neural networks and software services sold as "machine learning" are more opaque than regression models. Having someone with modest (if any) data science training cram whatever data is at hand into one end of a statistical black box and accept whatever comes out the other end does not bode well.
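As a toy illustration of the categorical-dependent-variable trap (my example, with made-up data): recode the same categories in two different, equally arbitrary ways, and ordinary least squares cheerfully returns two different "models".

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = ["red", "green", "blue", "red", "green", "blue"]

for coding in [{"red": 0, "green": 1, "blue": 2},
               {"red": 0, "green": 2, "blue": 1}]:
    # Assign the arbitrary numeric codes and fit y = a*x + b by OLS.
    y = np.array([coding[c] for c in labels], dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    print(coding, "-> slope", round(slope, 3))

The two codings yield different slopes (about 0.229 versus 0.114), so the regression is modeling the arbitrary labeling scheme, not the phenomenon.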

So, in addition to Puget's "tail problems", I think there will always be value in having some people trained at the "chef" level, whether it be in OR/MS or data science, working alongside the "line cooks".

Update: I just came across an article published by CIO, "The hidden risk of blind trust in AI’s ‘black box’", that I think supports the contention that companies will need data scientists despite the availability of machine learning tools. The gist is that not understanding how the AI tool arrives at its conclusions or predictions can create some hazard, particularly in regulated industries. For instance, if there is some question about whether a company discriminates illegally in providing mortgages or loans, "the machine said" may not be an adequate defense.