Someone asked an interesting question on a support forum recently. The gist was: "How do I confirm that my model is correct?"
On the occasions that I taught simulation modeling, this was a standard topic. Looking back, I don't recall spending nearly as much time on it when teaching optimization, which was a mistake on my part. In those days, operations research/management science topics tended to be taught in OR/MS courses, at least at my institution. Since then, OR/MS topics have to some extent shifted into courses in application areas (such as supply chain management and its siblings), where they necessarily receive less coverage, since they share the course with application content. I have a suspicion that model correctness has slipped even further through the cracks as a result.
There may also be a bit of instructor bias involved. If you are an OR person teaching, say, optimization, you are probably more enthused with the mathematics (and perhaps the computational aspects) than with the quotidian details of applying it. I certainly was. If you are a supply chain instructor, you may want to spend as little time as possible on optimization (including implementation details) because you are anxious to get to the application of the results. Regardless of how it happens, when we don't teach model correctness, I suspect it is employers who pay the price.
Borrowing a bit from an excellent (albeit long in the tooth) simulation book by Law and Kelton[1], I thought I'd review what I know about building credible models. (Since I'm an academic, you may choose to take all this with a large grain of salt.)
From problem to solution
You (or someone) start with a problem. You turn that into a conceptual model, which could be a mathematical program, simulation, queuing model, forecasting model, or whatever – I'm an equal-opportunity offender. The conceptual model exists on paper (or whatever 21st century replacement for paper you use). You translate the conceptual model into computer code (or delegate it to some minion – one reason doctoral students were invented). This could mean writing fairly high-level code in a modeling language (examples: OPL, SIMSCRIPT), a general-purpose mathematical/statistical language (examples: MATLAB, R), or a general-purpose coding language, likely linked to some libraries (Java, anyone?). With probability approaching 1.0, the conceptual model will involve parameters, which you will need either to obtain from the end user or estimate from data. Once the model is coded and parameterized, you inflict it on some unwary computer and, after the usual endless debugging, hopefully obtain some results. With that, you're done, right? Not quite. There's still the matter of whether the results are (a) correct and (b) useful.
Validation and Verification
There are two key steps in ascertaining whether the results actually are meaningful in the context of the problem. Neither is necessarily easy to do. Verification refers to confirming that the code accurately represents the conceptual model. It occurs once in the flow diagram, at the marked location where the conceptual model is turned into code.
Validation refers to confirming that the conceptual model conforms to the user's reality. Validation occurs at multiple locations in our flow diagram.
- Linking problem to model: Does your model actually address the user's original problem? This is not frivolous, especially where academics are concerned. I'm pretty sure I've seen models that mysteriously migrated from the relatively simple thing the user needed to a more intriguing (i.e., publishable) problem bearing at most a tangential relation to the user's original issue.
- Linking assumptions to model: Are your assumptions appropriate?
- Sometimes the modeler is seduced into an alternate universe by the quest for computational tractability. I once saw a book on modeling (whose title I've sadly forgotten) that had a chapter about designing an automated chicken plucker. The chapter title was “Assume a Spherical Chicken”. The search for that led me to the Wikipedia page about "spherical cows".
- Sometimes the modeler simply knows no better. In an ideal world, an OR analyst spends time observing (or, better still, participating in) operations before attempting to model and improve them. You help load the trucks or work on the assembly line to get a sense of what actually goes on. Often, though, that's a luxury; as the analyst, you either lack the time or the necessary access. In those cases, you need to be extra diligent in checking your assumptions.
- Linking data to code: Is your data relevant and correctly analyzed? Getting accurate parameter estimates should be considered a part of model validation. A nontrivial part of this is “cleaning” the data. As I once learned the hard way, having operational data in a corporate database and having correct operational data in a corporate database are two entirely distinct things.
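To make the "cleaning" point concrete, here is a minimal sketch of the kind of screening I have in mind, written in Python with pandas assumed to be available; the file name and column names are invented for illustration. The idea is simply to rule out records that cannot be correct before estimating the parameter the model actually needs.

```python
# Minimal data-screening sketch (pandas assumed; file and column names invented).
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["placed", "shipped"])

# Flag records that are logically impossible, not merely unusual.
bad = (
    (orders["shipped"] < orders["placed"])   # shipped before it was ordered
    | (orders["quantity"] <= 0)              # nonpositive quantities
    | orders["unit_cost"].isna()             # missing costs
)
print(f"Dropping {bad.sum()} of {len(orders)} records as unusable.")
clean = orders.loc[~bad]

# Only then estimate the parameter the model needs, e.g. mean lead time in days.
lead_time_days = (clean["shipped"] - clean["placed"]).dt.days
print("Estimated mean lead time:", lead_time_days.mean())
```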
So, in a nutshell, pretty much everything other than the one step I marked in the diagram as subject to verification should be considered subject to validation.
How to verify
There are several common techniques for verification. Here are the ones I think are most important. Law and Kelton list a few others as well.
- If the model's output is a deterministic function of its inputs, and if you can find (or construct) test cases with independently known results, you can compare the computational results on those cases with the expected outcomes.
- You can show the model to one or more competent coders (with the background to understand the model statement) and ask them to review the code.
- You can create test cases with “extreme” inputs, run them, and verify that the code produces plausible outputs. For instance, in a queuing model or simulation, you can set the arrival rate equal to or greater than the service rate and verify that the queue explodes. In an optimization model, you can tweak parameters to force a particular constraint to be binding or to have slack, or a particular decision option to be too good to pass up or too expensive to consider, and see if the output matches your tweaking. As a concrete example, I have some code that assigns students to project teams according to certain criteria (mainly, that teams should be as similar to each other as possible). I can easily create test data that would allow for perfectly identical teams to be formed, and I can easily create test data that would prohibit certain mandatory requirements from being met. My code should produce perfect teams in the first case and spit out a useful error message in the second case.
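As an illustration of what such verification tests can look like, here is a rough Python sketch in the style of pytest, built around the team-assignment example just mentioned. The module, function, result attribute, and error class (team_model, assign_teams, result.imbalance, InfeasibleModelError) are hypothetical stand-ins for whatever your own code exposes; the point is the shape of the tests, not the API.

```python
# Hypothetical verification tests for a team-assignment model (names invented).
import pytest

from team_model import assign_teams, InfeasibleModelError  # hypothetical module


def test_identical_students_yield_perfect_teams():
    # Extreme input with a known answer: if every student has the same profile,
    # any split into equal-sized teams is perfect, so the model's reported
    # imbalance should be exactly zero.
    students = [{"gpa": 3.0, "major": "SCM", "gender": "F"}] * 12
    result = assign_teams(students, team_size=4)
    assert result.imbalance == 0


def test_impossible_requirement_is_reported():
    # Extreme input that cannot meet a mandatory requirement (at least one
    # student from each listed major on every team): the code should fail with
    # a useful error message, not quietly return a "solution".
    students = [{"gpa": 3.0, "major": "SCM", "gender": "F"}] * 12
    with pytest.raises(InfeasibleModelError):
        assign_teams(students, team_size=4, require_majors={"SCM", "MKT"})
```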
How to validate
Law and Kelton define a model with face validity as “… a model that, on the surface, seems reasonable to people who are knowledgeable about the system under study.” So a good starting point is to describe, in nonmathematical terms, what your model says, and see if the users agree that it sounds appropriate. Note that I wrote “users” (plural). Even if only one person will be responsible for running the code or implementing the solution, it pays to get input from a variety of people familiar with different aspects of the problem. It does little good to cook up a production scheduling model based on input from the person doing the scheduling, only to be told after the fact that the logistics folks either cannot warehouse that much product or cannot move that much product to market in a timely manner.
What I think of as historical validity (I'm not sure that's an official term) is worth checking if historical data for the system is available. Run the model with the historical inputs (parameter settings) and compare the output to the historical results. In a simulation model, you would like the model output to match the historical results (within the confines of what one can expect from an inherently stochastic model). In an optimization model, you would like the model's results to do as well as, if not better than, the historical results. It would also be instructive to check whether the historical solution is feasible in your model. If not, either your constraints are suspect or the users have some way of finagling violations … which should perhaps be baked into the model.
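Here is a toy Python sketch of that kind of check for an optimization model. The two-product production model and all of the numbers are invented purely for illustration; what matters is the structure: test the historical plan against the model's own constraints, then compare its cost with whatever your solver produces.

```python
# Toy historical-validity check (model and numbers invented for illustration):
# choose production quantities for two products to meet demand within a shared
# capacity of 100 machine hours at minimum cost.
hours = {"A": 2.0, "B": 3.0}     # machine hours per unit
cost = {"A": 5.0, "B": 4.0}      # cost per unit
demand = {"A": 20, "B": 15}      # units required
capacity = 100.0                 # machine hours available


def is_feasible(plan):
    """Check a plan against the model's own constraints."""
    meets_demand = all(plan[p] >= demand[p] for p in plan)
    within_capacity = sum(hours[p] * plan[p] for p in plan) <= capacity
    return meets_demand and within_capacity


def total_cost(plan):
    return sum(cost[p] * plan[p] for p in plan)


historical_plan = {"A": 22, "B": 16}   # what the users actually did last quarter

print("Historical plan feasible in the model?", is_feasible(historical_plan))
print("Historical cost:", total_cost(historical_plan))
# An optimized plan (from your solver of choice) should cost no more than the
# historical one. If the historical plan turns out to be infeasible, either a
# constraint is wrong or the users have a workaround the model should know about.
```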
Another thing to try (which might again fit under the umbrella of face validity) is to run scripted scenarios (including edge cases), describe the scenarios to users, show them the model results, and ask if the results seem credible given the scenarios (and, if not, why not). A variation of the scripted scenario option is sensitivity analysis. Start with a scenario the users understand and confirm that the output is credible. Slightly modify one parameter, or at most a small number of parameters, rerun, and show the users the changes in the output. Ask them if they would buy those changes as the appropriate reaction to the change in inputs. For instance, if you are simulating a customer service operation, try adding one server and see if the reductions (hopefully) in waiting time or increases (hopefully) in throughput seem plausible to the people with experience in the operation.
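For the customer-service example, a sensitivity run might look like the following sketch. It uses the standard analytical M/M/c (Erlang C) formulas rather than a full simulation so that it stays self-contained; the arrival and service rates are invented, and in practice you would pull both scenarios from your actual simulation or queuing model.

```python
# Sensitivity sketch: expected waiting time in an M/M/c queue, before and after
# adding one server (rates invented for illustration).
from math import factorial


def erlang_c(c, lam, mu):
    """Probability an arriving customer must wait in an M/M/c queue."""
    a = lam / mu                     # offered load
    rho = a / c                      # server utilization
    if rho >= 1:
        raise ValueError("Unstable system: utilization must be below 1.")
    numer = (a ** c / factorial(c)) / (1 - rho)
    denom = sum(a ** k / factorial(k) for k in range(c)) + numer
    return numer / denom


def mean_wait(c, lam, mu):
    """Expected time in queue (Wq) for an M/M/c system."""
    return erlang_c(c, lam, mu) / (c * mu - lam)


lam, mu = 9.0, 4.0                   # 9 arrivals/hour; each server handles 4/hour
for servers in (3, 4):               # baseline scenario, then one extra server
    print(f"{servers} servers: mean wait = {60 * mean_wait(servers, lam, mu):.1f} minutes")
# Show the before/after pair to the people who run the operation and ask whether
# a change of that size strikes them as believable.
```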
Credibility (or, staying off the shelf)
The ultimate tests of model credibility come in two places. The first is whether the model is ever implemented, or whether it languishes “on the shelf”. Even credible models can end up on the shelf; it happened to me once, when the model's champion within the company was reassigned. Lack of credibility, though, is perhaps the number one reason for a model never to be implemented. The second place credibility comes into play is when the model is implemented: did it make things better?
Validation is a big factor in making the model credible, and keeping it off the shelf, for more reasons than just the obvious one (correctness of the model). Involving users in the model design stage (through face validation) helps to get them familiar with the model, and increases their comfort using it. Subtly, it may also give them a sense of ownership. If they have invested time and energy in the model development, and in particular if they feel their voices have been heard, they will have a stake in seeing the model implemented and in seeing it succeed. They may also be more inclined to revisit assumptions and have an analyst (you?) tinker with the model if changes in the environment cause it to stop tracking reality accurately, whereas uninvested users may be quicker to scrap the model and go back to doing what worked before (or what they find comfortable).
Comments
This topic must be in the air now; we have just posted a similar post (with an actual case study) on the SAS/OR blog: http://blogs.sas.com/content/operations/2015/04/22/simulate-validate/
"Great minds often think alike", so it stands to reason the same might hold for those of us with lesser minds. :-) I hit a 404 error with the link you posted, but I think I found the post you meant at http://blogs.sas.com/content/operations/2015/05/05/simulate-validate/. (Perhaps it was updated and the date changed?) Anyway, thanks for the link.
Thanks for correcting the link, Paul. This is a WordPress quirk: after publishing the post we changed the date, but that is also included in the link. Aarghh.
I was wondering, what about semi-automatic ways of comparing models, e.g. by evidence (averaging the likelihood with a prior weighting distribution) as in the Bayesian framework?
Good question – and one I'm not really qualified to answer, since I've not done much in the way of Bayesian modeling. That, of course, won't stop me from trying. :-)
For statistical models, my inclination is to say that I would take that sort of comparison as input in the model selection process, but I would not treat it as definitive. No automatic technique, to the best of my knowledge, can replicate domain knowledge. Given all the spurious correlations running around out there, I tend to think that it's not too hard for a tight-fitting but utterly bogus model to win those sorts of automated comparisons.
For optimization models, I don't see the approach being applicable. For simulation models, that sort of comparison might make sense, at least to the extent of throwing up a red flag if the model does poorly.