Monday, August 21, 2017

Updated Stepwise Regression Function

Back in 2011, when I was still teaching, I cobbled together some R code to demonstrate stepwise regression using F-tests for variable significance. It was a bit unrefined, not intended for production work, and a few recent comments on that post raised some issues with it. So I've worked up a new and (slightly) improved version of it.

The new version is provided in an R notebook that contains both the stepwise function itself and some demonstration code using it. It does not require any R packages besides "base" and "stats". There is at least one butt-ugly hack in it that would keep me from being hired in any sort of programming job, but so far it has passed all the tests I've thrown at it. If you run into issues with it, feel free to use the comment section below to let me know. I'm no longer teaching, though, so be warned that maintenance on this is not my highest priority.

The updated function has a few new features:
  • it returns the final model (as an lm object), which I didn't bother to do in the earlier version;
  • you can specify the initial and full models as either formulas (y~x+z) or strings ("y~x+z"), i.e., quotes are strictly optional; and
  • as with the lm function, it has an optional data = ... argument that allows you to specify a data frame.
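By way of illustration, a call might look something like the following. (This is just a sketch: the argument names here are indicative, so check the notebook for the definitive signature.)

```r
# Illustrative call (argument names are indicative; see the notebook for
# the actual signature of the stepwise function).
df <- data.frame(y = rnorm(50), x = rnorm(50), z = rnorm(50), w = rnorm(50))

fit <- stepwise(full.model     = y ~ x + z + w,  # search scope, as a formula ...
                initial.model  = "y ~ 1",        # ... or as a string; quotes optional
                alpha.to.enter = 0.05,           # p-value threshold to add a variable
                alpha.to.leave = 0.10,           # p-value threshold to drop a variable
                data = df)                       # optional data frame, as in lm()

summary(fit)  # the final model comes back as an lm object
```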
There are also a few bug fixes:
  • if you set the alpha-to-enter greater than the alpha-to-leave, which could throw the function into an infinite loop, the function will now crab at you and return NA;
  • if you try to fit a model with more parameters than you have observations, the function will now crab at you and return NA; and
  • the function no longer gets confused (I think) if you happen to pick variable/column names that clash with variable names used inside the function.
As always, the code is provided with a Creative Commons license, as-is, no warranty express or implied, your mileage may vary.

Update (11/09/18): I've tweaked the code to add a few features. See here for a post about the updates.

Monday, August 7, 2017

Rolling Horizons

I keep seeing questions posted by people looking for help as they struggle to optimize linear programs (or, worse, integer linear programs) with tens of millions of variables. In my conscious mind, I know that commercial optimizers such as CPLEX allow models that large (at least if you have enough memory) and can often solve them (at least if you have enough computing resources to throw at the problems). My lizard brain, however, was conditioned by the state of the art in the late '70s to send me running (typically while screaming) at the sight of a model with more than, oh, 50 or so variables. Wrapping my head around tens of millions of variables, let alone thinking of a strategy for getting an optimal solution "quickly", is quite a struggle.

A former acquaintance from my student days once articulated his strategy for essay exams to me: if you don't know the answer, argue the premise of the question. In that vein, I'm inclined to ask why so many variables are necessary. One possibility is that the model captures decisions for a sequence of time periods. If decisions in one time period had no impact on subsequent periods, the problem would decompose naturally into a bunch of smaller problems; so if we are talking about a multiperiod model, it's safe to assume that the periods are connected.

That brings me to a strategy I learned back in primitive times, the "rolling horizon" approach. Let me stipulate up front that this is a heuristic method. It does not provide a provably optimal solution for the entire time horizon. Still, given the up-tick in humongous models, I'm starting to wonder if rolling horizons are no longer taught (or are looked down on).

The basic concept is simple. The devil is in the details. Let's say we want to solve a planning model over a horizon of $H$ time periods, and that one omnibus model for the entire horizon is proving intractable. The rolling horizon approach is as follows.
  1. Pick a shorter horizon $K < H$ that is tractable, and a number of periods $F \le K$ to freeze.
  2. Set "boundary conditions" (more on this below).
  3. Solve a model for periods $1,\dots,K$, incorporating the boundary conditions.
  4. Freeze the decisions for periods $1,\dots,F$.
  5. Solve a model for periods $F+1,\dots, \min(F+K, H)$.
  6. Repeat ad nauseam, rolling the window forward $F$ periods at a time, until period $H$ has been covered.
Note that if $F<K$, some decisions made in each solution but not frozen will be subject to revision in the next model.
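To make the mechanics concrete, here is a small, self-contained sketch in R (using the lpSolve package) for a toy single-product production/inventory problem. All the data are made up, and the boundary condition (a minimum inventory level at the end of each window, discussed below) is deliberately simple-minded; this is meant only to show the freeze-and-roll bookkeeping, not to be a template for a real model.

```r
# Rolling horizon sketch for a toy production/inventory problem.
# All data are invented; lpSolve solves each window's LP.
library(lpSolve)

set.seed(123)
H     <- 24                        # full planning horizon (periods)
K     <- 6                         # window length (periods per subproblem)
F     <- 3                         # periods to freeze after each solve
dem   <- sample(30:60, H, replace = TRUE)  # demand per period
cap   <- 70                        # production capacity per period
cprod <- 1.0                       # unit production cost
chold <- 0.2                       # unit holding cost per period
Imin  <- 20                        # boundary condition: inventory at window end

inv   <- 40                        # starting inventory (initial condition)
plan  <- numeric(0)                # frozen production decisions
start <- 1

while (start <= H) {
  end <- min(start + K - 1, H)
  n   <- end - start + 1           # periods in this window
  d   <- dem[start:end]
  # Decision variables: x[1..n] = production, I[1..n] = end-of-period inventory.
  obj <- c(rep(cprod, n), rep(chold, n))
  # Inventory balance: I[t] - I[t-1] - x[t] = -d[t], with I[0] = inv.
  bal <- matrix(0, n, 2 * n)
  for (t in 1:n) {
    bal[t, t]     <- -1                      # -x[t]
    bal[t, n + t] <-  1                      # +I[t]
    if (t > 1) bal[t, n + t - 1] <- -1       # -I[t-1]
  }
  rhs.bal    <- -d
  rhs.bal[1] <- rhs.bal[1] + inv
  # Capacity: x[t] <= cap.  Boundary: I[n] >= Imin (skipped in the last window).
  capm <- cbind(diag(n), matrix(0, n, n))
  bnd  <- c(rep(0, 2 * n - 1), 1)
  con  <- rbind(bal, capm, bnd)
  dir  <- c(rep("=", n), rep("<=", n), ">=")
  rhs  <- c(rhs.bal, rep(cap, n), if (end < H) Imin else 0)
  sol  <- lp("min", obj, con, dir, rhs)
  x    <- sol$solution[1:n]
  I    <- sol$solution[(n + 1):(2 * n)]
  # Freeze the first F periods (or whatever is left) and roll forward.
  keep  <- min(F, n)
  plan  <- c(plan, x[1:keep])
  inv   <- I[keep]                 # state after the last frozen period
  start <- start + keep
}
print(round(plan, 1))              # the frozen production plan for periods 1..H
```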

The initial conditions (starting inventory, locations of vehicles, pending orders, ...) for each model are dictated by the state of the system after the last frozen period. The boundary conditions are limits on how much of a mess you can leave at the end of the reduced planning horizon (period $K$ in the first model, $K+F$ in the second model, etc.). More precisely, they limit the terminal state of things that will be initial conditions in the next model.

As a concrete example, consider a manufacturing scheduling model. You start with inventories of components and finished products, available capacity of various kinds, and unfilled orders, and you end with the same kinds of things. Without boundary conditions, your solution for the first model might end period $K$ with no finished goods inventory. Why make stuff if it doesn't count toward your demand within the time frame considered by the model? That might make the problem for periods $K+1$ onward infeasible, though: you have orders that must be filled, they exceed your immediate production capacity, and the cupboard was left bare by the previous solution.

So you want to add constraints of the form "don't leave me with less than this much inventory on hand" or "don't leave me with more than this many unfilled orders". Picking values for those limits is a bit of an art form. Make the limits too draconian and you will get a very suboptimal solution (say, piling up way more inventory than you'll really need) or possibly even make the early subproblems infeasible. Make the limits too laissez faire, and you may force thoroughly suboptimal solutions to later subproblems (piling on lots of expensive overtime, blowing off soon-to-be-former customers) or even make the later problems infeasible. Still, it's usually possible to come up with pretty reasonable boundary conditions, perhaps with some trial and error, and it's a way to get a solution that, if not globally optimal, is at least good (and preferable to staring bleary-eyed at your screen in the 30th hour of a run, wondering if the beast will ever converge).
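In symbols, the boundary conditions for the manufacturing example might amount to nothing fancier than bounds on the terminal state of the window, say $I_K \ge \underline{I}$ and $B_K \le \overline{B}$, where $I_K$ is finished goods inventory and $B_K$ is the backlog of unfilled orders at the end of period $K$, and $\underline{I}$ and $\overline{B}$ are the (judgment call) limits.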

The term "rolling horizon" was coined in reference to problems that are decomposed based on time, but the same concept may extent to problems that can be decomposed based on position. I used it not long ago to get a heuristic for a problem placing nodes in a physical network. In essence, we took the original geographic region and chopped it into pieces, picked a piece to start from, and then worked our way outward from that piece until all the pieces had been solved, each based on the (frozen) solution to the more inward pieces.