Monday, August 21, 2017

Updated Stepwise Regression Function

Back in 2011, when I was still teaching, I cobbled together some R code to demonstrate stepwise regression using F-tests for variable significance. It was a bit unrefined, not intended for production work, and a few recent comments on that post raised some issues with it. So I've worked up a new and (slightly) improved version of it.

The new version is provided in an R notebook that contains both the stepwise function itself and some demonstration code using it. It does not require an R libraries besides the "base" and "stats" packages. There is at least one butt-ugly hack in it that would keep me from being hired in any sort of programming job, but so far it has passed all the tests I've thrown at it. If you run into issues with it, feel free to use the comment section below to let me know. I'm no longer teaching, though, so be warned that maintenance on this is not my highest priority.

The updated function has a few new features:
  • it returns the final model (as an lm object), which I didn't bother to do in the earlier version;
  • you can specify the initial and full models as either formulas (y~x+z) or strings ("y~x+z"), i.e., quotes are strictly optional; and
  • as with the lm function, it has an optional data = ... argument that allows you to specify a data frame.
There are also a few bug fixes:
  • if you set the alpha-to-enter greater than the alpha-to-leave, which could throw the function into an indefinite loop, the function will now crab at you and return NA
  • if you try to fit a model with more parameters than you have observations, the function will now crab at you and return NA; and
  • the function no longer gets confused (I think) if you happen to pick variable/column names that happen to clash with variable names used inside the function.
As always, the code is provided with a Creative Commons license, as-is, no warranty express or implied, your mileage may vary.


  1. Thanks for the updated function. It helped me in my project.

  2. I am trouble integrating my own data frame. How do I modify the function arguments with a data from generated within the script?

    Thank you!

    1. You just use the optional "data" argument to specify the data frame. So if your data frame is in the variable "whatever", you just call stepwise(..., data = whatever).

    2. Do you need to modify the function at all? Do you remove the data = NULL?

    3. No. The "data = NULL" in the existing code declares a data argument and gives it the default value NULL. If you invoke the function without specifying the data argument, it expects the model variables to be globally defined. If you pass in a data frame in the data argument, it looks for the model variables in that data frame. This is exactly how the lm() function works.


Due to recent spamming, comments are being moderated. If this is your first time commenting on the blog, please read the Ground Rules for Comments. In particular, if you want to ask an operations research-related question not relevant to this post, consider asking it on Mathematics Stack Exchange.