Monday, August 21, 2017

Updated Stepwise Regression Function

Back in 2011, when I was still teaching, I cobbled together some R code to demonstrate stepwise regression using F-tests for variable significance. It was a bit unrefined, not intended for production work, and a few recent comments on that post raised some issues with it. So I've worked up a new and (slightly) improved version of it.

The new version is provided in an R notebook that contains both the stepwise function itself and some demonstration code using it. It does not require an R libraries besides the "base" and "stats" packages. There is at least one butt-ugly hack in it that would keep me from being hired in any sort of programming job, but so far it has passed all the tests I've thrown at it. If you run into issues with it, feel free to use the comment section below to let me know. I'm no longer teaching, though, so be warned that maintenance on this is not my highest priority.

The updated function has a few new features:
  • it returns the final model (as an lm object), which I didn't bother to do in the earlier version;
  • you can specify the initial and full models as either formulas (y~x+z) or strings ("y~x+z"), i.e., quotes are strictly optional; and
  • as with the lm function, it has an optional data = ... argument that allows you to specify a data frame.
There are also a few bug fixes:
  • if you set the alpha-to-enter greater than the alpha-to-leave, which could throw the function into an indefinite loop, the function will now crab at you and return NA
  • if you try to fit a model with more parameters than you have observations, the function will now crab at you and return NA; and
  • the function no longer gets confused (I think) if you happen to pick variable/column names that happen to clash with variable names used inside the function.
As always, the code is provided with a Creative Commons license, as-is, no warranty express or implied, your mileage may vary.

Update (11/09/18): I've tweaked the code to add a few features. See here for a post about the updates.

16 comments:

  1. Thanks for the updated function. It helped me in my project.

    ReplyDelete
  2. I am trouble integrating my own data frame. How do I modify the function arguments with a data from generated within the script?

    Thank you!

    ReplyDelete
    Replies
    1. You just use the optional "data" argument to specify the data frame. So if your data frame is in the variable "whatever", you just call stepwise(..., data = whatever).

      Delete
    2. Do you need to modify the function at all? Do you remove the data = NULL?

      Delete
    3. No. The "data = NULL" in the existing code declares a data argument and gives it the default value NULL. If you invoke the function without specifying the data argument, it expects the model variables to be globally defined. If you pass in a data frame in the data argument, it looks for the model variables in that data frame. This is exactly how the lm() function works.

      Delete
  3. Hello, is it possible to use as criteria to add or drop variables the P-Value instead of the F-test?

    ReplyDelete
    Replies
    1. Does the code currently allow that? No. Could the code be modified to use the p-value? Yes (assuming you are talking about the two-sided p-value, which is what is generated in the summary of lm models in R, and in most other software). Would it make a difference? No. Using two-sided p-values would be equivalent to using alpha-to-enter/alpha-to-leave with a two-sided t-test of each coefficient. If my memory serves me correctly (iffy these days), the two-sided t-test and the F-test are algebraically equivalent (the F statistic is the square of the t statistic), so decisions on which terms were significant (and thus which variables to add or drop) would be identical to what the current code gives.

      Delete
    2. Ok, thank you very much for your reply!

      Delete
    3. I'm not sure I was entirely clear in my previous response. The code currently uses p-values for the F test of each coefficient to pick which variable gets in (or out). Those p-values are identical to what you would get for the two-sided t-tests (what you normally see in regression output from software), so it's basically already doing what you had in mind. (One-sided t-tests are possible, but not commonly used in this context.)

      Delete
  4. Fantastic! My manager is more familiar with SPSS than R and wants to see it done like this. My only issue is that, despite setting alpha to enter less than alpha to exit, my dataset keeps looping with the same variable entering then exiting over and over. I pulled the first variable from the data set that triggered that, and then it started doing that to another variable.

    I'm thinking about adding some code that stores the previously dropped variable to make sure we don't add it right back in, but I wanted to check in here, first, before messing with it too much.

    Have you seen that happen?

    ReplyDelete
    Replies
    1. Are you saying that a variable enters, exits immediately (without any intervening changes to the model), enters again etc.? That should be impossible. If it happened, I would start to wonder whether there was some numerical instability in either the regression fitting or the calculation of the p-values. On the other hand, if it's something like "x enters, y enters, x leaves, z enters, y leaves, x enters, ...", I think that may be possible with annoying correlated predictors ... maybe. I don't recall ever seeing it happen, though. As far as blocking the variable that just left, that won't help in the second case, and might turn the first case into the second case. Hard to say.

      Delete
    2. Yes, that's exactly what is happening. I start with "output" ~ 1, and it adds about five variables, then when it adds the sixth, it drops it, then adds it again, then drops it again, and continues to do so until I interrupt it.

      I looked pretty carefully at the code and I think I know why. When it attempts to add, it looks into the full model results. When it attempts to drop, it looks at the current model results.

      Delete
    3. No, that's not it. The first two arguments to the add1() function are the current model and the full model. The full model is used only as a source of variables to consider. The add1() function takes each variable in full but not in current and evaluates what happens when that variable is added to current. It does not use p-values or t-statistics from the full model.

      Two questions come to mind. First, what are your choices for alpha-to-enter and alpha-to-leave. Second, what does the code say the p-values are for the "yo-yo" variable (both when it is chosen to enter and when it is chosen to leave)? Those are not currently printed, so you'll need to hack the code to print them (pmin and pmax respectively).

      Delete
  5. Hi Paul,

    this post at Cross Validated mentions that it is possible to use your function not only as stepwise, but as a backwards and forwards too. Can I ask you how?

    If I wish to do backward elimination, I guess I should only define the argument full.model and alpha.to.leave? Analogically, for a forward selection, should I start with an initial.model and only enter alpha.to.enter?

    Thank you!

    ReplyDelete
    Replies
    1. You had the right idea, but the previous code might not have worked as expected with omitted alpha values. I've just updated the code again. You can now skip alpha.to.enter for backward regression, or skip alpha.to.leave for forward regression. Keep an eye out for a short post with the latest updates.

      Delete

Due to recent spamming, comments are being moderated. If this is your first time commenting on the blog, please read the Ground Rules for Comments. In particular, if you want to ask an operations research-related question not relevant to this post, consider asking it on Mathematics Stack Exchange.