## Sunday, February 27, 2011

### Stepwise Regression in R

Let me start with a disclaimer:  I am not an advocate of stepwise regression.  I teach it in a doctoral seminar (because it's in the book, and because the students may encounter it reading papers), but I try to point out to them some of its limitations.  If you want to read some interesting discussions on the issues with stepwise, search the USENET group sci.stat.math for references to stepwise.  (If you are a proponent of stepwise, I suggest that you don flame retardant underwear first.)

Since I teach stepwise in my seminar, I would like to demonstrate it in R (not to mention some of my students are learning R while doing their homework, which includes a stepwise problem).  The catch is that R seems to lack any library routines to do stepwise as it is normally taught.  There is a function (leaps::regsubsets) that does both best subsets regression and a form of stepwise regression, but it uses AIC or BIC to select models.  That's fine for best subsets, but stepwise (at least as I've seen it in every book or paper where I've encountered it) uses nested model F tests to make decisions.  Again, if you search around, you can find some (fairly old) posts on help forums by people searching (fruitlessly, it appears) for a stepwise implementation in R.

So I finally forced myself to write a stepwise function for R.  My knowledge of R coding is very, very limited, so I made no attempt to tack on very many bells and whistles, let alone make it a library package.  Unlike most R routines, it does not create an object; it just merrily writes to the standard output stream.  There are a number of limitations (expressed in the comments), and I've only tested it on a few data sets.  All that said, I'm going to post it below, in case someone else is desperate to do conventional stepwise regression in R.

=== code follows ===
    #
# This is an R function to perform stepwise regression based on a "nested model" F test for inclusion/exclusion
# of a predictor.  To keep it simple, I made no provision for forcing certain variables to be included in
# all models, did not allow for specification of a data frame, and skipped some consistency checks (such as whether
# the initial model is a subset of the full model).
#
# One other note: since the code uses R's drop1 and add1 functions, it respects hierarchy in models. That is,
# regardless of p values, it will not attempt to drop a term while retaining a higher order interaction
# involving that term, nor will it add an interaction term if the lower order components are not all present.
# (You can of course defeat this by putting interactions into new variables and feeding it what looks like
# a first-order model.)
#
# Consider this to be "beta" code (and feel free to improve it).  I've done very limited testing on it.
#
# Author: Paul A. Rubin (rubin@msu.edu)
#
stepwise <- function(full.model, initial.model, alpha.to.enter, alpha.to.leave) {
# full.model is the model containing all possible terms
# initial.model is the first model to consider
# alpha.to.enter is the significance level above which a variable may enter the model
# alpha.to.leave is the significance level below which a variable may be deleted from the model
# (Useful things for someone to add: specification of a data frame; a list of variables that must be included)
full <- lm(full.model);  # fit the full model
msef <- (summary(full)$sigma)^2; # MSE of full model n <- length(full$residuals);  # sample size
allvars <- attr(full$terms, "predvars"); # this gets a list of all predictor variables current <- lm(initial.model); # this is the current model while (TRUE) { # process each model until we break out of the loop temp <- summary(current); # summary output for the current model rnames <- rownames(temp$coefficients);  # list of terms in the current model
print(temp$coefficients); # write the model description p <- dim(temp$coefficients)[1];  # current model's size
mse <- (temp$sigma)^2; # MSE for current model cp <- (n-p)*mse/msef - (n-2*p); # Mallow's cp fit <- sprintf("\nS = %f, R-sq = %f, R-sq(adj) = %f, C-p = %f", temp$sigma, temp$r.squared, temp$adj.r.squared, cp);
write(fit, file="");  # show the fit
write("=====", file="");  # print a separator
if (p > 1) {  # don't try to drop a term if only one is left
d <- drop1(current, test="F");  # looks for significance of terms based on F tests
pmax <- max(d[-1,6]);  # maximum p-value of any term (have to skip the intercept to avoid an NA value)
if (pmax > alpha.to.leave) {
# we have a candidate for deletion
var <- rownames(d)[d[,6] == pmax];  # name of variable to delete
if (length(var) > 1) {
# if an intercept is present, it will be the first name in the list
# there also could be ties for worst p-value
# taking the second entry if there is more than one is a safe solution to both issues
var <- var[2];
}
write(paste("--- Dropping", var, "\n"), file="");  # print out the variable to be dropped
f <- formula(current);  # current formula
f <- as.formula(paste(f[2], "~", paste(f[3], var, sep=" - ")));  # modify the formula to drop the chosen variable (by subtracting it)
current <- lm(f);  # fit the modified model
}
}
# if we get here, we failed to drop a term; try adding one
# note: add1 throws an error if nothing can be added (current == full), which we trap with tryCatch
a <- tryCatch(add1(current, scope=full, test="F"), error=function(e) NULL);  # looks for significance of possible additions based on F tests
if (is.null(a)) {
break;  # there are no unused variables (or something went splat), so we bail out
}
pmin <- min(a[-1,6]);  # minimum p-value of any term (skipping the intercept again)
if (pmin < alpha.to.enter) {
# we have a candidate for addition to the model
var <- rownames(a)[a[,6] == pmin];  # name of variable to add
if (length(var) > 1) {
# same issue with ties, intercept as above
var <- var[2];
}
f <- formula(current);  # current formula
f <- as.formula(paste(f[2], "~", paste(f[3], var, sep=" + ")));  # modify the formula to add the chosen variable
current <- lm(f);  # fit the modified model
}
# if we get here, we failed to make any changes to the model; time to punt
break;
}
}

Created by Pretty R at inside-R.org

1. Paul,
Love the implementation. I've been using the leaps package to do backwards selection but like you I'm not a huge fan of BS or stepwise. To me its basically a good start but that's all it is.

One method I did was to do a "bootstrap" and do several replications of the selection process over random samples of the data. Then I just created a list of the most frequently selected variables from selection. Again not perfect but may provide a smoother analysis.

I'll give you're routine a try next time I'm creating a model.

2. @Larry: Thanks for the comment, and I appreciate having a second set of eyes on it.

The bulk of the opposition to stepwise I've seen on the Net also applies to best subsets, and (if I dare paraphrase some lengthy discussions) seems to center on letting the data dictate the model. I share those concerns, and caution students accordingly, but in exploratory mode you sometimes need help paring down a universe of models into a set small enough that you can actually study them.

In those situations, my first choice is best subsets, and the only time I'd ever advocate stepwise would be as a pre-filter for best subsets (when you have too many variables for best subsets to handle). In those situations, I run (bi-directional) stepwise multiple times, first starting from an empty model, then a full model, then an initial model comprised of all the variables excluded from the previous models. Any variable that can't find traction in any of those runs gets the heave-ho, until the set or remaining variables is small enough to run through best subsets. Your bootstrapping idea is different, but in a similar vein.

I'm curious why you like backward selection. I've steadily abjured both backward and forward selection. The few papers where I've seen either used seemed to be trying to create an ordering of variables from strongest to weakest predictors.

3. I don't really have reason why I like backwards selection better in that its quicker than stepwise. I've read that Forward Selection can have a weaker selection criteria because it doesn't look at all the data while Backward Selection works from all the data and excludes from there. That's a poor man's description but its all i remember.

I forgot to mention. You might want to put your code on the R tag at Github.
https://github.com/languages/R

4. Paul, just a side-remark, you might want to add a syntax highlighter for beautifying code, so it's easier to read. It does not need to be complicated, you can use online services such as http://alexgorbatchev.com/SyntaxHighlighter/ , though R is not supported here others might. Just a suggestion :-)

5. @Larry: I took a look at Github, but I think it might be overkill for one little function, particularly since I have no ambitions to maintain it. (Not to mention I have a SourceForge account, and account proliferation already has my eyes crossing at random intervals.) If someone smacks me with a two-by-four and I suddenly develop an ambition to blow up that function into an R package, I'll revisit Github. Thanks for the suggestion, though.

6. @Bo: Hadn't thought of syntax highlighting, but you're right, it makes the code more readable. After a little research, I settled on Pretty-R, which did a nice job -- other than introducing the dreaded horizontal scroll.

7. Sorry for making you work on a Sunday :-) Looks much better, nice feature with links into R-docs on function names.

8. Nice implementation. In Oracle Crystal Ball, we have implemented both forward and iterative stepwise regression, and in my experience, when there are too many variables to select from, iterative stepwise regression fares better than the forward variant. Although, as you have mentioned, the result from any of these methods should be a starting point for more in-depth analysis, rather than the finish line.

9. "Too many variables" is the key for me. If I'm dredging for a model (not working from a theoretical basis), I'd rather use best subsets than stepwise. When the pool of variables is huge (which has happened to me with time series when I have indicators for month interacting with everything that can't outrun them), then I may use stepwise to winnow the pool for best subsets.

10. Thanks for this, that is exactly what i need. Can you post an example of input model and how you input alpha.

11. @Ofer: Done (http://orinanobworld.blogspot.com/2011/05/stepwise-regression-in-r-part-ii.html).

12. Thank you for the wonderful job.

13. Hi Paul:
This is a really nice code and I like it.
I think we can even use this code for the Forward selection and Backward elimination as well. For example, to fit a forward selection with alpha to enter as 0.15, we can write stepwise(full.model, initial.model,0.15,1). And to fit a backward elimination with alpha to drop as 0.1, we write stepwise(initial.model,full.model,1,0.1). Here I interchanged the order of initial and full model in the arguments. Is that right? If this is correct, then you may want to add this capability in the description. And by the way, I tested your code over all three methods, and it is working like a charm.

Thank you.
Sina

1. Sina,

Whether your method for backward elimination works will depend on what you specify for the initial model. If the initial model contains an unfortunate subset of the variables, you can put the stepwise function into a loop, adding and deleting things. The safe way to do backward elimination is to set the third parameter (alpha.to.enter) to 0 rather than 1. The first argument can be anything, as long as it is a valid model. The second argument should be the full model, as you had it.

If this is your first time commenting on the blog, please read the Ground Rules for Comments. In particular, if you want to ask an operations research-related question not relevant to this post, consider asking it on OR-Exchange.