Monday, May 23, 2011

Stepwise Regression in R: Part II

Someone requested a demonstration of how to use the R stepwise function I previously posted, so here goes.  I'll use the "swiss" data set that comes with R (in the datasets package).  There are six variables in the data set; I will use Fertility as the dependent variable (mainly because that is what the example code in the help file for regsubsets does). The code below assumes that my function is saved in a file named stepwise.R, and that the working directory is set to the folder containing that file.

Decisions to add or delete a variable in my function are based on an F-test comparing the model with and without the variable.  In both cases, the null hypothesis is that the variable's true coefficient is zero, so a variable is deleted if we fail to reject the null hypothesis and added if we succeed in rejecting the null hypothesis.  For demonstration purposes, I will use significance levels $\alpha=0.05$ for deletions and $\alpha=0.10$ for additions.

Here is the code:
source('./stepwise.R') # load the stepwise function
data(swiss) # 47 observations of 6 variables, from dataset package
attach(swiss) # so we don't have to prefix every variable
# we will (arbitrarily) use alpha = 0.05 to add a variable and alpha = 0.10 to remove one
# the first run will start with just a constant term
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1, 0.05, 0.10)
# the second run starts with the complete model and initially winnows it
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, 0.05, 0.10)
Created by Pretty R at inside-R.org

The output for the first call (which starts with just a constant term, and builds from there) is as follows:

            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 70.14255   1.822101 38.49542 1.212895e-36

S = 12.491697, R-sq = 0.000000, R-sq(adj) = 0.000000, C-p = 94.805296
=====
+++ Adding Education 

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 79.6100585  2.1040971 37.835734 9.302464e-36
Education   -0.8623503  0.1448447 -5.953619 3.658617e-07

S = 9.446029, R-sq = 0.440616, R-sq(adj) = 0.428185, C-p = 35.204895
=====
+++ Adding Catholic 

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 74.2336892 2.35197061 31.562337 7.349828e-32
Education   -0.7883293 0.12929324 -6.097219 2.428340e-07
Catholic     0.1109210 0.02980965  3.720974 5.598332e-04

S = 8.331442, R-sq = 0.574507, R-sq(adj) = 0.555167, C-p = 18.486158
=====
+++ Adding Infant.Mortality 

                    Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      48.67707330 7.91908348  6.146806 2.235983e-07
Education        -0.75924577 0.11679763 -6.500524 6.833658e-08
Catholic          0.09606607 0.02721795  3.529511 1.006201e-03
Infant.Mortality  1.29614813 0.38698777  3.349326 1.693753e-03

S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = 8.178162
=====
+++ Adding Agriculture 

                   Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      62.1013116 9.60488611  6.465596 8.491981e-08
Education        -0.9802638 0.14813668 -6.617293 5.139985e-08
Catholic          0.1246664 0.02889350  4.314686 9.503030e-05
Infant.Mortality  1.0784422 0.38186621  2.824136 7.220378e-03
Agriculture      -0.1546175 0.06818992 -2.267454 2.856968e-02

S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
=====

So all the predictors other than Examination are added. Now here is the output from the second call:

                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03

S = 7.165369, R-sq = 0.706735, R-sq(adj) = 0.670971, C-p = 6.000000
=====
--- Dropping Examination 

                   Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      62.1013116 9.60488611  6.465596 8.491981e-08
Agriculture      -0.1546175 0.06818992 -2.267454 2.856968e-02
Education        -0.9802638 0.14813668 -6.617293 5.139985e-08
Catholic          0.1246664 0.02889350  4.314686 9.503030e-05
Infant.Mortality  1.0784422 0.38186621  2.824136 7.220378e-03

S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
===== 
 
We start with the full model, delete Examination, and stop there.

Not content to leave well enough alone, let's see what happens if we start  with the "E" variables (Education and Examination) and go from there.  The code for the function call is

stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ Examination + Education, 0.05, 0.10)
Created by Pretty R at inside-R.org

and the output is

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 85.2532753  3.0854981 27.630312 1.945244e-29
Examination -0.5572183  0.2319374 -2.402451 2.057160e-02
Education   -0.5394570  0.1924380 -2.803277 7.497224e-03

S = 8.981812, R-sq = 0.505485, R-sq(adj) = 0.483007, C-p = 28.135883
=====
+++ Adding Infant.Mortality 

                   Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      55.2746618  8.8077340  6.275696 1.451652e-07
Examination      -0.5108888  0.2063175 -2.476226 1.729132e-02
Education        -0.5225093  0.1709099 -3.057221 3.832793e-03
Infant.Mortality  1.4556114  0.4064507  3.581274 8.644778e-04

S = 7.973957, R-sq = 0.619096, R-sq(adj) = 0.592521, C-p = 14.252399
=====
+++ Adding Catholic 

                    Estimate Std. Error    t value     Pr(>|t|)
(Intercept)      50.02820666 8.66076269  5.7764204 8.325568e-07
Examination      -0.10580461 0.26036962 -0.4063631 6.865390e-01
Education        -0.70415772 0.17969218 -3.9186887 3.221868e-04
Infant.Mortality  1.30567908 0.39150335  3.3350393 1.790664e-03
Catholic          0.08631125 0.03649293  2.3651501 2.271709e-02

S = 7.579356, R-sq = 0.663865, R-sq(adj) = 0.631853, C-p = 9.993398
=====
--- Dropping Examination 

                    Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      48.67707330 7.91908348  6.146806 2.235983e-07
Education        -0.75924577 0.11679763 -6.500524 6.833658e-08
Infant.Mortality  1.29614813 0.38698777  3.349326 1.693753e-03
Catholic          0.09606607 0.02721795  3.529511 1.006201e-03

S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = 8.178162
=====
+++ Adding Agriculture 

                   Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      62.1013116 9.60488611  6.465596 8.491981e-08
Education        -0.9802638 0.14813668 -6.617293 5.139985e-08
Infant.Mortality  1.0784422 0.38186621  2.824136 7.220378e-03
Catholic          0.1246664 0.02889350  4.314686 9.503030e-05
Agriculture      -0.1546175 0.06818992 -2.267454 2.856968e-02

S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
===== 
 
All roads seem to lead to the same model. Also note that the constant term was not explicitly stated in the third call (no "1+" in the formulas). As usual with R, a constant term is assumed unless explicitly omitted (using "0 +").

2 comments:

  1. Hi, I used your code with my data and I end up with following error:
    Error in if (pmax > alpha.to.leave) { :
    missing value where TRUE/FALSE needed
    even though I replaced all missing values beforehead with zeros.. Could you suggest what might be the solution as I am not an expert for R.. Thanks!

    ReplyDelete
    Replies
    1. If a model results in a perfect fit (residual variance = 0), I think the drop1 F test generates NAs for all p-values. I'm not sure off the top of my head whether NAs for p-values can occur due to multicollinearity even when the fit is not perfect. You can try changing "pmax > alpha.to.leave" in the if statement to "!is.na(pmax) & pmax > alpha.to.leave", which should screen out the pathological case where pmax = NA (because all p-values are NAs).

      Delete

If this is your first time commenting on the blog, please read the Ground Rules for Comments. In particular, if you want to ask an operations research-related question not relevant to this post, consider asking it on OR-Exchange.