## Monday, May 23, 2011

### Stepwise Regression in R: Part II

Someone requested a demonstration of how to use the R stepwise function I previously posted, so here goes.  I'll use the "swiss" data set that comes with R (in the datasets package).  There are six variables in the data set; I will use Fertility as the dependent variable (mainly because that is what the example code in the help file for regsubsets does). The code below assumes that my function is saved in a file named stepwise.R, and that the working directory is set to the folder containing that file.

Decisions to add or delete a variable in my function are based on an F-test comparing the model with and without the variable.  In both cases, the null hypothesis is that the variable's true coefficient is zero, so a variable is deleted if we fail to reject the null hypothesis and added if we succeed in rejecting the null hypothesis.  For demonstration purposes, I will use significance levels $\alpha=0.05$ for deletions and $\alpha=0.10$ for additions.

Here is the code:
source('./stepwise.R') # load the stepwise function
data(swiss) # 47 observations of 6 variables, from dataset package
attach(swiss) # so we don't have to prefix every variable
# we will (arbitrarily) use alpha = 0.05 to add a variable and alpha = 0.10 to remove one
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1, 0.05, 0.10)
# the second run starts with the complete model and initially winnows it
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, 0.05, 0.10)
Created by Pretty R at inside-R.org

The output for the first call (which starts with just a constant term, and builds from there) is as follows:

Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 70.14255   1.822101 38.49542 1.212895e-36

S = 12.491697, R-sq = 0.000000, R-sq(adj) = 0.000000, C-p = 94.805296
=====

Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 79.6100585  2.1040971 37.835734 9.302464e-36
Education   -0.8623503  0.1448447 -5.953619 3.658617e-07

S = 9.446029, R-sq = 0.440616, R-sq(adj) = 0.428185, C-p = 35.204895
=====

Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 74.2336892 2.35197061 31.562337 7.349828e-32
Education   -0.7883293 0.12929324 -6.097219 2.428340e-07
Catholic     0.1109210 0.02980965  3.720974 5.598332e-04

S = 8.331442, R-sq = 0.574507, R-sq(adj) = 0.555167, C-p = 18.486158
=====

Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      48.67707330 7.91908348  6.146806 2.235983e-07
Education        -0.75924577 0.11679763 -6.500524 6.833658e-08
Catholic          0.09606607 0.02721795  3.529511 1.006201e-03
Infant.Mortality  1.29614813 0.38698777  3.349326 1.693753e-03

S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = 8.178162
=====

Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      62.1013116 9.60488611  6.465596 8.491981e-08
Education        -0.9802638 0.14813668 -6.617293 5.139985e-08
Catholic          0.1246664 0.02889350  4.314686 9.503030e-05
Infant.Mortality  1.0784422 0.38186621  2.824136 7.220378e-03
Agriculture      -0.1546175 0.06818992 -2.267454 2.856968e-02

S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
=====

So all the predictors other than Examination are added. Now here is the output from the second call:

Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03

S = 7.165369, R-sq = 0.706735, R-sq(adj) = 0.670971, C-p = 6.000000
=====
--- Dropping Examination

Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      62.1013116 9.60488611  6.465596 8.491981e-08
Agriculture      -0.1546175 0.06818992 -2.267454 2.856968e-02
Education        -0.9802638 0.14813668 -6.617293 5.139985e-08
Catholic          0.1246664 0.02889350  4.314686 9.503030e-05
Infant.Mortality  1.0784422 0.38186621  2.824136 7.220378e-03

S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
=====

Not content to leave well enough alone, let's see what happens if we start  with the "E" variables (Education and Examination) and go from there.  The code for the function call is

stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ Examination + Education, 0.05, 0.10)
Created by Pretty R at inside-R.org

and the output is

Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 85.2532753  3.0854981 27.630312 1.945244e-29
Examination -0.5572183  0.2319374 -2.402451 2.057160e-02
Education   -0.5394570  0.1924380 -2.803277 7.497224e-03

S = 8.981812, R-sq = 0.505485, R-sq(adj) = 0.483007, C-p = 28.135883
=====

Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      55.2746618  8.8077340  6.275696 1.451652e-07
Examination      -0.5108888  0.2063175 -2.476226 1.729132e-02
Education        -0.5225093  0.1709099 -3.057221 3.832793e-03
Infant.Mortality  1.4556114  0.4064507  3.581274 8.644778e-04

S = 7.973957, R-sq = 0.619096, R-sq(adj) = 0.592521, C-p = 14.252399
=====

Estimate Std. Error    t value     Pr(>|t|)
(Intercept)      50.02820666 8.66076269  5.7764204 8.325568e-07
Examination      -0.10580461 0.26036962 -0.4063631 6.865390e-01
Education        -0.70415772 0.17969218 -3.9186887 3.221868e-04
Infant.Mortality  1.30567908 0.39150335  3.3350393 1.790664e-03
Catholic          0.08631125 0.03649293  2.3651501 2.271709e-02

S = 7.579356, R-sq = 0.663865, R-sq(adj) = 0.631853, C-p = 9.993398
=====
--- Dropping Examination

Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      48.67707330 7.91908348  6.146806 2.235983e-07
Education        -0.75924577 0.11679763 -6.500524 6.833658e-08
Infant.Mortality  1.29614813 0.38698777  3.349326 1.693753e-03
Catholic          0.09606607 0.02721795  3.529511 1.006201e-03

S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = 8.178162
=====

Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      62.1013116 9.60488611  6.465596 8.491981e-08
Education        -0.9802638 0.14813668 -6.617293 5.139985e-08
Infant.Mortality  1.0784422 0.38186621  2.824136 7.220378e-03
Catholic          0.1246664 0.02889350  4.314686 9.503030e-05
Agriculture      -0.1546175 0.06818992 -2.267454 2.856968e-02

S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800
=====

All roads seem to lead to the same model. Also note that the constant term was not explicitly stated in the third call (no "1+" in the formulas). As usual with R, a constant term is assumed unless explicitly omitted (using "0 +").