Decisions to add or delete a variable in my function are based on an F-test comparing the model with and without the variable. In both cases, the null hypothesis is that the variable's true coefficient is zero, so a variable is deleted if we fail to reject the null hypothesis and added if we succeed in rejecting the null hypothesis. For demonstration purposes, I will use significance levels $\alpha=0.05$ for deletions and $\alpha=0.10$ for additions.

Here is the code:

source('./stepwise.R') # load the stepwise function data(swiss) # 47 observations of 6 variables, from dataset package attach(swiss) # so we don't have to prefix every variable # we will (arbitrarily) use alpha = 0.05 to add a variable and alpha = 0.10 to remove one # the first run will start with just a constant term stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1, 0.05, 0.10) # the second run starts with the complete model and initially winnows it stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, 0.05, 0.10)

The output for the first call (which starts with just a constant term, and builds from there) is as follows:

Estimate Std. Error t value Pr(>|t|) (Intercept) 70.14255 1.822101 38.49542 1.212895e-36 S = 12.491697, R-sq = 0.000000, R-sq(adj) = 0.000000, C-p = 94.805296 ===== +++ Adding Education Estimate Std. Error t value Pr(>|t|) (Intercept) 79.6100585 2.1040971 37.835734 9.302464e-36 Education -0.8623503 0.1448447 -5.953619 3.658617e-07 S = 9.446029, R-sq = 0.440616, R-sq(adj) = 0.428185, C-p = 35.204895 ===== +++ Adding Catholic Estimate Std. Error t value Pr(>|t|) (Intercept) 74.2336892 2.35197061 31.562337 7.349828e-32 Education -0.7883293 0.12929324 -6.097219 2.428340e-07 Catholic 0.1109210 0.02980965 3.720974 5.598332e-04 S = 8.331442, R-sq = 0.574507, R-sq(adj) = 0.555167, C-p = 18.486158 ===== +++ Adding Infant.Mortality Estimate Std. Error t value Pr(>|t|) (Intercept) 48.67707330 7.91908348 6.146806 2.235983e-07 Education -0.75924577 0.11679763 -6.500524 6.833658e-08 Catholic 0.09606607 0.02721795 3.529511 1.006201e-03 Infant.Mortality 1.29614813 0.38698777 3.349326 1.693753e-03 S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = 8.178162 ===== +++ Adding Agriculture Estimate Std. Error t value Pr(>|t|) (Intercept) 62.1013116 9.60488611 6.465596 8.491981e-08 Education -0.9802638 0.14813668 -6.617293 5.139985e-08 Catholic 0.1246664 0.02889350 4.314686 9.503030e-05 Infant.Mortality 1.0784422 0.38186621 2.824136 7.220378e-03 Agriculture -0.1546175 0.06818992 -2.267454 2.856968e-02 S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800 =====

So all the predictors other than Examination are added. Now here is the output from the second call:

Estimate Std. Error t value Pr(>|t|) (Intercept) 66.9151817 10.70603759 6.250229 1.906051e-07 Agriculture -0.1721140 0.07030392 -2.448142 1.872715e-02 Examination -0.2580082 0.25387820 -1.016268 3.154617e-01 Education -0.8709401 0.18302860 -4.758492 2.430605e-05 Catholic 0.1041153 0.03525785 2.952969 5.190079e-03 Infant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03 S = 7.165369, R-sq = 0.706735, R-sq(adj) = 0.670971, C-p = 6.000000 ===== --- Dropping Examination Estimate Std. Error t value Pr(>|t|) (Intercept) 62.1013116 9.60488611 6.465596 8.491981e-08 Agriculture -0.1546175 0.06818992 -2.267454 2.856968e-02 Education -0.9802638 0.14813668 -6.617293 5.139985e-08 Catholic 0.1246664 0.02889350 4.314686 9.503030e-05 Infant.Mortality 1.0784422 0.38186621 2.824136 7.220378e-03 S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800 =====

We start with the full model, delete Examination, and stop there.

Not content to leave well enough alone, let's see what happens if we start with the "E" variables (Education and Examination) and go from there. The code for the function call is

stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ Examination + Education, 0.05, 0.10)

and the output is

Estimate Std. Error t value Pr(>|t|) (Intercept) 85.2532753 3.0854981 27.630312 1.945244e-29 Examination -0.5572183 0.2319374 -2.402451 2.057160e-02 Education -0.5394570 0.1924380 -2.803277 7.497224e-03 S = 8.981812, R-sq = 0.505485, R-sq(adj) = 0.483007, C-p = 28.135883 ===== +++ Adding Infant.Mortality Estimate Std. Error t value Pr(>|t|) (Intercept) 55.2746618 8.8077340 6.275696 1.451652e-07 Examination -0.5108888 0.2063175 -2.476226 1.729132e-02 Education -0.5225093 0.1709099 -3.057221 3.832793e-03 Infant.Mortality 1.4556114 0.4064507 3.581274 8.644778e-04 S = 7.973957, R-sq = 0.619096, R-sq(adj) = 0.592521, C-p = 14.252399 ===== +++ Adding Catholic Estimate Std. Error t value Pr(>|t|) (Intercept) 50.02820666 8.66076269 5.7764204 8.325568e-07 Examination -0.10580461 0.26036962 -0.4063631 6.865390e-01 Education -0.70415772 0.17969218 -3.9186887 3.221868e-04 Infant.Mortality 1.30567908 0.39150335 3.3350393 1.790664e-03 Catholic 0.08631125 0.03649293 2.3651501 2.271709e-02 S = 7.579356, R-sq = 0.663865, R-sq(adj) = 0.631853, C-p = 9.993398 ===== --- Dropping Examination Estimate Std. Error t value Pr(>|t|) (Intercept) 48.67707330 7.91908348 6.146806 2.235983e-07 Education -0.75924577 0.11679763 -6.500524 6.833658e-08 Infant.Mortality 1.29614813 0.38698777 3.349326 1.693753e-03 Catholic 0.09606607 0.02721795 3.529511 1.006201e-03 S = 7.505417, R-sq = 0.662544, R-sq(adj) = 0.639000, C-p = 8.178162 ===== +++ Adding Agriculture Estimate Std. Error t value Pr(>|t|) (Intercept) 62.1013116 9.60488611 6.465596 8.491981e-08 Education -0.9802638 0.14813668 -6.617293 5.139985e-08 Infant.Mortality 1.0784422 0.38186621 2.824136 7.220378e-03 Catholic 0.1246664 0.02889350 4.314686 9.503030e-05 Agriculture -0.1546175 0.06818992 -2.267454 2.856968e-02 S = 7.168166, R-sq = 0.699348, R-sq(adj) = 0.670714, C-p = 5.032800 =====

All roads seem to lead to the same model. Also note that the constant term was not explicitly stated in the third call (no "1+" in the formulas). As usual with R, a constant term is assumed unless explicitly omitted (using "0 +").

**Update (08/21/17)**: I've posted a new version of the function, together with a new version of this demo, in a single R notebook.

Hi, I used your code with my data and I end up with following error:

ReplyDeleteError in if (pmax > alpha.to.leave) { :

missing value where TRUE/FALSE needed

even though I replaced all missing values beforehead with zeros.. Could you suggest what might be the solution as I am not an expert for R.. Thanks!

If a model results in a perfect fit (residual variance = 0), I think the drop1 F test generates NAs for all p-values. I'm not sure off the top of my head whether NAs for p-values can occur due to multicollinearity even when the fit is not perfect. You can try changing "pmax > alpha.to.leave" in the if statement to "!is.na(pmax) & pmax > alpha.to.leave", which should screen out the pathological case where pmax = NA (because all p-values are NAs).

Delete