I've come to like
R quite a bit for statistical computing, even though as a language it can be rather quirky. (Case in point: The
anova() function compares two or more models using analysis of variance; if you want to fit an
ANOVA model, you need to use the
aov() function.) I don't use it that often, though, which is a mixed blessing. The bad news is that infrequent use makes it hard for me to remember everything I've learned about the language. The good news is that infrequent use means I'm not having to do statistical analysis very often.
I don't think I'm alone in believing that consistent coding patterns (paradigms, idioms, whatever you want to call them) are very helpful when using a language infrequently. That motivates today's post, on testing the significance of a regression model. By model significance, I mean (in somewhat loose terms) testing
H0: the null model (no predictors other than a constant term) fits the data at least as well as our model
versus
H1: our model fits the data better than the null model.
When performing a standard linear regression, the usual test of model significance is an
F-test. As with most (all?) statistics packages, R helpfully prints out the
p-value for this test in the summary output of the regression, so you can see whether your model is (literally) better than nothing without any extra work. To test whether a second model (call it
model2) significantly improves on model1, you use the anova() command:
anova(model1, model2)
which is easy enough to remember.
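For instance, a minimal sketch (dat, y, x1, and x2 are all hypothetical names):
model1 <- lm(y ~ x1, data = dat)        # fit the initial model
summary(model1)                         # the F statistic and its p-value appear on the last line of output
model2 <- lm(y ~ x1 + x2, data = dat)   # a nested model that adds x2
anova(model1, model2)                   # F test: does adding x2 significantly improve the fit?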
When performing a
generalized linear regression, however, R does not automatically give you a model significance test. I'll focus here on a
binary logit model (binary dependent variable), but I'm pretty sure the various approaches apply to other uses of the GLM, perhaps with some tweaks.
Let's say that
model1 is a binary logistic regression model I've fitted in R. The most common test for significance of a binary logistic model is a
chi-square test, based on the change in
deviance when you add your predictors to the null model. R will automatically calculate the deviance for both your model and the null model when you run the
glm() command to fit the model. The approach to testing significance that I've seen on a number of web pages, including
this one, involves calculating the p-value manually, using some variation of the following syntax:
with(model1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
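Unpacked a bit, that one-liner is equivalent to the following (each piece is a component of the fitted glm object):
chi2 <- model1$null.deviance - model1$deviance   # drop in deviance from the null model
df   <- model1$df.null - model1$df.residual      # degrees of freedom = number of added predictors
pchisq(chi2, df, lower.tail = FALSE)             # upper-tail probability of the chi-square statistic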
That's fine (nice and compact), but the prospects of my remembering it are slender at best. Fortunately, we can use the aforementioned
anova() command by manually fitting the null model. First rerun the logistic regression using just a constant term. Call the resulting fit
null.model. Now compare
null.model to
model1 using the
anova() command, adding an argument to tell R that you want a chi-square test (by default, anova() performs no test at all on GLM fits):
anova(null.model, model1, test = "Chisq")
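In full, assuming model1 came from something like glm(y ~ x1 + x2, data = dat, family = binomial) (names hypothetical):
null.model <- glm(y ~ 1, data = dat, family = binomial)   # constant term only
anova(null.model, model1, test = "Chisq")                 # analysis of deviance: is model1 significant?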
You can also use the same syntax to compare two fitted logistic models for the same data, say where
model2 adds some predictors to
model1. For me, that's a lot easier to remember than the manual approach.
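Concretely, with a hypothetical extra predictor x3:
model2 <- glm(y ~ x1 + x2 + x3, data = dat, family = binomial)   # model1 plus x3
anova(model1, model2, test = "Chisq")                            # does adding x3 significantly improve the fit?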
Here's some heavily annotated code (or you can
download it), if you want to see an example: