Tuesday, March 8, 2011

A Vote for VIFs

Shown below is output from a multiple linear regression model run in Minitab:

Predictor      Coef  SE Coef      T      P
Constant     2.4765   0.7055   3.51  0.001
DIAMETER    -0.7880   0.1826  -4.31  0.000
WEIGHT      0.44314  0.03633  12.20  0.000

Other statistics programs I've used, or whose output I've seen, may format things differently, but they generally stick to the same fundamental ingredients: the estimated coefficient; the standard error of that coefficient; the T-ratio (the coefficient divided by its standard error); and the p-value of that ratio.  Note that there is a certain redundancy here.  If I know any two of the first three items, I can figure out the third rather easily.  If I know the T-ratio (and the degrees of freedom), I can get the p-value.  (Going the other direction is subject to accuracy problems, since small p-values tend to be rounded or truncated for display, as with the 0.000 entries above.)
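As a quick check of that redundancy, here is a minimal Python sketch that recomputes the T-ratio from the first two columns of the output above.  The function names (t_ratio, se_from_t) are illustrative, not part of any package.  Because the printed coefficients and standard errors are themselves rounded, the recomputed ratios should only be expected to match the printed T column to within about 0.01.  The p-value is left out, since it also requires the residual degrees of freedom, which this output does not show.

```python
# Rows copied from the Minitab output: (coefficient, standard error, printed T).
rows = {
    "Constant": (2.4765, 0.7055, 3.51),
    "DIAMETER": (-0.7880, 0.1826, -4.31),
    "WEIGHT": (0.44314, 0.03633, 12.20),
}


def t_ratio(coef, se):
    """T-ratio for the null hypothesis that the coefficient is zero."""
    return coef / se


def se_from_t(coef, t):
    """Recover the standard error from the coefficient and its T-ratio."""
    return abs(coef / t)


# Recompute the T column from the Coef and SE Coef columns.
recomputed = {name: t_ratio(coef, se) for name, (coef, se, _) in rows.items()}

for name, (coef, se, printed_t) in rows.items():
    print(f"{name:<9} recomputed T = {recomputed[name]:7.3f}   printed T = {printed_t:6.2f}")
```

Going the other way, se_from_t(coef, printed_t) gives back the standard error, again only up to the rounding in the printed values.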

Redundancy is not really the issue, though.  What do we care about in that output?  Certainly we want the coefficient, and quite possibly we want the standard error (if we are going to compute confidence intervals by hand, for instance).  If we care about statistical significance (and we generally should), we want to see the p-value.  Since the sole reason for computing the T-ratio is to get to the p-value, I consider it a waste of space.  I suspect it is a holdover from bygone days when software did not compute p-values, and users had to look up the T-ratio in tables.

Meanwhile, there is something missing from the standard output that I think really should be there:  the variance inflation factors (VIFs).  VIFs give you a quick diagnostic check of whether you need to be concerned about multicollinearity.  In a doctoral seminar on regression that I've taught in recent years, we read papers from the business and social science literature in which regression models are presented (and used to test hypotheses).  The vast majority of these papers contain no indication that the authors checked for multicollinearity.  In a few cases, the authors state that because the correlation matrix of the predictors exhibits no large pairwise correlations, multicollinearity is not a concern.  This sense of security is entirely unfounded when there are more than two predictors:  one predictor can be very nearly a linear combination of several others without being strongly correlated with any single one of them, so multicollinearity can occur without any large pairwise correlations.
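The point about pairwise correlations can be illustrated with a small constructed example (hypothetical, not taken from any of the papers mentioned).  Suppose four predictors are uncorrelated with unit variance, and a fifth is their sum plus a little independent noise.  The VIF of predictor j is 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the others, and it equals the j-th diagonal entry of the inverse of the predictors' correlation matrix.  The sketch below uses that identity on the population correlation matrix, with a hand-rolled Gauss-Jordan inverse so that it needs nothing beyond the Python standard library; build_corr, invert and vifs are illustrative names.

```python
import math


def build_corr(k, noise_var):
    """Population correlation matrix for k uncorrelated unit-variance
    predictors plus one extra predictor: their sum plus independent noise."""
    # Correlation between the extra predictor and each of the first k.
    c = 1.0 / math.sqrt(k + noise_var)
    n = k + 1
    corr = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(k):
        corr[i][k] = c
        corr[k][i] = c
    return corr


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I] becomes [I | A^-1].
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (perfect multicollinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(corr):
    """VIFs are the diagonal entries of the inverse correlation matrix."""
    inv = invert(corr)
    return [inv[j][j] for j in range(len(corr))]


corr = build_corr(k=4, noise_var=0.04)
n = len(corr)
max_pairwise = max(abs(corr[i][j]) for i in range(n) for j in range(n) if i != j)
vif_values = vifs(corr)

print(f"largest pairwise correlation: {max_pairwise:.3f}")
print("VIFs:", [round(v, 1) for v in vif_values])
```

In this setup no pairwise correlation reaches 0.5, yet every VIF is well above 10, a commonly cited (if rough) warning level.  With sample data rather than a population matrix, the same vifs function could be applied to the sample correlation matrix of the predictors.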

Perhaps if we saw VIFs every time we ran a multiple regression, we would get in the habit of spot-checking for multicollinearity.  Most software will produce VIFs, but you may have to dig to find out how to get them.  (In R, for instance, I have to load the "car" package to get its vif function.)  If software vendors are concerned about real estate on the screen/page, I nominate the T-ratio column as something we can sacrifice to make room for VIFs.