During one of the lectures I learned something new that was both interesting and had the side effect of making me feel old (a job my knees are entirely up to on their own, thank you very much). Ridge regression has apparently been picked up from the scrap heap of statistical history and repurposed.
Stiffness: it's not just for us old people
To explain ridge regression as I originally learned it, let me start by defining the phrase ill-conditioned matrix. Let $A$ be a square matrix (we'll operate over the reals, although I think this works just as well for complex matrices) and let $\left\Vert \cdot\right\Vert $ be a consistent matrix norm. The condition number of $A$, assuming $A$ is nonsingular, is $$\kappa = \left\Vert A\vphantom{^{-1}}\right\Vert \left\Vert A^{-1}\right\Vert \ge 1.$$ For singular matrices, $\kappa$ is either undefined or $\infty$, according to your religious preferences.
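For the numerically inclined, here is a tiny NumPy sketch (my own toy example, not anything from the course) that computes $\kappa$ straight from the definition, using the induced 2-norm, and compares it with NumPy's built-in routine:

```python
import numpy as np

# A small matrix whose rows are nearly linearly dependent.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])

# Condition number straight from the definition (induced 2-norm).
kappa_def = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)

# NumPy's built-in version (it also uses the 2-norm by default).
kappa_builtin = np.linalg.cond(A)

print(kappa_def, kappa_builtin)  # both roughly 4e4 for this matrix
```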
Somewhat informally, $\kappa$ can be interpreted as a measure of how much rounding errors may be inflated when inverting $A$ (using floating-point arithmetic ... "pure" mathematicians don't commit rounding errors). An ill-conditioned or "stiff" matrix $A$ is nonsingular but has a large condition number (the criterion for "large" being rather fuzzy), meaning that rounding errors during the process of inverting (or factoring, or otherwise torturing) $A$ may result in an answer with lots of incorrect digits. Long ago, a graduate school instructor told my class the story of an engineer who had inverted an ill-conditioned matrix, without realizing it was ill-conditioned, using home-brew FORTRAN code. He'd published the inverse in a journal article to four or five decimal places. All the decimal places in most of the coefficients were wrong. So were lots of digits to the left of the decimal points.
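If you want to see the inflation in action, the following sketch (again just an illustration of mine) inverts a Hilbert matrix, a textbook example of a stiff matrix, and checks how far the computed product $H H^{-1}$ drifts from the identity in double precision:

```python
import numpy as np

n = 12
# The n x n Hilbert matrix H[i, j] = 1 / (i + j + 1), a classic stiff matrix.
H = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1.0)

print("condition number:", np.linalg.cond(H))  # on the order of 1e16

# Invert in double precision and see how badly H @ inv(H) misses the identity.
H_inv = np.linalg.inv(H)
max_error = np.abs(H @ H_inv - np.eye(n)).max()

# The deviation is many orders of magnitude above machine epsilon (~2.2e-16).
print("max deviation from identity:", max_error)
```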
Ridge regression then
Here comes the tie-in to regression. Suppose we are trying to fit the model $$y=x'\beta+\epsilon$$(where $\epsilon$ is random noise with mean zero), and we have collected observations $(X,Y)$ with $X$ an $n\times m$ matrix ($n$ = sample size, $m$ = number of predictors) and $Y$ the corresponding $n$-vector of responses. The normal equations tell us that the least squares estimate for $\beta$ is $$b=(X^\prime X)^{-1}X^\prime Y.$$In theory, $X^\prime X$ is nonsingular unless the predictors are perfectly multicollinear. In practice, if $X^\prime X$ is nonsingular but stiff, our coefficient estimate $b$ can suffer drastic rounding errors. To mitigate those errors, ridge regression tampers with $X^\prime X$, adding a small positive multiple of the identity matrix of dimension $m$. The modified normal equations are$$b_\lambda=(X^\prime X + \lambda I)^{-1}X^\prime Y.$$The result $b_\lambda$ is a deliberately biased estimate of $\beta$, in which we trade the bias for reduced rounding error. We generally choose $\lambda > 0$ as small as possible subject to the need to stabilize the numerics.
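Here is a small NumPy illustration of the difference (made-up data, and an arbitrary choice of $\lambda$ on my part): with two nearly collinear predictors, the plain normal-equations solve and the ridge solve can give noticeably different coefficient vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two nearly collinear predictors, so X'X is nonsingular but stiff.
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)
X = np.column_stack([x1, x2])
Y = 2.0 * x1 - 1.0 * x2 + 0.1 * rng.normal(size=n)

XtX, XtY = X.T @ X, X.T @ Y
print("condition number of X'X:", np.linalg.cond(XtX))

# Ordinary least squares: b = (X'X)^{-1} X'Y.
b_ols = np.linalg.solve(XtX, XtY)

# Ridge: b_lambda = (X'X + lambda * I)^{-1} X'Y, with a small positive lambda.
lam = 1e-4
b_ridge = np.linalg.solve(XtX + lam * np.eye(X.shape[1]), XtY)

print("OLS coefficients:  ", b_ols)
print("ridge coefficients:", b_ridge)
```

In runs like this, the OLS coefficients tend to blow up in large offsetting directions, while the ridge coefficients, biased though they are, stay modest.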
When I first encountered ridge regression, we were in the (first) mainframe era. Machine words were small, single-precision arithmetic ruled the computing universe, and rounding errors were relatively easy to generate. Somewhere along the way, computing hardware got smaller and cheaper, double precision arithmetic became the norm and separate floating point units took over the computational load. At the same time, matrix factorization replaced matrix inversion, and factoring algorithms got smarter about managing rounding error. The problems with stiff $X^\prime X$ matrices seemed largely to disappear, and references to ridge regression became uncommon, at least in my experience.
Ridge regression now
In the Statistical Learning course, ridge regression makes a resurgence, with an entirely different purpose. In the era of Big Data, there is a distinct danger that we may overfit data by throwing in lots and lots of predictors (large dimension $m$). One way to mitigate that temptation is to penalize the number or size of the coefficients in the model.
Solving the normal equations minimizes the sum of squared deviations in a linear regression model. In other words, the ordinary least squares estimate of $\beta$ solves the problem$$\min_b \Vert Y-X b\Vert^2.$$Given that we are squaring errors, and given that quadratic objective functions have nice smoothness, a tempting way to penalize overfitting is to add a penalty for the size of the coefficient vector:$$\min_b \Vert Y-X b\Vert^2 + \lambda \Vert b \Vert^2$$where $\lambda > 0$ is a penalty parameter. If $\lambda = 0$, we are back to ordinary least squares. As $\lambda \rightarrow \infty$, $\Vert b \Vert$ will converge toward (but never quite reach) 0.
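To see the shrinkage numerically, here is a quick sketch (synthetic data of my own devising) that minimizes the penalized objective directly with a general-purpose optimizer and prints $\Vert b \Vert$ as $\lambda$ grows:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=50)

def penalized_ssq(b, lam):
    """Sum of squared deviations plus the lambda * ||b||^2 penalty."""
    resid = Y - X @ b
    return resid @ resid + lam * (b @ b)

# The norm of the minimizer shrinks toward zero as lambda increases.
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    result = minimize(penalized_ssq, x0=np.zeros(5), args=(lam,))
    print(f"lambda = {lam:7.1f}   ||b|| = {np.linalg.norm(result.x):.4f}")
```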
I'll skip the algebra and just state that solving this problem for fixed $\lambda$ is mathematically equivalent to ridge regression as seen above. The only difference is in the choice of $\lambda$. When we were fighting rounding error, we wanted the smallest $\lambda$ that would clean up most of the rounding error. Now, when we are fighting overfitting, we probably need larger values of $\lambda$ to get the job done.
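If you don't trust the skipped algebra, you can check the equivalence numerically. One way (just a sanity check on made-up data) is to note that the penalized problem is an ordinary least squares problem on an augmented data set, obtained by appending $\sqrt{\lambda}\, I$ to $X$ and a vector of zeros to $Y$, and then compare that solution with the ridge formula:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=50)
lam = 10.0

# Ridge regression via the modified normal equations.
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ Y)

# The penalized least squares problem, rewritten as ordinary least squares
# on augmented data: append sqrt(lambda) * I rows to X and zeros to Y.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(5)])
Y_aug = np.concatenate([Y, np.zeros(5)])
b_penalized, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

print(np.allclose(b_ridge, b_penalized))  # True
```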
So everything old is new again ... which bodes well for my knees.