## Wednesday, October 21, 2015

### Regression Via Pseudoinverse

In my last post (OLS Oddities), I mentioned that OLS linear regression could be done with multicollinear data using the Moore-Penrose pseudoinverse. I want to tidy up one small loose end.

Specifically, let $X$ be the matrix of predictor observations (including a column of ones if a constant term is desired), let $y$ be a vector of observations of the dependent variable, and suppose that you want to fit the model $y = X\beta + \epsilon$ where $\epsilon$ is the noise term and $\beta$ the coefficient vector. The normal equations $$b = \left(X^\prime X\right)^{-1}X^\prime y$$produce the least squares estimate of $\beta$ when $X$ has full column rank. If we let $M^+$ denote the Moore-Penrose pseudoinverse of matrix $M$ (which always exists and is unique), then $$\hat{b} = X^+ y$$results in $\hat{y} = X\hat{b}$ giving the correct fitted values even when $X$ has less than full rank (i.e., when the predictors are multicollinear).

What if you replace the inverse with a pseudoinverse in the normal equations? In other words, suppose we let $$\tilde{b} = \left(X^\prime X\right)^+X^\prime y.$$Do we get the same fitted values $\hat{y}$?

We do, and in fact $\tilde{b} = \hat{b}$, i.e., both ways of using the pseudoinverse produce the same coefficient vector. The reason is that $$\left(X^\prime X\right)^+X^\prime = X^+.$$A proof is given in section 4.2 of the Wikipedia page of "Proofs involving the Moore-Penrose pseudoinverse", so I won't bother to reproduce it here.

If $X$ is $m \times n$, the second approach will be preferable only if the computational cost of finding the pseudoinverse of the $n \times n$ matrix $X^\prime X$ is sufficiently less than the cost of finding the pseudoinverse of $X$ to offset the $O\left(mn^2\right)$ cost of the multiplication of $X^\prime$ and $X$. I don't know if that's true, particularly in some machine learning applications where, apparently, $n >> m$. So I'll either stick to the simpler version (using $X^+$) or, more likely, continue with the time-honored tradition of weeding out redundant predictors before fitting the model.