Wednesday, October 21, 2015

In my last post (OLS Oddities), I mentioned that OLS linear regression could be done with multicollinear data using the Moore-Penrose pseudoinverse. I want to tidy up one small loose end.
Specifically, let $X$ be the matrix of predictor observations (including a column of ones if a constant term is desired), let $y$ be a vector of observations of the dependent variable, and suppose that you want to fit the model $y = X\beta + \epsilon$, where $\epsilon$ is the noise term and $\beta$ the coefficient vector. Solving the normal equations gives $$b = \left(X^\prime X\right)^{-1}X^\prime y,$$the least squares estimate of $\beta$, when $X$ has full column rank. If we let $M^+$ denote the Moore-Penrose pseudoinverse of matrix $M$ (which always exists and is unique), then $$\hat{b} = X^+ y$$yields fitted values $\hat{y} = X\hat{b}$ that are correct even when $X$ has less than full rank (i.e., when the predictors are multicollinear).
What if you replace the inverse with a pseudoinverse in the normal equations? In other words, suppose we let $$\tilde{b} = \left(X^\prime X\right)^+X^\prime y.$$Do we get the same fitted values $\hat{y}$?
We do, and in fact $\tilde{b} = \hat{b}$, i.e., both ways of using the pseudoinverse produce the same coefficient vector. The reason is that $$\left(X^\prime X\right)^+X^\prime = X^+.$$A proof is given in section 4.2 of the Wikipedia page "Proofs involving the Moore-Penrose pseudoinverse", so I won't bother to reproduce it here.
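Here is a minimal R sketch that checks the equivalence numerically. It uses ginv() from the MASS package to compute pseudoinverses, and a small made-up data set with a deliberately redundant column; the variable names are my own, purely for illustration.

library(MASS)   # for ginv(), the Moore-Penrose pseudoinverse
set.seed(1)
x1 <- rnorm(20)
x2 <- rnorm(20)
X <- cbind(1, x1, x2, x1 + x2)              # last column is redundant (multicollinear)
y <- 2 + x1 - x2 + rnorm(20)
b.hat <- ginv(X) %*% y                      # X^+ y
b.tilde <- ginv(t(X) %*% X) %*% t(X) %*% y  # (X'X)^+ X' y
all.equal(b.hat, b.tilde)                   # TRUE: same coefficient vector
all.equal(drop(X %*% b.hat),
          unname(fitted(lm(y ~ x1 + x2))))  # TRUE: same fits as OLS on the reduced model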
If $X$ is $m \times n$, the second approach will be preferable only if the computational cost of finding the pseudoinverse of the $n \times n$ matrix $X^\prime X$ is sufficiently less than the cost of finding the pseudoinverse of $X$ to offset the $O\left(mn^2\right)$ cost of the multiplication of $X^\prime$ and $X$. I don't know if that's true, particularly in some machine learning applications where, apparently, $n \gg m$. So I'll either stick to the simpler version (using $X^+$) or, more likely, continue with the time-honored tradition of weeding out redundant predictors before fitting the model.
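If you want to check the timing question on your own hardware, a crude comparison along these lines is easy enough. The sizes here are arbitrary, and the results will depend heavily on your BLAS/LAPACK build, so treat this as a sketch rather than a benchmark.

library(MASS)                              # for ginv()
m <- 5000
n <- 200
X <- matrix(rnorm(m * n), m, n)
system.time(ginv(X))                       # pseudoinverse of the m x n matrix
system.time(ginv(crossprod(X)) %*% t(X))   # n x n pseudoinverse plus the O(mn^2) products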
Sunday, October 18, 2015
OLS Oddities
During a couple of the lectures in the Machine Learning MOOC offered by Prof. Andrew Ng of Stanford University, I came across two statements about ordinary least squares linear regression (henceforth OLS) that surprised me. Given that I taught regression for years, I was surprised that I could be surprised (meta-surprised?), but these two facts are not ones you're likely to see in a statistics course.
The first is not too shocking in hindsight. Let $X$ be an $m \times n$ matrix containing $m$ observations of $n$ predictors (including a column of ones if you want a constant term in the model) and let $y$ be an $m \times 1$ vector of observations of the dependent variable. A linear model looks like $$y = Xb + e$$where $b$ is the estimated coefficient vector ($n \times 1$) and $e$ is the vector of residuals ($m \times 1$). By definition OLS computes $b$ to minimize the sum of squared residuals $$J(b) = e^\prime e.$$The so-called normal equations are derived by computing the gradient of $J$ and setting it equal to zero. Assuming $X$ has full column rank, this yields $$b = (X^\prime X)^{-1}X^\prime y.$$ Surprising (to me) fact #1: when dealing with large data sets, you may not want to use the nice, closed-form normal equations to compute $b$. The alternative is an iterative approach, using a nonlinear programming algorithm such as gradient descent, the conjugate gradient method, or something similar. The reason is that $(X^\prime X)^{-1}$ may be computationally expensive to compute when $m$ and particularly $n$ are large. In practice, good statistical software would not invert the matrix anyway; it would use some sort of decomposition (LU, QR, whatever), both for efficiency and to head off potential rounding problems. Apparently, though, the comparatively small amount of arithmetic required per iteration of gradient descent (mainly computing $Xb$ and $X^\prime e$, which are $O(mn)$ operations) offsets the cost of running a bunch of iterations rather than doing a one-and-done solution of a matrix equation.
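For concreteness, here is a bare-bones gradient descent sketch in R. It is entirely illustrative: the step size is an arbitrary choice, I minimize the mean rather than the sum of squared residuals so that the step size does not depend on $m$, and a real implementation would use something smarter (a line search, conjugate gradients, ...).

set.seed(123)
m <- 1000
n <- 5
X <- cbind(1, matrix(rnorm(m * (n - 1)), m, n - 1))  # includes a constant column
y <- drop(X %*% c(2, -1, 0.5, 3, -2)) + rnorm(m)     # made-up data
b <- rep(0, n)       # starting point
alpha <- 0.1         # fixed step size (an arbitrary choice for this example)
for (k in 1:2000) {
  e <- y - drop(X %*% b)            # residuals: O(mn) work per iteration
  g <- -2 * drop(t(X) %*% e) / m    # gradient of the mean squared residual: O(mn) work
  b <- b - alpha * g
}
round(cbind(gradient.descent = b,
            normal.equations = drop(solve(t(X) %*% X, t(X) %*% y))), 4)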
The bigger surprise, though, had to do with multicollinearity, which occurs when $X$ has less than full column rank. Multicollinearity means $X^\prime X$ is singular and cannot be inverted. It also means the model contains redundant predictors (some predictors are linear combinations of others), and I always gave my students the standard prescription: figure out which predictors were redundant and eliminate them. Other people sometimes recommend a perturbation approach (ridge regression). For that matter, gradient descent should work properly with multicollinear data.
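A quick illustration in R (the data are made up): with a redundant predictor, $X$ loses rank and the inverse in the normal equations no longer exists.

set.seed(42)
x1 <- rnorm(20)
x2 <- rnorm(20)
X <- cbind(1, x1, x2, 2 * x1 - x2)   # last predictor is a linear combination of the others
qr(X)$rank                           # 3, not 4: less than full column rank
try(solve(t(X) %*% X))               # error: X'X is (numerically) singular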
Before continuing, I need to inject a well-established if not necessarily well-known fact about multicollinearity. Whether $X$ is multicollinear or not, there is a unique vector $\hat{y}$ of fitted values that minimizes the sum of squared errors. This is because $\hat{y}$ is the orthogonal projection of $y$ onto the column space of $X$, and the orthogonal projection of any vector onto any linear subspace is unique. When $X$ has full column rank, $\hat{y} = Xb$ where $b$ is given by the normal equations (and is the unique coefficient vector that generates $\hat{y}$). When $X$ is less than full rank (multicollinear), the optimal fitted values $\hat{y}$ can be obtained from any of an infinite number of coefficient vectors ... but the fits are still uniquely determined.
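You can see this directly in R: lm() reacts to a rank-deficient design by dropping an aliased column (which one depends on the order of the terms), yet the fitted values come out the same either way. A small made-up example:

set.seed(7)
x1 <- rnorm(20)
x2 <- rnorm(20)
x3 <- x1 + x2                # redundant predictor
y <- 1 + x1 - x2 + rnorm(20)
m1 <- lm(y ~ x1 + x2 + x3)   # lm() drops x3 (its coefficient is NA)
m2 <- lm(y ~ x3 + x2 + x1)   # lm() drops x1 instead
coef(m1)                     # one coefficient vector
coef(m2)                     # a different one
all.equal(fitted(m1), fitted(m2))   # TRUE: the projection is unique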
The big surprise for me: it turns out that you can solve a version of the normal equations using the Moore-Penrose pseudoinverse of $X$. I'd come across the pseudoinverse in some long-ago linear algebra or numerical analysis class, filed it away as another mathematical curiosity of no particular use, and forgotten it. Oops.
Let $A$ be the pseudoinverse of $X$ (which always exists, is unique, and has dimensions $n \times m$). Then $\hat{b} = Ay$ yields the correct fitted values $X \hat{b}$. A proof begins with one of the identities satisfied by the pseudoinverse: $$X^\prime = X^\prime X A.$$Now the gradient $\nabla_b J(b)$ works out to $-2X^\prime e$; substitute $y - XAy$ for $e$ and $\hat{b}$ for $b$, and we have \begin{alignat*}{1} \nabla_{b}J(\hat{b}) & =-2X^{\prime}e\\ & =-2X^{\prime}\left(y-X\hat{b}\right)\\ & =-2X^{\prime}\left(y-XAy\right)\\ & =-2X^{\prime}(I-XA)y\\ & =-2\left(X^{\prime}-X^{\prime}XA\right)y. \end{alignat*}Applying the identity above to the last expression gives $\nabla_{b}J(\hat{b}) = 0$. Since $J$ is convex in $b$, a zero gradient is sufficient: choosing $b=\hat{b}$ minimizes the sum of squared residuals and thus gives the proper fitted values.
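To convince myself, here is a quick numerical check in R (ginv() from the MASS package supplies the pseudoinverse; the data are fabricated): the gradient condition $X^\prime e = 0$ holds at $\hat{b} = Ay$, and the fitted values match what lm() produces after it discards the redundant column.

library(MASS)                        # for ginv()
set.seed(321)
x1 <- rnorm(25)
x2 <- rnorm(25)
X <- cbind(1, x1, x2, x1 - x2)       # rank-deficient design matrix
y <- 3 + 2 * x1 - x2 + rnorm(25)
b.hat <- drop(ginv(X) %*% y)         # A y, with A the pseudoinverse of X
max(abs(t(X) %*% (y - X %*% b.hat))) # essentially zero: X'e = 0 at b.hat
all.equal(unname(fitted(lm(y ~ x1 + x2))),
          drop(X %*% b.hat))         # TRUE: same fitted values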
All that said, I would still rather eliminate redundant predictors. That process can be automated, but if you are writing a machine learning algorithm that will chew through potentially large data sets with no human supervision, and you want to use linear (or polynomial) models, I suppose you should at least consider using the entire data set with the pseudoinverse taking care of the multicollinearity.
Tuesday, October 6, 2015
Producing Reproducible R Code
A tip in the Google+ Statistics and R community led me to the reprex package for R. Quoting the author (Professor Jennifer Bryan, University of British Columbia), the purpose of reprex is to
[r]ender reproducible example code to Markdown suitable for use in code-oriented websites, such as StackOverflow.com or GitHub.

Much has been written about the virtues of, and need for, reproducible research. Another key need for reproducibility, one at which this package aims, is when posting questions about code or bug reports. Viewers of those posts need to know exactly what you did and exactly what resulted. The readme text on the package's GitHub home page gives a work flow description and some prescriptive advice, which I think is well worth reading.
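To give a flavor of what the package does, the readme's "hello world" is, as best I recall, roughly along these lines (the exact call may differ in your version): you hand reprex() a bit of code and it renders the code plus its output as Markdown.

library(reprex)
# render a tiny snippet passed directly, rather than read from the clipboard;
# the rendered Markdown lands on the clipboard and in the RStudio viewer
reprex(rbinom(3, size = 10, prob = 0.5))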
I'm all for complete and cogent bug reports/code questions and reproducible research, but I was interested in reprex for another reason: formatting R code for blog posts (such as this one). To date I've been using a third-party web site (the Pretty R syntax highlighter) to generate HTML from R code, and I've been quite happy with the results. A simpler process would be nice, though. Additionally, while the aforementioned site works great with the code, I'm sometimes not sure how I should format the output.
So I decided to take reprex for a test drive using code from an old post here (Tabulating Prediction Intervals in R). I used just the code from the first part of the post (the definition of the model.ctable() function and one invocation of it), a total of 17 lines of source code (including Roxygen comments for the function) leading to a single output table. Using RStudio, my work flow was as follows.
1. Open a new script file and type/paste the code into it.
2. Source the file to confirm it works as expected.
3. Copy the code to the clipboard.
4. In the console window, run the following two lines:

library(reprex)
reprex()

This runs the code in the clipboard, so be careful not to do anything to modify the clipboard contents between the previous step and this one.
5. Examine the results in the viewer pane (which opens automatically) to confirm that everything is as expected.
6. Open a new R Markdown file, delete the boilerplate RStudio inserts, and paste the contents of the clipboard into it. Along with displaying results in the viewer, the reprex() function also places the R Markdown code for it in the clipboard. Again, be careful not to modify the clipboard contents between step 4 and this one.
7. Click the "Knit HTML" button and provide a destination file for the HTML output. This opens an HTML source file in RStudio.
8. Copy the contents of the body tag (excluding the opening and closing body tags and ignoring the pile of stuff in the header) and paste into an HTML document. (Depending on the width of the output, you might want to surround it with a scrolling DIV tag, or hack the CSS you just pasted in to make it scrollable and/or give it a border.)
In my case, I used the following style on the pasted block:

overflow: scroll;
border-style: groove;
border-width: 5px;
padding: 10px;
That created a border and a bit of padding, and told the browser to add scroll bars if needed. Here is how my example turned out:
Summarize a fitted linear model, displaying both coefficient significance and confidence intervals.
@param model an instance of class lm @param level the confidence level (default 0.95)
@return a matrix combining the coefficient summary and confidence intervals
model.ctable <- function(model, level = 0.95) {
cbind(summary(model)$coefficients, confint(model, level = level))
}
x <- rnorm(20)
y <- rnorm(20)
z <- 6 + 3 * x - 5 * y + rnorm(20)
m <- lm(z ~ x + y)
model.ctable(m, level = 0.9)
#> Estimate Std. Error t value Pr(>|t|) 5 %
#> (Intercept) 6.271961 0.2462757 25.46724 5.584261e-15 5.843539
#> x 2.974000 0.2571237 11.56642 1.763158e-09 2.526706
#> y -4.951286 0.3260552 -15.18542 2.547338e-11 -5.518494
#> 95 %
#> (Intercept) 6.700384
#> x 3.421294
#> y -4.384079
You can see the comments, the code and, at the end, the output (formatted as R comments). It's not perfect. In particular, it would be nice if the Roxygen comments looked like comments and not like text. There's also no syntax highlighting (which is to be expected in an R Markdown document). Still, it's not bad for a blog post, and it confirms the package works (and is easy to use).
I'll close by pointing out that I'm going "off label" by using the package this way. In particular, I'm getting no value from one of the prime virtues of R Markdown: the ability to embed code in a text document such that the code can be easily read but can also be executed by "compiling" the document (not true of an HTML document like this post). For posting code to a forum, though, this looks like a definite keeper.