A while back, I wrote a Shiny application for a colleague who teaches basic linear regression in a few courses. She recently asked me to add an ANOVA table. I thought the change would be trivial -- R has at least two functions in the stats package, anova() and aov(), for generating ANOVA output -- but there was a catch. Because the students start in Excel before switching to the Shiny app, she wanted me to mimic Excel's output.
The following is Excel's version of an ANOVA table for a small model with independent variables X1 and X2 and dependent variable Y.
| df | Sum of Squares | Mean Square | F | Significance F | |
|---|---|---|---|---|---|
| Regression | 2 | 193.4887725 | 96.74438625 | 334.348849898829 | 6.64178639346989E-12 |
| Residual | 13 | 3.76157124999999 | 0.289351634615384 | ||
| Total | 15 | 197.25034375 |
I'm not sure that the column headings are gospel -- my colleague might have tweaked them a bit -- but the important thing is that there are three rows: "Regression" (variation attributable to the model as a whole); "Residual" (variation attributable to the residuals or "errors"); and "Total", variation of the dependent variable around its mean. Contrast that with the output from R's anova() function for the same model.
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | ||
|---|---|---|---|---|---|---|
| X1 | 1 | 183.527 | 183.527 | 634.270 | 2.037e-12 | *** |
| X2 | 1 | 9.962 | 9.962 | 34.428 | 5.524e-05 | *** |
| Residuals | 13 | 3.762 | 0.289 |
Note that R splits the explained variation among the individual predictor variables and omits the total variation around the dependent variable mean. (It also omits a whole lot of decimal places in which I would place little faith.)
I search around for an R function that would generate an ANOVA table in something like Excel's format but came up empty. So I wrote my own, which I will share here in case anyone else has the same need. To combine the individual predictor rows into the first row of the Excel version, I just add the degrees of freedom and sum of squares and then recompute the mean square value, F statistic and probability. To create the bottom row of the Excel version, I regress the dependent variable around just a constant term (lm(Y ~ 1)) and then apply the anova() function to that. It might seem a bit roundabout, but it generates an output that is easily grafted onto the rest of the ANOVA table using rbind(). Here's the R code.
#
# Generate an ANOVA table for a model in the format used by Excel.
#
# @param model the fitted model
# @param data the date set used for fitting
#
# @return the formatted ANOVA table
#
excelANOVA <- function(model, data) {
# Get the dependent variable.
dv <- (model |> formula() |> all.vars())[1]
# Regress the d.v. against just a constant term.
model2 <- lm(paste0(dv, " ~ 1"), data = data)
# Run ANOVAs for both models.
a <- anova(model)
b <- anova(model2)
# Get the number of predictors and the row count of the original ANOVA.
r <- nrow(a) # total rows
p <- r - 1 # number of predictors
# Put the degrees of freedom, sum of squares and mean squares for the full model in row p.
a[p, 1] <- p # df
a[p, 2] <- sum(a[1:p, 2]) # ss
a[p, 3] <- a[p, 2]/p # ms
# Drop the rows for individual predictors, leaving just model and residual rows.
a <- a[p:r, ]
# Fix the F statistic and p-value.
a[1, 4] <- a[1, 3] / a[2, 3] # F statistic
a[1, 5] <- 1 - pf(a[1, 4], a[1, 1], a[2, 1]) # p-value
# Add the row for the constant model.
a <- rbind(a, b)
# Remove the MSE from the total (constant model), since Excel does not show that.
a[3, 3] <- NA
# Tweak the row and column labels to suit the boss.
rownames(a) <- c("Regression", "Residual", "Total")
colnames(a) <- c("df", "Sum of Squares", "Mean Square", "F", "Significance F")
# Return the ANOVA table.
return(a)
}Here is the output it produces on the test instance.
| df | Sum of Squares | Mean Square | F | Significance F | |
|---|---|---|---|---|---|
| Regression | 2 | 193.489 | 96.744 | 334.35 | 6.6418e-12 |
| Residual | 13 | 3.762 | 0.289 | ||
| Total | 15 | 197.250 |
The default behavior of lm() is to encode the data in the lm object, so it might seem that the second argument to the function is redundant. I included it in case someone is working with a large dataset and suppresses the embedding of the data in the model object. One could tweak this to make the second argument optional, with the default being to grab the data from the first argument.
No comments:
Post a Comment
Due to intermittent spamming, comments are being moderated. If this is your first time commenting on the blog, please read the Ground Rules for Comments. In particular, if you want to ask an operations research-related question not relevant to this post, consider asking it on Operations Research Stack Exchange.