OR in an OB World: Excel ANOVA Format in R

Thursday, July 2, 2026

Excel ANOVA Format in R

A while back, I wrote a Shiny application for a colleague who teaches basic linear regression in a few courses. She recently asked me to add an ANOVA table. I thought the change would be trivial -- R has at least two functions in the stats package, anova() and aov(), for generating ANOVA output -- but there was a catch. Because the students start in Excel before switching to the Shiny app, she wanted me to mimic Excel's output.

The following is Excel's version of an ANOVA table for a small model with independent variables X1 and X2 and dependent variable Y.

	df	Sum of Squares	Mean Square	F	Significance F
Regression	2	193.4887725	96.74438625	334.348849898829	6.64178639346989E-12
Residual	13	3.76157124999999	0.289351634615384
Total	15	197.25034375

I'm not sure that the column headings are gospel -- my colleague might have tweaked them a bit -- but the important thing is that there are three rows: "Regression" (variation attributable to the model as a whole); "Residual" (variation attributable to the residuals or "errors"); and "Total", variation of the dependent variable around its mean. Contrast that with the output from R's anova() function for the same model.

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
X1	1	183.527	183.527	634.270	2.037e-12	***
X2	1	9.962	9.962	34.428	5.524e-05	***
Residuals	13	3.762	0.289

Note that R splits the explained variation among the individual predictor variables and omits the total variation around the dependent variable mean. (It also omits a whole lot of decimal places in which I would place little faith.)

I search around for an R function that would generate an ANOVA table in something like Excel's format but came up empty. So I wrote my own, which I will share here in case anyone else has the same need. To combine the individual predictor rows into the first row of the Excel version, I just add the degrees of freedom and sum of squares and then recompute the mean square value, F statistic and probability. To create the bottom row of the Excel version, I regress the dependent variable around just a constant term (lm(Y ~ 1)) and then apply the anova() function to that. It might seem a bit roundabout, but it generates an output that is easily grafted onto the rest of the ANOVA table using rbind(). Here's the R code.

#
# Generate an ANOVA table for a model in the format used by Excel.
#
# @param model the fitted model
# @param data the date set used for fitting
#
# @return the formatted ANOVA table
#
excelANOVA <- function(model, data) {
  # Get the dependent variable.
  dv <- (model |> formula() |> all.vars())[1]
  # Regress the d.v. against just a constant term.
  model2 <- lm(paste0(dv, " ~ 1"), data = data)
  # Run ANOVAs for both models.
  a <- anova(model)
  b <- anova(model2)
  # Get the number of predictors and the row count of the original ANOVA.
  r <- nrow(a) # total rows
  p <- r - 1   # number of predictors
  # Put the degrees of freedom, sum of squares and mean squares for the full model in row p.
  a[p, 1] <- p                   # df
  a[p, 2] <- sum(a[1:p, 2])      # ss
  a[p, 3] <- a[p, 2]/p           # ms
  # Drop the rows for individual predictors, leaving just model and residual rows.
  a <- a[p:r, ]
  # Fix the F statistic and p-value.
  a[1, 4] <- a[1, 3] / a[2, 3]                  # F statistic
  a[1, 5] <- 1 - pf(a[1, 4], a[1, 1], a[2, 1])  # p-value
  # Add the row for the constant model.
  a <- rbind(a, b)
  # Remove the MSE from the total (constant model), since Excel does not show that.
  a[3, 3] <- NA
  # Tweak the row and column labels to suit the boss.
  rownames(a) <- c("Regression", "Residual", "Total")
  colnames(a) <- c("df", "Sum of Squares", "Mean Square", "F", "Significance F")
  # Return the ANOVA table.
  return(a)
}

Here is the output it produces on the test instance.

	df	Sum of Squares	Mean Square	F	Significance F
Regression	2	193.489	96.744	334.35	6.6418e-12
Residual	13	3.762	0.289
Total	15	197.250

The default behavior of lm() is to encode the data in the lm object, so it might seem that the second argument to the function is redundant. I included it in case someone is working with a large dataset and suppresses the embedding of the data in the model object. One could tweak this to make the second argument optional, with the default being to grab the data from the first argument.

OR in an OB World

Thursday, July 2, 2026

Excel ANOVA Format in R

No comments:

Post a Comment

Previous Posts

Labels