
Understanding how Anova relates to regression

Analysis of variance (Anova) models are a special case of multilevel regression models, but Anova, the procedure, has something extra: structure on the regression coefficients.

As I put it in the rejoinder for my 2005 discussion paper:

ANOVA is more important than ever because we are fitting models with many parameters, and these parameters can often usefully be structured into batches. The essence of “ANOVA” (as we see it) is to compare the importance of the batches and to provide a framework for efficient estimation of the individual parameters and related summaries such as comparisons and contrasts. . . .

A statistical model is usually taken to be summarized by a likelihood, or a likelihood and a prior distribution, but we go an extra step by noting that the parameters of a model are typically batched, and we take this batching as an essential part of the model. . . .

A key technical contribution of our paper is to disentangle modeling and inferential summaries. A single multilevel model can yield inference for finite-population and superpopulation inferences. . . .

I summarize:

First, if you are already fitting a complicated model, your inferences can be better understood using the structure of that model. Second, if you have a complicated data structure and are trying to set up a model, it can help to use multilevel modeling, not just a simple units-within-groups structure but a more general approach with crossed factors where appropriate. . . .

I’m sharing this with you now because Josh Miller pointed me to this webpage by Jonas Kristoffer Lindeløv entitled “Common statistical tests are linear models (or: how to teach stats).”

Lindeløv’s explanations are good, and I do think it’s useful for students and practitioners to understand that all these statistical procedures are based on the same class of underlying model. He also notes that the Wilcoxon rank test can be formulated approximately as a linear model on ranks, a point that we put in BDA and which I’ve occasionally blogged (see here and here). It’s good to see these ideas being rediscovered: they’re useful enough that they shouldn’t be trapped within a single book and a few old blog entries.
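
To make the "same underlying model" point concrete, here is a minimal sketch in plain Python. The data are made up for illustration, and the helper names (two_sample_t, ols_slope_t, ranks) are hypothetical, not taken from Lindeløv's page. It computes the classical equal-variance two-sample t statistic, then gets the same number as the t statistic on the slope of a regression on a 0/1 group indicator. Running the same regression on ranks gives the approximate rank-test analogue mentioned above.

```python
import math
import statistics


def two_sample_t(a, b):
    """Classical equal-variance two-sample t statistic for mean(b) - mean(a)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(b) - statistics.mean(a)) / se


def ols_slope_t(x, y):
    """Slope and its t statistic from the regression y = b0 + b1 * x + error."""
    n = len(x)
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / se


def ranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


# Made-up measurements for two groups.
group_a = [4.1, 5.0, 3.8, 4.6, 5.3, 4.4]
group_b = [5.2, 6.1, 5.7, 4.9, 6.4, 5.8]
y = group_a + group_b
x = [0] * len(group_a) + [1] * len(group_b)  # indicator for group b

t_classic = two_sample_t(group_a, group_b)
slope, t_regression = ols_slope_t(x, y)
_, t_on_ranks = ols_slope_t(x, ranks(y))

print("t test:             ", round(t_classic, 4))
print("regression on dummy:", round(t_regression, 4))
print("regression on ranks:", round(t_on_ranks, 4))
```

The slope is the difference in group means, and the first two statistics agree exactly. The third is only the linear-model approximation to the rank test, not the exact Wilcoxon procedure.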

The point of my post today is to emphasize that it’s not just what model you fit, it’s also how you summarize it. To put it another way, I think the unification of statistical comparisons is taught to everyone in econometrics 101, and indeed this is a key theme of my book with Jennifer, in that we use regression as an organizing principle for applied statistics. (Just to be clear, I’m not claiming that we discovered this. Quite the opposite. I’m saying that we constructed our book in large part based on the understanding we’d gathered from basic ideas in statistics and econometrics that we felt had not fully been integrated into how this material was taught.)

So, it’s well known that all these models are a special case of regression, and that’s why in a good econometrics class they won’t bother teaching Anova, chi-squared tests, etc., they just do regression. My Anova paper demonstrates how the concept of Anova has value, not just from the model (which is just straightforward multilevel linear regression) but because of the structured way the fits are summarized.

For more, go to my Anova article or, for something quicker, these old blog posts:
Anova for economists
A psychology researcher asks: Is Anova dead?
Anova is great—if you interpret it as a way of structuring a model, not if you focus on F tests.

I think these are important points: the connection between the statistical models, and also the extra understanding that arises from batching and summarizing by batch.
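
As a rough illustration of "summarizing by batch," the sketch below takes a small balanced two-way layout, splits the fit into batches of coefficients (treatment effects, block effects, residuals), and reports one standard deviation per batch. Everything here is an assumption made for illustration: the data are invented, the coefficients are raw least-squares estimates rather than multilevel estimates, and the degrees-of-freedom divisor is one simple choice. It is meant to show the shape of the summary, not to reproduce the method in the Anova paper.

```python
import math
import statistics

# Hypothetical balanced layout: rows are treatments, columns are blocks,
# one measurement per cell. The numbers are made up.
data = [
    [10.0, 12.0, 11.0, 13.0],
    [14.0, 15.0, 13.0, 18.0],
    [9.0, 10.0, 8.0, 11.0],
]
n_rows, n_cols = len(data), len(data[0])

grand_mean = statistics.mean(v for row in data for v in row)

# Batch 1: one coefficient per treatment (row), centered to sum to zero.
row_effects = [statistics.mean(row) - grand_mean for row in data]

# Batch 2: one coefficient per block (column), centered to sum to zero.
col_effects = [
    statistics.mean(data[i][j] for i in range(n_rows)) - grand_mean
    for j in range(n_cols)
]

# Batch 3: what is left over in each cell after the additive fit.
residuals = [
    data[i][j] - grand_mean - row_effects[i] - col_effects[j]
    for i in range(n_rows)
    for j in range(n_cols)
]


def batch_sd(coefs, df):
    """Standard deviation of a batch of zero-centered coefficients."""
    return math.sqrt(sum(c * c for c in coefs) / df)


summary = {
    "treatment": batch_sd(row_effects, n_rows - 1),
    "block": batch_sd(col_effects, n_cols - 1),
    "residual": batch_sd(residuals, (n_rows - 1) * (n_cols - 1)),
}

for name, sd in summary.items():
    print(f"{name:10s} sd = {sd:.2f}")
```

The model is still just a regression with indicator variables. The Anova-style part is the final table, which puts one scale per batch side by side so the batches can be compared.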


  1. Garnett says:

    Are there any thoughts on multi-level models for the variance components?
    We often see variability among subjects/classes/groups in the variability of their responses.

    My usual approach is to hierarchically model the variance components as lognormal, but that doesn’t retain the benefits of half-normal priors on the variances. Is there something like a multi-level half-normal model?

    • Christopher says:

      One approach I’ve played with is based on diagonal component of rstanarm’s decov() priors. If you have K groups, you can put a half normal / half t prior on the average standard deviation (sigma-bar) and use a simplex (phi) with a symmetrical dirichlet distribution to describe how evenly the variance is distributed among the groups.

      sigma_bar ~ normal(0, s); // s is a hyperparameter; sigma_bar > 0, so this is a half-normal prior
      phi ~ dirichlet(rep_vector(a, K)); // a is a hyperparameter; phi is a simplex of length K
      sigma[i] = sigma_bar * sqrt(K * phi[i]); // so that mean(sigma[1:K]^2) == sigma_bar^2

      It has worked well enough the couple of times I’ve used it
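
The construction in Christopher's comment can be checked by simulating from the prior. This is a minimal sketch in plain Python rather than Stan. The function name draw_group_sds and the values chosen for K, s, and a are assumptions for illustration only. The Dirichlet draw is built by normalizing independent gamma variates, which is a standard way to sample a symmetric Dirichlet.

```python
import math
import random


def draw_group_sds(K, s, a, rng):
    """One prior draw of K group standard deviations sharing an average scale."""
    # Half-normal(0, s) draw for the average scale.
    sigma_bar = abs(rng.gauss(0.0, s))
    # Symmetric Dirichlet(a, ..., a): normalize K independent Gamma(a, 1) draws.
    gammas = [rng.gammavariate(a, 1.0) for _ in range(K)]
    total = sum(gammas)
    phi = [g / total for g in gammas]
    # Spread the total variance K * sigma_bar^2 across groups according to phi.
    sigma = [sigma_bar * math.sqrt(K * p) for p in phi]
    return sigma_bar, phi, sigma


rng = random.Random(1)  # fixed seed so the draw is reproducible
sigma_bar, phi, sigma = draw_group_sds(K=5, s=1.0, a=2.0, rng=rng)

mean_variance = sum(v * v for v in sigma) / len(sigma)
print("sigma_bar^2:       ", sigma_bar ** 2)
print("mean of sigma_i^2: ", mean_variance)
print("group sds:         ", [round(v, 3) for v in sigma])
```

Because phi sums to one, the mean of the group variances equals sigma_bar squared by construction. A larger a pulls the group standard deviations toward each other, and a smaller a lets them differ more.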

      • Garnett says:

        Neat idea! I’ll try this out right away.

      • NPope says:

        Great suggestion Christopher — thanks! — and of course weights can be arbitrary (eg depending on covariates via suitable transform, or following any strictly positive distribution) without changing the family of the prior for scale invariant families. This is the same strategy used in nlme, gamlss, etc, for modelling heteroskedasticity, sans the prior…

  2. Jeff Walker says:

    For me, I find it more helpful to think of regression and ANOVA as special cases of linear models (or, okay, generalized linear models), the reason being that “regression” comes with some baggage: “regression” was developed as (and is still often taught as, at least in intro bio stats like classes) models with continuous X, and “ANOVA” was developed as (and often taught as, at least in intro bio stats like classes) models with categorical X. “Linear model” doesn’t come with this baggage or preconception. This is all semantic but perhaps not unimportant for pedagogy. For someone in Biology trained using, say, Sokal and Rohlf’s Biometry or Zar’s Biostatistical Analysis, your book “Data Analysis Using Regression and Multilevel/Hierarchical Models” or Harrell’s book “Regression Modeling Strategies” might not seem relevant, since experimental biologists think of regression as models with (largely) continuous (and generally observational) X.

    • Andrew says:


      The regression and Anova models are a special case of generalized linear models. But Anova is not just a statistical model, it’s also a way of structuring and displaying the model, batching coefficients and comparing their variances.

      You’re raising a different important point, which is that statisticians typically focus on the outcome variable (continuous, binary, count, zero-inflated count, etc.), whereas practitioners often focus on the predictors (discrete, continuous, etc.). So I agree there can be communication difficulties. Perhaps we can clarify this in Regression and Other Stories.

      • Jeff Walker says:

        I revisit your ANOVA paper every year and I think it’s time for a refresher. I have a general repulsion to ANOVA tables (I don’t find them that informative), but I think that’s because most papers that I read use them as summary tables of null hypothesis tests (and admittedly, this is the way I read them because this is how it was taught to me).

        • Chris Wilson says:

          I’m with Jeff. I really like the graphical ANOVA aesthetically, and as a way to summarize multi-level models. It is a communication challenge to invoke the whole framework of ANOVA, but then say “don’t interpret this like the ANOVA you were taught”. I would be curious to see more examples of its use in publications.

  3. One point that I feel is largely neglected in connection with ANOVA is the fact that in psychology and related areas, we first do an ANOVA, and only if we find a significant F-score do we do “post-hoc” comparisons. You don’t need to do that; just test your hypotheses right there and then, using the appropriate contrast coding (which allows structuring of parameters into batches—I think I am using this phrase in the sense Andrew meant it, not sure). People like Thom Baguley and Hays have written about this in the past; it’s not a new thought, it’s just not widely known.

    We’ve written a tutorial paper on this topic, it may be of interest to readers of this blog: Comments are welcome.

    • I think it’s difficult even for the mathematically sophisticated to properly code the appropriate contrasts, and then, when you do, to get the right answer unless the design has exactly the same number of cases in each batch (i.e., it is hard in an unbalanced design). Let’s just give up on this formalism, which is in my opinion a kind of special case of the multilevel model, and move on to coding the model directly in Stan and fitting it with moderately informative priors ;-)

      Also the contrasts need to be orthogonal to each other… but we have many questions that are not geometrically orthogonal.

      • Sure, I agree that we should focus on fitting models directly in Stan. (In fact when people ask me why I don’t teach ANOVA I say, of course I do, look at this multiple regression model that I teach.) Why do the contrasts *have* to be orthogonal to each other?

        • Why do they have to be orthogonal? I’ve never really worked out the math, but my guess is that it has something to do with the way ANOVA works: you are basically seeking the solution to a quadratic minimization problem. You turn that into a matrix algebra problem by taking derivatives and setting them equal to zero. At that point you don’t have a uniquely solvable problem, so you add on some auxiliary assumptions, namely that the sum of the coefficients is zero at each level. Now you can change basis from the coefficients of each predictor to the coefficients of the contrast vectors. This change of basis must be to a new linearly independent complete basis. Technically it doesn’t have to be orthogonal, but if it isn’t then the values of the coefficients are linearly related to each other. The standard errors become non-independent, and interpretation becomes difficult because we don’t have a nice posterior distribution to show how the uncertainty in coefficients is codependent.

          That’s all a guess. But what I do know is that if you sample from the posterior of a proper Bayesian model you automatically can answer any question you want about interdependencies between the variables. The orthogonal linear algebra stuff basically insists on geometrically enforced independence.
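
One piece of this exchange can be checked with simple arithmetic. The sketch below assumes the group means are estimated independently, each with variance sigma^2 / n_g. Under that assumption, the covariance between two contrast estimates is sigma^2 times the sum over groups of c1_g * c2_g / n_g. The function name, contrast vectors, and sample sizes are illustrative choices, not taken from the comments above.

```python
def contrast_covariance(c1, c2, sigma2, n_per_group):
    """Covariance of two contrast estimates built from independent group means.

    Each group mean is assumed to have variance sigma2 / n_g, so
    Cov(c1' ybar, c2' ybar) = sigma2 * sum_g c1_g * c2_g / n_g.
    """
    return sigma2 * sum(a * b / n for a, b, n in zip(c1, c2, n_per_group))


# Three groups; two orthogonal (Helmert-style) contrasts and one pairwise contrast.
group2_vs_group1 = [-1, 1, 0]
group3_vs_avg12 = [-1, -1, 2]
group3_vs_group1 = [-1, 0, 1]  # not orthogonal to group2_vs_group1

balanced = [10, 10, 10]
unbalanced = [5, 10, 20]

cov_orthogonal_balanced = contrast_covariance(
    group2_vs_group1, group3_vs_avg12, 1.0, balanced
)
cov_nonorthogonal_balanced = contrast_covariance(
    group2_vs_group1, group3_vs_group1, 1.0, balanced
)
cov_orthogonal_unbalanced = contrast_covariance(
    group2_vs_group1, group3_vs_avg12, 1.0, unbalanced
)

print("orthogonal, balanced:     ", cov_orthogonal_balanced)
print("non-orthogonal, balanced: ", cov_nonorthogonal_balanced)
print("orthogonal, unbalanced:   ", cov_orthogonal_unbalanced)
```

In the balanced case the orthogonal pair has zero covariance and the non-orthogonal pair does not. With unequal group sizes, even the geometrically orthogonal pair becomes correlated. Nothing stops you from estimating non-orthogonal contrasts. You just need the joint uncertainty, which posterior draws provide directly.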
