Skip to content

Anova is great—if you interpret it as a way of structuring a model, not if you focus on F tests

Shravan Vasishth writes:

I saw on your blog post that you listed aggregation as one of the desirable things to do. Do you agree with the following argument? I want to point out a problem with repeated measures ANOVA in talk:

In a planned experiment, say a 2×2 design, when we do a repeated measures ANOVA, we aggregate all responses by subject for each condition. This actually leads us to underestimate the variability within subjects. The better way is to use linear mixed models (even in balanced designs) because they allow us to stay faithful to the experiment design and to describe how we think the data were generated.

The issue is that in a major recent paper the authors did an ANOVA after they fail to get statistical significance with lmer. Even ignoring the cheating and p-value chasing aspect of it, I think that using ANOVA is statistically problematic for the above reason alone.

My response: Yes, this is consistent with what I say in my 2005 Anova paper, I think. But I consider that sort of hierarchical model to be a (modern version of) Anova. As a side note, classical Anova is kinda weird because it is mostly based on point estimates of variance parameters. But classical textbook examples are typically on the scale of 5×5 datasets, and in these cases the estimated variances are very noisy.


  1. Thanks for bringing this up again. I just wanted to add something to this:

    In BDA3 you also point out (I don’t have the book by me right now, so I can’t say exactly where, probably ch 5) that it’s important to try to fit a full variance-covariance structure, I think this was intended to reflect one’s assumptions about how the data were generated. This sentiment is reflected in a recent paper in the Journal of Memory and Language:

    title={Random effects structure for confirmatory hypothesis testing: Keep it maximal},
    author={Barr, Dale J and Levy, Roger and Scheepers, Christoph and Tily, Harry J},
    journal={Journal of Memory and Language},

    The recommendation to keep it ”maximal” (i.e., to fit a full variance covariance matrix) is along the lines of your recommendation. But this recommendation ignores an important detail: for the data-sets psychologists and psycholinguists use, a maximal model in the frequentist setting (i.e., using lmer or the like) will give you nonsensical estimates of correlation parameters (see my simulation here: These simulations show that under repeated sampling, either you will get +1 or -1 correlations, leading to a degenerate matrix, or convergence failure.

    Bates, the author of lmer, has commented on this online and pointed out that in such situations one should not be fitting maximal models. Actually, in Gelman and Hill 2007, p. 547, you had also pointed out that often one has to back off from fitting such “maximal” models:

    “Don’t get hung up on whether a coefficient “should” vary by group. Just allow it to vary in the model, and then, if the estimated scale of variation is small …, maybe you can ignore it if that would be more convenient. Practical concerns sometimes limit the feasible complexity of a model–for example, we might fit a varying-intercept model first, then allow slopes to vary, then add group-level predictors, and so forth. Generally, however, it is only the difficulties of fitting and, especially, understanding the models that keeps us from adding even more complexity, more varying coefficients, and more interactions.”

    After the Barr et al paper, journals in psycholinguistics are enforcing the “keep it maximal” recommendation, with the result that people either evade the issue by fitting ANOVAs (creating new opportunities for p<0.05 where none would be found if one were to take all variance components into account), or go ahead and publish results from totally nonsensical models with degenerate variance-covariance matrices. All these models are being fit using lmer.

    The solution I currently work with is to fit a full variance-covariance matrix in Stan with weakly informative priors for the correlation matrices, following the recent recommendations by Chung et al:

    title={Weakly informative prior for point estimation of covariance matrices in hierarchical models},
    author={Chung, Yeojin and Gelman, Andrew and Rabe-Hesketh, Sophia and Liu, Jinchen and Dorie, Vincent},
    journal={Manuscript submitted for publication},

    If I don't have enough data, these correlations are going get dragged to zero with a wide spread. If there is a lot of data, and specifically, if there is enough data to estimate the correlations, the likelihood will dominate and give me a non-zero estimate, if that's what's happening in the data. The fact is that I almost never have enough data to estimate all the correlation parameters, say in a 4×4 variance covariance matrix for the variance components.

    Is this a reasonable way to proceed? I am unsure as to whether I should back off from fitting full variance covariance matrices when most of the posterior distributions of the correlation parameters are centered around zero with huge uncertainty (that is what Gelman and Hill's recommendation suggests that one do–but in those days you didn't have Stan).

    • Ben Bolker says:

      Is the 2013 Chung paper the same as this one?

      Chung, Yeojin, Sophia Rabe-Hesketh, Vincent Dorie, Andrew Gelman, and Jingchen Liu. 2013. “A Nondegenerate Penalized Likelihood Estimator for Variance Parameters in Multilevel Models.” Psychometrika, 1–25. Accessed July 30. doi:10.1007/s11336-013-9328-2.

      If so, you can fit these models with blme , which is a thin wrapper around lme4 that imposes
      the suggested weak priors on variance components (and can also be used to impose priors on
      other model parameters, or stronger priors).

      As I recall Doug Bates doesn’t like this approach — I think he’d rather one simplified the
      model until all of the components were identifiable/non-singular *without* adding priors/penalties
      to the model.

      I don’t think I completely understand your example — if there are multiple responses per subject/treatment combination, and you aggregate (and account for possible lack of balance/
      differential weighting), I don’t see what you’re losing other than an estimate of
      among-response, within-individual variation that you’re not necessarily interested in anyway … ?

      • Hi Ben. You wrote:
        “I don’t think I completely understand your example — if there are multiple responses per subject/treatment combination, and you aggregate (and account for possible lack of balance/differential weighting), I don’t see what you’re losing other than an estimate of
        among-response, within-individual variation that you’re not necessarily interested in anyway … ?”

        My question is this: if I have two conditions in an experiment, and multiple responses from each subject for each condition because I have multiple items, then I have two sources of variance, subject and item. So I would fit a model like

        lmer(y ~ x + (1+x|subj) + (1+x|item),data)

        However, if I aggregate all the responses *by item* for each subject, then I have one data point from each subject for each of the two conditions. Now, I can fit a so-called by-subjects repeated measures anova, or equivalently, a paired t-test. What this does is that it removes the item level variability. In the lmer model above, if the item level variability is non-zero, it will influence the standard error estimates of the fixed effects, which is what I am really interested in as a researcher. Do you agree?

        Here is a concrete example:

        1. Paired t-test, aggregating over items:

        > t.test(subset(subj.means,type==”subj-ext”)$rt,
        + subset(subj.means,type==”obj-ext”)$rt,paired=T)

        Paired t-test

        data: subset(subj.means, type == “subj-ext”)$rt and subset(subj.means, type == “obj-ext”)$rt
        t = 2.6301, df = 36, p-value = 0.01248

        2. lmer fit with non-aggregated data (sum contrast coding):

        Fixed effects:
        Estimate Std. Error t value
        (Intercept) 547.33 53.21 10.287
        type1 -59.85 33.74 -1.774

        The conclusions are very different, and that’s because in the t-test I am able to eliminate one source of variance (items) by averaging over them, artificially reducing the variance components. I think that’s one point that Gelman 2005 makes (Andrew will correct me if I am wrong).

        Ben, do you agree with my characterization of the problem with using t-tests vs lmer? It generalizes to ANOVA if we have, say a 2×2 factorial design.

        • I should have added that I can replicate t-test result in part 1 above using lmer:

          Fixed effects:
          Estimate Std. Error t value
          (Intercept) 551.78 33.59 16.43
          type1 -61.60 23.42 -2.63

          The way that I got significance there for type is by averaging over items. If I allow item-level variability to contribute to the standard error of the fixed effect type, I get the lmer fit with the non-aggregated data above (point 2 above). Now type is not significant.

        • Keith O'Rourke says:

          Shavran: Looks like you _might be_ running into Neyman-Scott problems – an interest parameter and nuisance parameters where with sparse information on the nuisance parameters you are better off annihilating the nuisance parameters rather than estimating them.

          What works best is highly dependent on the data type and interest parameter, very different for binary outcomes and I think unresolved for ratios.

          A simple case and its old solution is apparently naively rediscovered by Spanos here

          A classic reference on this topic is Barndorff-Nielsen, O., and Cox, D. R. Inference and asymptotics. Chapman and Hall, London, 1994 and I presented a summary of that material in this talk starting on page 16.

          One of the interesting claims in the Spanos paper is the claim “The fact that the ML method does not yield a consistent estimator of σ^2 should count in its favor not against it! To paraphrase Stigler’s quotation: the ML method ‘warns the modeler that the information in the data has been spread too thinly’.” as I had tried to come up with a general diagnostic in my thesis and was unable to succeed.

          So interesting to me but maybe a distraction to others.

          • Thanks; I’m reading up on this.

            In the Bayesian setting I don’t have to annihilate anything; the prior will dominate if there is sparse data. The problem with annihilating it is that—for statistical inference (is it significant or not?)—it gives you inaccurate standard error estimates because it ignores a source of variation. It’s not that lmer is entirely failing to estimate the variance component, it’s estimate is not going to be too accurate due to the relatively small number of items (15 in this particular case). So you will tend to declare significance where the data do not really allow you to. Of course, one could argue that significance is all nonsense anyway, your posterior estimate of the parameter will not change dramatically upon adding this small variance component. Fine; I am in agreement with that. But if you suspend disbelief for a second and agree that we are really going to go down the significance testing path, it seems to me that it does seem to matter whether one defines a model which reflects all sources of variance, or whether one aggregates over one source of variance and does a t-test or ANOVA. Happy to be corrected.

            • Keith O'Rourke says:

              OK so we are on the same chapter, if not page.

              > it ignores a source of variation

              Is it a relevant source of variation – especially in the context of it being so poorly estimated?
              In any Bayesian analysis, some conditioning is not done(e.g. various collected covariates in an randomised trial).

              > I don’t have to annihilate anything; the prior will dominate if there is sparse data
              If its the _right_ prior and hyper-parameter. On the other hand, in some problems the annihilating loses no information about the interest parameter, the likelihood remains a true likelihood and a prior for just the interest parameter gets you the posterior with no loss of information. As far as I understand you have just side stepped a modelling difficulty. Not sure what various Bayesians who do here and or whether this is referred to as Partial Bayes.

              I should re-read Andrew’s ANOVA paper, I remember not being fully convinced everything should always be cast into sources of variation rather than parameters in a data generating model…

              • Keith O'Rourke says:

                >If its the _right_ prior and hyper-parameter.
                Should have been _right_ prior and hyper-prior.

        • Ben Bolker says:

          OK, I get it. Thanks.

  2. question says:

    I do not mean to hijack this post, so please no one respond, I just want to suggest it as a future topic. Here is a use of a bayesian approach in a top-tier journal:
    Nuno R. Faria et al. The early spread and epidemic ignition of HIV-1 in human populations

    First, just because your model is consistent with the data does not make it the correct one:
    “We show that the HIV-1 group M pandemic ignited in Kinshasa”

    Second, looking quickly at the paper I see a bunch of model results (or at least it is unclear what is data and what is not). What is the best way to visually compare your posterior estimates to the actual data? For example in the caption of figure 4 it says:
    “Dots show regression data points”

  3. […] For more, go to my Anova article or, for something quicker, these old blog posts: – Anova for economists – A psychology researcher asks: Is Anova dead? – Anova is great—if you interpret it as a way of structuring a model, not if you focus on F test…. […]

Leave a Reply