A psychology researcher asks: Is Anova dead?

A research psychologist writes in with a question that’s so long that I’ll put my answer first, then put the question itself below the fold.

Here’s my reply:

As I wrote in my Anova paper and in my book with Jennifer Hill, I do think that multilevel models can completely replace Anova. At the same time, I think the central idea of Anova should persist in our understanding of these models. To me the central idea of Anova is not F-tests or p-values or sums of squares, but rather the idea of predicting an outcome based on factors with discrete levels, and understanding these factors using variance components.

The continuous or categorical response thing doesn’t really matter so much to me. I have no problem using a normal linear model for continuous outcomes (perhaps suitably transformed) and a logistic model for binary outcomes.

I don’t want to throw away interactions just because they’re not statistically significant. I’d rather partially pool them toward zero using an informative prior. Or, in the short term, set interactions to 0 if they help you understand the model, and use statistical significance as a guideline if you’d like, but in concert with your substantive goals. If a certain interaction is something you’re just including to correct for potential imbalance between groups, and it’s not statistically significant, maybe you can toss it. But if it’s central to your understanding, keep it in, while recognizing that you will have a lot of uncertainty in coefficients and comparisons that arise from that interaction of factors.
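For example, here is a minimal sketch (in R, with simulated data and made-up effect sizes) of what partially pooling an interaction toward zero can look like, using lme4 to treat the interaction cells as a variance component:

library(lme4)
set.seed(1)
## two crossed factors, 30 binary observations per cell (all numbers made up)
d <- expand.grid(f1 = factor(1:4), f2 = factor(1:4), rep = 1:30)
d$y <- rbinom(nrow(d), 1, plogis(0.4*(d$f1 == "1") - 0.4*(d$f2 == "2")))
## main effects kept unpooled; the 16 interaction cells partially pooled toward 0
m <- glmer(y ~ f1 + f2 + (1 | f1:f2), data = d, family = binomial)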

Regarding your conceptual point, yes yes yes yes yes I agree that you should use those continuous variables, don’t chop them up as binary, that would just throw away info. If you _must_ discretize a variable, please break it into 3 categories and compare high to low (see my paper with David Park on splitting a predictor at the upper quarter or third and the lower quarter or third).
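For instance, a quick sketch of that discretization (simulated data; the slope of 0.5 is arbitrary):

set.seed(1)
x <- rnorm(200)
y <- 0.5*x + rnorm(200)
## cut x at its tertiles into low / mid / high
x3 <- cut(x, quantile(x, c(0, 1/3, 2/3, 1)),
          labels = c("low", "mid", "high"), include.lowest = TRUE)
## compare the upper third to the lower third
mean(y[x3 == "high"]) - mean(y[x3 == "low"])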

And now here’s the question:

Recently, there has been a shift in the field away from ANOVA to the use of mixed effects logit models. It was primarily based on the advice in this paper: Jeager, T. F. (2008) Categorical Data Analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models. Journal of Memory and Language, 59: 434–446. (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2613284/)

It is becoming the gold standard technique in the field, but I [the psychologist who’s asking this question] am having some issues understanding it. Learning to program it is relatively easy; learning how to use it appropriately, and especially how to interpret logit models, is much harder. And I have overheard too many discussions about interactions amongst my poli sci and economist friends, especially in logit models, not to be somewhat sceptical of the advice in said paper. So I took to reading, and have ended up more confused than when I started. Basically, given some of the issues, I am not sure that it is worth the switch. But my priors lead me to believe that the author of the paper knows much more about stats than I do, and, given that, that I’m confused. Hopefully this will be intriguing enough to you to respond.

Here’s what I’m having trouble with.
The main impetus for the shift away from ANOVA to logit is two-fold: 1) arguing that we actually have categorical response data, and 2) a demonstration of a spurious interaction effect in ANOVA – as in, it’s significant in ANOVA (even using transformed data) but not in the logit model. I will deal with the latter first, since it is a statistical issue, whereas I see the former as conceptual, and I can see arguments for both sides.

As far as I can tell, the interpretation of interactions in logit is very tricky. This point is made by Golder and colleagues (https://files.nyu.edu/mrg217/public/presentation_interaction.pdf), among others. Moreover, it depends on the type of variables considered in the interaction, i.e., continuous vs. categorical (http://www.ats.ucla.edu/stat/stata/seminars/interaction_sem/interaction_sem.htm). Given all the complications, I am loath to throw away a result because it was not significant in a logit model. Basically, logit results are being treated as ANOVA results (just look at the p-value and you know all you need to know), as if the switch got rid of a problem and came with bonus information (effect sizes, in this case odds ratios, although they are not reported in the papers I’ve read using logit). But according to Golder and colleagues, “the coefficient and standard error on the interaction term does not tell us the direction, magnitude, or significance of the ‘interaction effect’”. Or some folks at UCLA (http://www.ats.ucla.edu/stat/stata/seminars/interaction_sem/interaction_sem.htm): “Just because the interaction term is significant in the log odds model, it doesn’t mean that the probability difference in differences will be significant for values of the covariate of interest. Paradoxically, even if the interaction term is not significant in the log odds model, the probability difference in differences may be significant for some values of the covariate.” Or Berry, DeMeritt, and Esarey (2010, AJPS vol. 54). Or Ai and Norton (2003): “In probit or logistic regressions, one cannot base statistical inferences based on simply looking at the coefficient and statistical significance of the interaction terms.” So reading off the p-values for an interaction term is not a straightforward matter, or should I say, using them to directly reject the hypothesis that there is an interaction is not the same as in an ANOVA.
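A quick R illustration of the scale issue these quotes describe (simulated data with, by construction, no interaction at all on the logit scale, yet a nonzero difference in differences on the probability scale):

set.seed(1)
n <- 1000
g1 <- rbinom(n, 1, 0.5)
g2 <- rbinom(n, 1, 0.5)
p <- plogis(-1 + 1.5*g1 + 1.5*g2)   # no g1:g2 term on the logit scale
y <- rbinom(n, 1, p)

fit <- glm(y ~ g1*g2, family = binomial)
summary(fit)$coefficients["g1:g2", ]   # interaction coefficient: near 0

## difference in differences of predicted probabilities: clearly not 0
pr <- function(a, b) predict(fit, data.frame(g1 = a, g2 = b), type = "response")
(pr(1, 1) - pr(0, 1)) - (pr(1, 0) - pr(0, 0))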

This is not to say that using untransformed data is unproblematic. But there are other transformations that can, I think, deal with the problem raised, as long as one is willing to think about the data as more or less continuous rather than binary. (I am not very sure about this, however; hence the “I think”: http://oak.ucc.nau.edu/rh232/courses/EPS625/Handouts/Data%20Transformation%20Handout.pdf)

This brings me to my more conceptual point. Often, we are interested in an underlying variable that is not binary; rather, it varies along some dimension, e.g., strength of a memory, effectiveness of learning given different study or teaching methods, things of that sort. And by probing a person’s knowledge multiple times, we hope to have an estimate of that underlying variable. This same point is made by Long (1997). So if we have an estimate that is closer to being continuous (it will always be somewhat constrained by the number of times we ask people what is essentially the same question), doesn’t it make sense to use it? Often, researchers are restricted to binary response variables as measures, and that makes their lives more complicated. So if we have data more closely related to what we are interested in (i.e., the person’s overall performance), why not use it? We do not care about particular responses, we care about overall patterns in responses, and we have those. Example: I want to know if one learning method is better than another. I have two groups of people learn under different conditions. I test their eventual knowledge. I now have their responses on individual test items, and their overall performance. Since I care about their overall performance, why would I use an approximation (or, put differently, a single sample of their performance) to test whether learning methods affect overall performance? Moreover, it avoids building in the random variation in an individual’s performance on particular items. One might argue that it does not account for the consistency in a person’s performance over items, but that seems to be misguided. (And if you have a repeated measures design where the same person is included in different conditions, then you will have some estimate of that part of the total variance.) However, I can see reasons for doing both. E.g., if I were interested in whether increased studying led to better performance on a test, I would want to use overall performance. If I were instead interested in whether my individual questions were fair (meaning, they each reflected the relationship between studying and better performance), then I would most definitely want to include data from each question individually. But that’s a different question.
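To make the contrast concrete, here is a minimal sketch (simulated data, arbitrary effect sizes) of the two analyses described above: overall performance per person versus item-level binary responses:

library(lme4)
set.seed(2)
n_subj <- 40; n_item <- 25
method  <- rep(0:1, each = n_subj/2)              # between-subjects factor
ability <- rnorm(n_subj, 0, 0.8)                  # stable person-level variation
p_corr  <- plogis(0.2 + 0.6*method + ability)
correct <- matrix(rbinom(n_subj*n_item, 1, rep(p_corr, n_item)), nrow = n_subj)

## (a) overall performance: one proportion correct per person
t.test(rowMeans(correct) ~ method)

## (b) item-level responses: mixed-effects logit with a subject intercept
d <- data.frame(y = as.vector(correct),
                method = rep(method, n_item),
                subject = factor(rep(1:n_subj, n_item)))
summary(glmer(y ~ method + (1 | subject), data = d, family = binomial))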

The points just mentioned are about whether the switch (to logit) is really necessary (clearly, if the data have to be considered as binary, then linear regression isn’t appropriate). The following points are about implementation and recommendations as I understand them. If I’m going to have to use logit, I want to do it right.
Fixed or random effects. I was both heartened and disheartened to see your posting on random vs. fixed effects. I have been trying to figure out why things like participant are being treated as random effects, and can’t: there is no discussion in my field (as far as I can tell, anyway) about whether predictor variables should be treated as fixed or random, although I see that there are ways of deciding based on the nature of the error (rather than any a priori assumptions). I ask because it seems as if the decision is not trivial. My understanding of the difference (from the perspective of assumptions) is that random effects are more efficient but biased, and that in other disciplines the choice of a random effects model would have to be tested and justified. Moreover, they fail to account for what they are supposed to account for: error that is consistently attributed to an individual and is associated with that person in each and every measurement taken from that person. So while they make the model more efficient, they are also less conservative. They also have different interpretations (http://www.cscu.cornell.edu/news/statnews/stnews76.pdf). Given that we psychologists are typically interested in the effect in general, not simply with respect to the people we are testing, according to the site (sic) just referenced it would seem that individuals should be treated as fixed effects for logical reasons as well. The push to use mixed effects models has been predicated on ‘the fact that ordinary logit models provide no direct way to model random subject and item effects’. But if the error attributed consistently to individuals can be handled as fixed effects, and this is a less biased model, then it would be preferred. Or am I missing something?
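For concreteness, the two choices look like this in R (lme4 for the random-effects version; simulated data, all numbers arbitrary):

library(lme4)
set.seed(5)
n_subj <- 30; n_trial <- 20
subject <- factor(rep(1:n_subj, each = n_trial))
cond    <- rep(0:1, length.out = n_subj*n_trial)  # within-subject condition
u       <- rnorm(n_subj, 0, 1)                    # person-specific intercepts
y       <- rbinom(n_subj*n_trial, 1, plogis(-0.5 + 0.8*cond + u[subject]))

## subjects as random effects: intercepts partially pooled toward a common mean
m_ranef <- glmer(y ~ cond + (1 | subject), family = binomial)

## subjects as fixed effects: one unpooled dummy per subject
m_fixef <- glm(y ~ cond + subject, family = binomial)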

Multicollinearity: One of two approaches to multicollinearity is being advocated (more elsewhere than in the article): 1) drop a variable, or 2) center the variables. The first approach seems to me somewhat suspect, as it introduces bias into the model. (I guess one could argue that if you had never thought about that variable and never included it, the same bias would exist. But since we are supposedly testing our hypotheses about how something works (your theory is built into the design of the experiment; why include an IV if you don’t think it does something?), not including a variable in your model because of multicollinearity seems like knowingly (even if not intentionally) introducing bias that could make your other predictor variables seem more important than they really are, thereby artificially inflating the evidence for other parts of your theory of the phenomenon. But given what I just talked about, random vs. fixed effects, bias doesn’t seem to be too much of a concern…) My reluctance seems to be supported by Kennedy (1998). The second solution is argued against by Golder and colleagues (https://files.nyu.edu/mrg217/public/presentation_interaction.pdf) as a) trying to get rid of part of the error that is seen as an overestimation but in reality is not, and b) not doing what people think, because it doesn’t add any additional information (which is the real solution).

I should be clear that all of this is not to say that I think ANOVA is perfect. (I have seen, but I will admit not yet carefully read, your paper on ANOVA; it seems to be more about complicated designs than I typically use, so I may be fine with the standard version. But even though I use SPSS, I can program in it (I learned to use it back in the SPSS for DOS days), so using an improved ANOVA model is something I could do with some work.) But I know how to interpret things in ANOVA. Unlike what seems to be the case for practitioners of regression (from what I gleaned from a presentation and paper by Golder and colleagues), I was taught to be careful interpreting main effects given a significant interaction in an ANOVA. Basically, I was taught not to interpret them, but instead to analyse the effect of one IV at each level of the other separately. So moving to something that is difficult to interpret causes me some trepidation. Regression clearly has some benefits, in particular coefficients, but I am unconvinced that logit is the way to go. But if not logit (or something like it), then people like me will run into serious issues with small ns: many of our experiments just don’t have enough participants to include many variables in a linear regression (which I can also interpret) with one or two observations per person (given repeated measures designs, we can have, e.g., two different percentages from the same person, but usually not too many more than that).

(Not to mention that adding and removing variables from an analysis based on results is something that drives me batty. As an experimentalist, my view is that you have a theory about how things work, that theory drives your experimental design, and you should include all the variables from your experiment in your statistical model. Even if something does not explain a significant amount of the variation, leaving it out will produce biased estimates of the contributions of the other variables. As experimentalists we are not in the business of finding the model that best fits our data; we are in the business of testing our theories. That can be done just as well, possibly better, with regression techniques as compared to ANOVA, but the temptation to fiddle with models and see what’s best is contrary to the logic of running a well-controlled experiment. But that’s a bit of an aside. Not a problem with the technique, just with how it seems to be used.)

18 thoughts on “A psychology researcher asks: Is Anova dead?”

  1. Hey Andrew, on breaking up continuous variables: this is separate from the route of quantile regression. What are your thoughts on those? They seem to be a good indicator of the distribution of responses and the heterogeneity of responses within the distribution. Will we see some replacement of the Average Treatment Effect with these types of regressions? Or do you feel that the ATE trumps the quantile effects permanently?

  2. What does centering (if s/he means “subtract out the mean”) have to do with multicollinearity? Centering doesn’t affect multicollinearity at all! Obviously it doesn’t… if X and Y are (linear, Pearson product-moment) correlated, and you then subtract a constant from X and/or from Y, they’re still just as correlated as they used to be. Some saner approaches to multicollinearity are residualization, summing together related variables, and using orthogonal predictors. I’m no expert, so surely Andrew can suggest more…

    A small correction: that paper is by Jaeger, not Jeager (Jäger is “hunter” [also an elite soldier] in German, and, as is common with German proper names, the umlauted vowel can be written as the plain vowel plus e).

    • Centering is indeed helpful for reducing the correlation between predictors and their interactions (see, e.g., Cohen, Cohen, West, and Aiken, or Aiken and West). Of course it does not change the correlations between the predictors themselves, only between the predictors and the higher-order terms (e.g., interactions or quadratic terms).
      To wit:

      library(car)  # provides vif(), used below

      x1 <- rnorm(100,mean=5,sd=1)
      x2 <- rnorm(100,mean=8,sd=1.5)
      int <- x1*x2
      y <- .3*x1 + .3*x2 + .05*int + rnorm(100,0,1)

      cx1 <- scale(x1,center=TRUE)
      cx2 <- scale(x2,center=TRUE)
      cint <- cx1*cx2

      df <- data.frame(cbind(x1,x2,int,cx1,cx2,cint,y))
      names(df) <- c("x1", "x2", "int", "cx1", "cx2", "cint", "y")
      cor(df)

      fit <- lm(y~x1+x2+int)
      cfit <- lm(y~cx1+cx2+cint)

      vif(fit)
      vif(cfit)

      Both the correlations between predictor terms and their interactions and the variance inflation factors in the regression drop dramatically when using centered variables. The true advantage of centering, however, lies in the ease of interpretation of the regression coefficients.

        • Centering can reduce the VIF, but it does so at the expense of variance in the interaction term. This exactly offsets any gain made by reducing multicollinearity. While the coefficients and standard errors of the lower-order terms change, it is only because they are now estimating different quantities. The interaction term is estimating the same quantity, and its coefficient and standard error are unchanged. If I can reduce the VIF but pick up no increased certainty because I’ve lost variance, what have I gained by centering (aside from easier interpretation and computation)?

        • > n <- 100
          > e1 <- rnorm(n)
          > x1 <- rnorm(n, 5, 1)
          > x2 <- rnorm(n, 5, 1)
          > x1.c <- x1 - mean(x1)
          > x2.c <- x2 - mean(x2)
          > cor(x1, x1*x2)
          [1] 0.8114442
          > cor(x1.c, x1.c*x2.c)
          [1] -0.3284119
          >
          > library(arm)
          > y <- x1 + x2 + x1*x2 + e1
          > m1 <- lm(y ~ x1 * x2)
          > m2 <- lm(y ~ x1.c * x2.c)
          > display(m1)
          lm(formula = y ~ x1 * x2)
          coef.est coef.se
          (Intercept) 0.66 1.19
          x1 0.88 0.24
          x2 0.87 0.28
          x1:x2 1.03 0.05

          n = 100, k = 4
          residual sd = 1.19, R-Squared = 0.99
          > display(m2)
          lm(formula = y ~ x1.c * x2.c)
          coef.est coef.se
          (Intercept) 35.03 0.13
          x1.c 5.93 0.10
          x2.c 6.10 0.09
          x1.c:x2.c 1.03 0.05

          n = 100, k = 4
          residual sd = 1.19, R-Squared = 0.99
          >

          Despite the interaction estimate and its uncertainty not changing at all between the centered and uncentered models, multicollinearity is an extreme problem in the first model and not at all a problem in the second, according to the VIF. This is because centering reduces covariance (and therefore variance in the interaction term). The two exactly offset. It just so happens that multicollinearity was given a name and deemed a problem, while most people never think about how to increase variance.

          > library(HH)
          > vif(m1)
          x1 x2 x1:x2
          8.832663 12.629053 28.654592
          > vif(m2)
          x1.c x2.c x1.c:x2.c
          1.423671 1.291683 1.122040

        • I agree that the uncertainty is not decreased through centering (and of course not increased either) and that the massive reduction in VIF is in a sense an artifact. However, the centered model is usually much easier to interpret, and its coefficients are closer to meaningful parameter estimates that may be of substantive interest to the applied researcher. Centering around values other than the mean (cf. the Johnson-Neyman procedure) can then yield other conditional effects of interest.
          So in essence centering doesn’t hurt you in terms of uncertainty, but it makes interpretation easier.
          Your earlier post seems to suggest that you think centering is actually harmful because it reduces variance in the interaction term; or am I misreading you?

        • To be clear, I am not at all saying that centering is harmful. It might be helpful for interpretation and usually improves computation. However, it does not reduce the uncertainty of estimates.

    • Sorry, I mistyped and z-standardized instead of centering…

      cx1 <- scale(x1,center=TRUE, scale=FALSE)
      cx2 <- scale(x2,center=TRUE, scale=FALSE)
      cint <- cx1*cx2

      Conclusions regarding standard error and VIF still hold.

  3. Re “losing interactions”: Surely a significant interaction in an ANOVA that disappears in the logit model (given a truly binary DV) is merely an artefact of mis-specification.

    This may be putting too much weight on the assumption that the logit specification is correct, but at least it cannot be less correct than the linear specification.

  4. Andrew… that’s a lot of questions from that psychologist. I hope you encouraged him to post his letter in a good open forum where statistics people can read it. While your blog gets lots of activity, I think it would be great if it were put on a stats discussion forum.

    Kyle, I know it seems like centring shouldn’t remove multicollinearity, and it doesn’t in an additive model, but keep in mind that when there’s an interaction it’s not just the two linear effects. Try it yourself: generate two normal random variables, with substantially different means and variances, that are highly correlated. Then generate a third that is the product of the first two. The correlation between either of the initial variables and the third should be quite high. Now create a new variable that is the product of the first two centred (by subtracting the means). The correlation between that new variable and either of the initial ones is much lower (probably close to 0).
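    For instance (a quick sketch; the particular numbers are arbitrary):

    library(MASS)  # for mvrnorm()
    set.seed(3)
    ## two correlated normals with different means and variances (cor = 0.8)
    xs <- mvrnorm(1e4, mu = c(10, 50), Sigma = matrix(c(4, 8, 8, 25), 2, 2))
    x1 <- xs[, 1]; x2 <- xs[, 2]
    cor(x1, x1*x2)                    # quite high: the raw product tracks x1
    x1c <- x1 - mean(x1); x2c <- x2 - mean(x2)
    cor(x1, x1c*x2c)                  # much lower, close to 0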

    • Most “solutions” to the “problem” of multicollinearity work by getting rid of covariance between explanatory variables. This consequently reduces the variance of explanatory variables, which makes estimates less precise. The two effects exactly offset. It just so happens that most people see substantial covariation as a problem but don’t mind having little variation in their variables.

Comments are closed.