Nick Brown and I recently published this paper, How statistical challenges and misreadings of the literature combine to produce unreplicable science: An example from psychology, in the journal Advances in Methods and Practices in Psychological Science, about a published article that claimed to find that healing of bruises could be sped or slowed by manipulating people’s subjective sense of time. The article under discussion had major problems–including weak theory, a flawed data analysis, and multiple errors in summarizing the literature–to the extent that we did not find their claims to be supported by their evidence.
That sort of thing happens–we call it everyday, bread-and-butter pseudoscience. The point of the new paper by Nick and me was not so much to shoot down one particular claim of mind-body healing but rather to explore some general reasons for scientific error and overconfidence and to provide a template for future researchers to avoid these problems.
We posted this the other day on the blog (under the heading, 7 steps to junk science that can achieve worldly success), and something came up in comments that I hadn’t thought of before and that I wanted to share with you. The commenter, Jean-Paul, pointed to a misunderstanding that can arise when fitting multilevel models using some software.
Need for multilevel modeling or some equivalent adjustment for clustering in data
Here was the problem. The article in question estimated treatment effects with a cluster design: they had three treatments applied to 33 research participants (in psychology jargon, “subjects”) with data provided by 25 raters. The total number of measurements was 2425 (not quite 3*33*25 because of some missing data, which is not the point in this story).
When you want to estimate treatment effects from a clustered design, you need to fit a multilevel model or make some equivalent adjustment. Otherwise your standard errors will be too small. In this case you should include effects for participants (some have more prominent bruises than others in this study) and for raters (who will have systematic differences in how they characterize the severity of a bruise using a numerical rating).
The researchers knew to do this, so they fit a multilevel model. Fortunately for me, they used the R function lmer (from the lme4 package), which I’m familiar with. Here’s the output, which for clarity I show using the display function in the arm package:
lmer(formula = Healing ~ Condition + (1 | Subject) + (1 | ResponseId),
     data = DFmodel, REML = FALSE)
            coef.est coef.se
(Intercept) 6.20     0.31
Condition28 0.23     0.09
Condition56 1.05     0.09

Error terms:
 Groups     Name        Std.Dev.
 Subject    (Intercept) 1.07
 ResponseId (Intercept) 1.22
 Residual               1.87
---
number of obs: 2425, groups: Subject, 33; ResponseId, 25
AIC = 10131.4, DIC = 10119.4
deviance = 10119.4
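In case you want to reproduce this sort of summary yourself, the fit is just an lmer call followed by display; a minimal sketch, assuming a data frame DFmodel with columns Healing, Condition, Subject, and ResponseId as in the output above (the object name fit_1 is mine, not theirs):

# Varying-intercept fit: adjust for participants and raters.
library(lme4)
library(arm)

fit_1 <- lmer(Healing ~ Condition + (1 | Subject) + (1 | ResponseId),
              data = DFmodel, REML = FALSE)
display(fit_1)  # compact summary: coefficients, group-level sd's, AIC/DIC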
So, fine. As expected, participant and rater effects are large–they vary by more than the magnitudes of the estimated treatment effects!–so it’s a good thing we adjusted for them. And the main effects are a stunning 2.4 and 11.1 standard errors away from zero, a big win!
Need for varying slopes, not just varying intercepts
But . . . wait a minute. The point of this analysis is not just to estimate average responses; it’s to estimate treatment effects. And, for that, we need to allow the treatment effects to vary by participant and rater–in multilevel modeling terms, we should be allowing slopes as well as intercepts to vary. This is in accordance with the well-known general principle of the design and analysis of experiments that the error term for any comparison should be at the level of that comparison.
No problem, lmer can do that, and we do so! Here’s the result:
lmer(formula = Healing ~ Condition + (1 + Condition | Subject) + (1 + Condition | ResponseId),
     data = DFmodel, REML = FALSE)
            coef.est coef.se
(Intercept) 6.18     0.39
Condition28 0.25     0.36
Condition56 1.09     0.37

Error terms:
 Groups     Name        Std.Dev. Corr
 Subject    (Intercept) 1.71
            Condition28 1.99     -0.71
            Condition56 2.03     -0.72  0.65
 ResponseId (Intercept) 1.24
            Condition28 0.07      1.00
            Condition56 0.13     -1.00 -1.00
 Residual               1.51
---
number of obs: 2425, groups: Subject, 33; ResponseId, 25
AIC = 9317.9, DIC = 9285.9
deviance = 9285.9
The estimated average treatment effects have barely changed, but the standard errors are much bigger. The fitted model shows that treatment effects vary a lot by participant, not so much by rater.
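If you want to see where that variation lives, you can pull the estimated group-level effects directly out of the fitted object; a quick sketch, with fit_2 standing for the varying-slopes fit above:

# Partially pooled deviations of each participant's and each rater's
# treatment effects from the average effects.
ranef(fit_2)$Subject     # wide spread in the Condition28 and Condition56 columns
ranef(fit_2)$ResponseId  # slopes near zero: raters barely differ in the effect
VarCorr(fit_2)           # the variance components printed in the display above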
You might be happy because the estimate for one of the treatment comparisons is still about 3 standard errors away from zero, but other problems remain, both with multiple comparisons and with the interpretation of the data, as we discuss in section 2.4 of our paper.
There’s one thing you might notice if you look carefully at the above output, which is that the estimated covariance matrix for the rater effects is degenerate: the correlations are at 1 and -1. This sort of thing happens a lot with multilevel models when data are noisy and the number of groups is small. It’s an issue we discussed in this paper from 2014.
In practice, it’s no big deal here: the degeneracy is a consequence of a very noisy estimate, which itself arises because the noise in the data is much larger than the signal, which itself arises because the variation in treatment effects across raters is indistinguishable from zero (look at those small estimated standard deviations of the varying slopes for rater). But that’s all pretty subtle, not something that’s in any textbook, even ours!
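By the way, lme4 has a function that checks exactly this boundary condition; a quick sketch, again using fit_2 for the varying-slopes fit above:

isSingular(fit_2)  # TRUE when variances are at zero or correlations at +/-1,
                   # as with the rater block in the output above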
That scary warning on the computer
Here’s the problem. When you fit that varying slopes model, R displays a warning in scary red type:
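With recent versions of lme4, that warning is the singular-fit message, which looks something like this:

boundary (singular) fit: see help('isSingular')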
It’s possible that the researchers tried this varying-slope fit, saw the singularity, got scared, and retreated to the simpler varying-intercept model, which had the incidental benefit of giving them an estimated treatment effect that was 11 standard errors from zero.
Just to be clear, I’m not saying this is a problem with R, or with lme4. Indeed, if you type help(“isSingular”) in the R console, you’ll get some reasonable advice, none of which is to throw out all the varying slopes. But the first option suggested there is to “avoid fitting overly complex models in the first place,” which could be misunderstood by users.
If you fit the model using stan_lmer on its default settings, everything works fine:
 family:       gaussian [identity]
 formula:      Healing ~ Condition + (1 + Condition | Subject) + (1 + Condition | ResponseId)
 observations: 2425
------
            Median MAD_SD
(Intercept) 6.22   0.40
Condition28 0.26   0.34
Condition56 1.09   0.37

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.51   0.02

Error terms:
 Groups     Name        Std.Dev. Corr
 Subject    (Intercept) 1.731
            Condition28 2.015    -0.63
            Condition56 2.056    -0.66  0.57
 ResponseId (Intercept) 1.225
            Condition28 0.158     0.33
            Condition56 0.172    -0.51  0.04
 Residual               1.509
Num. levels: Subject 33, ResponseId 25
These estimates average over the posterior distribution, so nothing on the boundary, no problem. And, no surprise, the estimates and standard errors of the treatment effects are basically unchanged.
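For reference, the rstanarm call is essentially a drop-in replacement for the lmer call; a sketch, with fit_bayes being my name for the object, not theirs:

# Bayesian fit using rstanarm's default priors; printing the fitted object
# gives the summary shown above (posterior medians and MAD_SD's).
library(rstanarm)

fit_bayes <- stan_lmer(Healing ~ Condition + (1 + Condition | Subject) +
                         (1 + Condition | ResponseId), data = DFmodel)
print(fit_bayes)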
That said, when you run stan_lmer, it gives warnings in red too!
These warning messages are a big problem! On one hand, yeah, you want to warn people; on the other hand, it would be just horrible if the warnings were so scary that they pushed users toward bad models that don’t happen to spit out warnings.
I also tried fitting the model using blmer, which I thought would work fine, but that produced some warning messages that scared even me:
By default, blme uses a degeneracy-avoiding prior, so this sort of thing shouldn’t happen at all. We should figure out what’s going on here!
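For anyone who wants to dig into it, the blme call mirrors the lmer call, with the priors here left at whatever blme’s defaults are; a sketch (fit_blme is my name for the object):

# blmer adds regularizing priors on top of the lmer likelihood, which is
# supposed to keep the variance estimates away from the boundary.
library(blme)

fit_blme <- blmer(Healing ~ Condition + (1 + Condition | Subject) +
                    (1 + Condition | ResponseId), data = DFmodel, REML = FALSE)
summary(fit_blme)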
Summary
1. If you are estimating treatment effects using clustered data, you should be fitting a multilevel model with varying intercepts and slopes (or do the equivalent adjustment in some other way, for example by bootstrapping according to the cluster structure; see the sketch after this list). Varying intercepts is not enough. If you do it wrong, your standard errors can be way off. This is all consistent with the general principle to use all design information in the analysis.
2. If you apply a more complicated method on the computer and it gives you a warning message, this does not mean that you should go back to a simpler method. More complicated models can be harder to fit. This is something you might have to deal with. If it really bothers you, then be more careful in your data collection; you can design studies that can be analyzed more simply. Or you can do some averaging of your data, which might lose some statistical efficiency but will allow you to use simpler statistical methods (for example, see section 2.3, “Simple paired-comparisons analysis,” of our paper).
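To make the bootstrap option in point 1 concrete, here is a rough sketch of what resampling according to the cluster structure could look like for a design like this one: resample whole subjects with replacement, refit, and use the spread of the estimates as the standard error. For simplicity it ignores the crossed rater structure, and the names follow the lmer calls above, not the authors’ code:

# Cluster bootstrap over subjects, the clusters that carry the treatment comparison.
library(lme4)

cluster_boot <- function(data, n_boot = 200) {
  subjects <- unique(data$Subject)
  ests <- replicate(n_boot, {
    draw <- sample(subjects, replace = TRUE)
    boot_data <- do.call(rbind, lapply(seq_along(draw), function(i) {
      d <- data[data$Subject == draw[i], ]
      d$Subject <- i  # relabel so a subject drawn twice counts as two clusters
      d
    }))
    fit <- lmer(Healing ~ Condition + (1 | Subject) + (1 | ResponseId),
                data = boot_data, REML = FALSE)
    fixef(fit)[c("Condition28", "Condition56")]
  })
  apply(ests, 1, sd)  # bootstrap standard errors for the two treatment comparisons
}

# cluster_boot(DFmodel)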