Ed Vul, Christine Harris, Piotr Winkielman, and Harold Pashler wrote an article where:
1. They point out that correlations reported in fMRI brain-imaging studies are commonly overstated, because researchers tend to report only the highest correlations, or only those correlations that exceed some threshold.
2. They suggest that these statistical problems are leading researchers, and the general public, to overstate the connections between social behaviors and specific brain patterns.
After posting on this article, I received a bunch of comments and questions as well as some responses:
This article by Jabbi, Keysers, Singer, and Stephan argues that, because brain imaging researchers adjust their p-values and significance thresholds for multiple comparisons (the thousands of voxels in a brain image), their statistical methods don’t have the problems that Vul et al. claimed.
In this reply to the Jabbi et al. article, Vul argues that adjusting significance levels does not stop the selected correlations themselves from being too high. I found Vul’s argument convincing: multiple-comparisons methods control the rate of false alarms in a setting where the true effects are zero, but I don’t see that as relevant to the imaging setting, where differences are not in fact zero. Lots of things affect blood flow in the brain, and we would never expect the average scans of two different groups of people to be the same.
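Vul’s point about selection is easy to demonstrate with a toy simulation (this is my own sketch, not any published analysis, and all the numbers below are made up): give every voxel the same modest true correlation with a behavioral measure, then report only the strongest observed correlation, as a report-the-best-voxel rule would.

```python
import numpy as np

# Toy simulation: every voxel has the SAME modest true correlation with
# behavior, yet the selected (maximum) sample correlation is far higher.
# n_subjects, n_voxels, true_r are invented numbers, not from any study.
rng = np.random.default_rng(0)
n_subjects, n_voxels, true_r, n_sims = 20, 5000, 0.25, 200

max_r, mean_r = [], []
for _ in range(n_sims):
    behavior = rng.standard_normal(n_subjects)
    noise = rng.standard_normal((n_voxels, n_subjects))
    # each voxel = shared signal + independent noise, unit variance overall
    voxels = true_r * behavior + np.sqrt(1 - true_r**2) * noise
    # Pearson correlation of each voxel with behavior
    b = (behavior - behavior.mean()) / behavior.std()
    v = (voxels - voxels.mean(axis=1, keepdims=True)) / voxels.std(axis=1, keepdims=True)
    r = (v @ b) / n_subjects
    max_r.append(r.max())    # what a "report the strongest voxel" rule selects
    mean_r.append(r.mean())  # an honest summary over all voxels

print(f"true r = {true_r}, mean over all voxels = {np.mean(mean_r):.2f}, "
      f"mean of selected maximum = {np.mean(max_r):.2f}")
```

With only 20 subjects and thousands of voxels, the selected maximum lands far above the true correlation even though no voxel is special, which is exactly why a significance threshold alone can’t fix the reported numbers.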
This article by Lieberman, Berkman, and Wager, who defend social neuroscience and argue the following:
1. They accept Vul et al.’s point 1 above (correlations are overstated) but present some evidence that the correlations aren’t as overstated as Vul et al. might fear.
2. They disagree with the implied claim that the overstated correlations have distorted scientists’ understanding of social neuroscience research.
3. They object to Vul et al.’s focus on social neuroscience, given that the same statistical issues arise in all sorts of brain imaging studies.
4. They point out some specific areas where Vul et al. mischaracterized the data-analytic methods used in this field.
I think Lieberman et al. make some good points, but, as Vul et al. point out, researchers often do use correlations to summarize their results. And even if said correlations survive a multiple-comparisons analysis, readers might take them at face value without understanding the selection issue. So all this shake-out is probably a good thing, especially where correlation estimates are being compared to each other.
First off, I haven’t worked seriously in medical imaging for nearly 20 years and have only one published paper in the area, so my comments are mostly informed by my perspective on general statistical issues, as well as my own experience thinking about estimation of effect sizes in studies with low statistical power.
Regarding the singling-out of social neuroscience, I see Lieberman et al.’s point. One conjecture: it may be harder to get external validation in social neuroscience than in other areas of neuroscience, where there is sometimes a blood measurement or some other physiological marker to check against. I’m not sure about this; it’s just a guess.
It’s hard for me to believe that the approach based on separate analyses of voxels and p-values is really the best way to go. The null hypothesis of zero correlation isn’t so interesting; what’s really of interest is the pattern of where the differences are in the brain.
Related to this point: when trying to understand differences in brain processing between different sorts of people (or between people doing different tasks), the maximum correlation among voxels is ultimately not what you’re looking for. That is why researchers summarize using regions of interest (as on p. 7 of the Lieberman et al. article). Vul et al. were correct to warn about overinterpretation of correlations that have been selected as the maximum: a naive reader can see such correlations (and accompanying scatterplots) and conclude that certain personality traits are more predictable from brain scans than they actually are.
I think the way forward will be to go beyond correlations and the horrible multiple-comparisons framework, which causes so much confusion. Vul et al. and Lieberman et al. both point out that classical multiple-comparisons adjustments do not eliminate the systematic overstatement of correlations. A hierarchical Bayes approach (using some sort of mixture model for the population of voxel differences, ideally modeled hierarchically with voxels grouped within regions of interest) would help here.
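To make the shrinkage idea concrete, here is a minimal empirical-Bayes sketch of one level of such a hierarchy (just a normal model on the Fisher-z scale, not the full mixture-by-region model suggested above; the function `shrink_correlations` and its settings are my own invention):

```python
import numpy as np

def shrink_correlations(r_obs, n_subjects):
    """Partially pool observed voxel correlations toward their common mean,
    working on the Fisher-z scale where sampling noise is roughly normal."""
    z = np.arctanh(r_obs)              # Fisher transform: z_j ~ Normal(theta_j, s2)
    s2 = 1.0 / (n_subjects - 3)        # approximate sampling variance of each z_j
    mu = z.mean()                      # estimated population mean of the theta_j
    tau2 = max(z.var() - s2, 0.0)      # method-of-moments between-voxel variance
    w = tau2 / (tau2 + s2)             # shrinkage weight: 0 = full pooling, 1 = none
    return np.tanh(mu + w * (z - mu))  # posterior-mean z, mapped back to r
```

The most extreme observed correlations get pulled hardest toward the group mean, so the “best voxel” no longer looks spuriously impressive; a full hierarchical model would go further and group voxels within regions of interest, with a mixture for the population of effects.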
And now for some amateur psychologizing (unsupported by any statistical analysis, correlational or otherwise):
I suspect that one of the motivations of Vul et al. in writing their article was frustration at too-good-to-be-true numbers that they felt led to exaggerated claims of neuro super-science.
Conversely, I suspect one of the frustrations of Lieberman et al. is that they are doing a lot more than correlations and fishing expeditions: they’re running experiments to test theories in psychology and trying to synthesize results from many different labs. From that perspective it must be frustrating to see a criticism (featured in the popular press) that is so focused on correlations, which are really the least of their concerns.
It also seems that both sides were irritated by what they saw as giddy press coverage: on one side, claims of dramatic breakthroughs in understanding the biological basis of behavior and personality; on the other, claims of a dramatic Emperor-has-no-clothes debunking. As scientists, most of us welcome press coverage–after all, we think this work is important and we’d like others to know about it–but . . . fawning press coverage of something that we think is wrong–that’s just annoying.
P.S. Wager is a friend–he teaches in the psychology department here–but I don’t think my personal knowledge has hindered my evaluation here.
P.P.S. I ran the above by various people involved and they gave some helpful clarifications. But I’ve probably left in a couple of sloppy statements here and there.