More on the voodoo correlations in neuroscience

Ed Vul, Christine Harris, Piotr Winkielman, and Harold Pashler wrote an article where:

1. They point out that correlations reported in fMRI studies are commonly overstated because researchers tend to report only the highest correlations, or only those correlations that exceed some threshold.

2. They suggest that these statistical problems are leading researchers, and the general public, to overstate the connections between social behaviors and specific brain patterns.

After posting on this article, I received a bunch of comments and questions as well as some responses:

This article by Jabbi, Keysers, Singer, and Stephan argues that, because brain imaging researchers adjust their p-values and significance thresholds for multiple comparisons (the thousands of voxels in a brain image), their statistical methods don’t have the problems that Vul et al. claimed.

This reply by Vul to the Jabbi et al. article, in which Vul argues that adjusting significance levels does not stop the selected correlations themselves from being too high. I found Vul’s argument here convincing. Multiple-comparisons methods control the rate of false alarms in a setting where true effects are zero, but I don’t see that as relevant to the imaging setting, where differences are not in fact zero. Lots of things affect blood flow in the brain, and we would never expect the average scans of two different groups of people to be the same.
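Vul’s point is easy to check with a quick simulation: even with a Bonferroni-style correction of the significance threshold, the correlations that clear the threshold are biased upward, simply because only the lucky-high sample correlations get reported. The numbers below (20 subjects, 1000 voxels, a true correlation of 0.5 at every voxel) are made up purely for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n_subjects, n_voxels, rho_true = 20, 1000, 0.5

# Bonferroni-corrected threshold on r, via the Fisher-z normal approximation
alpha = 0.05 / n_voxels
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
r_crit = np.tanh(z_alpha / np.sqrt(n_subjects - 3))

# Every voxel has the same true correlation rho_true with the behavioral score
behavior = rng.standard_normal(n_subjects)
noise = rng.standard_normal((n_voxels, n_subjects))
signal = rho_true * behavior + np.sqrt(1 - rho_true**2) * noise

# Sample correlation of each voxel's signal with the behavioral score
b = (behavior - behavior.mean()) / behavior.std()
s = (signal - signal.mean(axis=1, keepdims=True)) / signal.std(axis=1, keepdims=True)
r = s @ b / n_subjects

# Only correlations exceeding the corrected threshold get reported
selected = r[r > r_crit]
print(f"threshold r = {r_crit:.2f}")
print(f"mean reported r = {selected.mean():.2f} vs true rho = {rho_true}")
```

The multiple-comparisons correction makes the false-alarm rate tiny, but the reported correlations are still far above the true value of 0.5, because the selection step only keeps sample correlations above the (high) threshold.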

This article by Lieberman, Berkman, and Wager, who defend social neuroscience and argue the following:

1. They accept Vul et al.’s point 1 above (correlations are overstated) but present some evidence that the correlations aren’t as overstated as Vul et al. might fear.

2. They disagree with the implied claim that the overstated correlations have distorted scientists’ understanding of social neuroscience research.

3. They object to Vul et al.’s focus on social neuroscience, given that the same statistical issues arise in all sorts of brain imaging studies.

4. They point out some specific areas where Vul et al. mischaracterized the data-analytic methods used in this field.

I think Lieberman et al. make some good points, but, as Vul et al. point out, researchers often do use correlations to summarize their results. And, even if these correlations survived a multiple-comparisons analysis, readers might take them at face value without understanding the selection issue. So all this shake-out is probably a good thing, especially where correlation estimates are being compared to each other.

My thoughts

First off, I haven’t worked seriously in medical imaging for nearly 20 years and have only one published paper in the area, so my comments are mostly informed by my perspective on general statistical issues, as well as my own experience thinking about estimation of effect sizes in studies with low statistical power.

Regarding the singling-out of social neuroscience, I see the point of Lieberman et al. One possible reason for the focus: in social neuroscience it’s perhaps more difficult to get external validation than in other areas of neuroscience, where there is some measurement in the blood or whatever that can be taken. I’m not sure about this; it’s just a conjecture.

It’s hard for me to believe that the approach based on separate analyses of voxels and p-values is really the best way to go. The null hypothesis of zero correlations isn’t so interesting. What’s really of interest is the pattern of where the differences are in the brain.

Related to this point: when trying to understand differences in brain processing between different sorts of people (or between people doing different tasks), the maximum correlation among voxels is ultimately not what you’re looking for. That is why researchers summarize using regions of interest (as in p. 7 of the Lieberman et al. article). Vul et al. were correct to warn about overinterpretation of correlations that have been selected as the maximum: the naive reader can look at such correlations (and accompanying scatterplots) and conclude that certain personality traits are more predictable from brain scans than they actually are.

I think the way forward will be to go beyond correlations and the horrible multiple-comparisons framework, which causes so much confusion. Vul et al. and Lieberman et al. both point out that classical multiple-comparisons adjustments do not eliminate the systematic overstatement of correlations. A hierarchical Bayes approach (using some sort of mixture for the population of voxel differences, ideally modeled hierarchically with voxels grouped within regions of interest) would help here.
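To give a flavor of the shrinkage idea: the sketch below is a flat empirical-Bayes version on the Fisher-z scale, with made-up numbers and no region-of-interest grouping, so it is a simplified stand-in for the full hierarchical model. Each voxel’s estimate gets pulled toward the population mean in proportion to how noisy it is, which counteracts the overstatement of the extreme voxels.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_voxels = 20, 500
s2 = 1.0 / (n_subjects - 3)  # sampling variance of a Fisher-z correlation

# True voxel effects drawn from a population; observed z-values add noise
theta = rng.normal(0.2, 0.15, n_voxels)
z_obs = theta + rng.normal(0.0, np.sqrt(s2), n_voxels)

# Empirical-Bayes (normal-normal) shrinkage: estimate the population mean
# and variance by method of moments, then pull each z toward the mean
mu_hat = z_obs.mean()
tau2_hat = max(z_obs.var() - s2, 0.0)
shrink = tau2_hat / (tau2_hat + s2)  # between 0 and 1
z_post = mu_hat + shrink * (z_obs - mu_hat)

# The raw maximum is selected partly for its noise; shrinkage pulls it back
i_max = z_obs.argmax()
print(f"raw max z = {z_obs[i_max]:.2f}, shrunken = {z_post[i_max]:.2f}, "
      f"true = {theta[i_max]:.2f}")
```

The full model I have in mind would replace the single normal population with a mixture and add a level for regions of interest, but even this crude version shows the key property: the largest observed correlations, which are the ones most inflated by selection, are the ones shrunk the most.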

And now for some amateur psychologizing (unsupported by any statistical analysis, correlational or other)

I suspect that one of the motivations of Vul et al. in writing their article was frustration at too-good-to-be-true numbers which they felt led to exaggerated claims of neuro super-science.

Conversely, I suspect one of the frustrations of Lieberman et al. is that they are doing a lot more than correlations and fishing expeditions–they’re running experiments to test theories in psychology, they’re trying to synthesize results from many different labs. And from that perspective it must be frustrating for them to see a criticism (featured in the popular press) that is so focused on correlation, which is really the least of their concerns.

It also seems that both sides were irritated by what they saw as giddy press coverage: on one side, claims of dramatic breakthroughs in understanding the biological basis of behavior and personality; on the other, claims of a dramatic Emperor-has-no-clothes debunking. As scientists, most of us welcome press coverage–after all, we think this work is important and we’d like others to know about it–but . . . fawning press coverage of something that we think is wrong–that’s just annoying.

P.S. Wager is a friend–he teaches in the psychology department here–but I don’t think my personal knowledge has hindered my evaluation here.

P.P.S. I ran the above by various people involved and they gave some helpful clarifications. But I’ve probably left in a couple of sloppy statements here and there.

6 thoughts on “More on the voodoo correlations in neuroscience”

  1. Researchers doing neuroimaging are usually aware of the problem of multiple comparisons and try to correct for it. Nevertheless, in a count of the types of p-values present in a sample of neuroimaging papers (from my "Brede Database") I found that far from all results in neuroimaging are based on corrected p-values.

    Apart from this issue I "feel" that even with corrected p-values there are quite a number of "strange" results in the typical neuroimaging paper. Whether this is due to the limited number of subjects typically involved in a study, a statistical model that is too far wrong, or a general problem with measuring such an elusive object as the brain, I do not know.

  2. Prof. Gelman,

    the problem of "voodoo correlations," in a broad sense, affects many disciplines, not only fMRI studies.

    Would you like to see more "voodoo correlations" papers to stimulate the debate?


  3. An excellent discussion of this paper!

    One thing which I think we need to bear in mind is that Vul et al.'s argument actually relies upon there being a statistically significant correlation in the first place. The problem arises if you then pick out those voxels in which the correlation passes a threshold, and report on the average correlation in those voxels. That average will, by definition, be high because it has to exceed the threshold for statistical significance (generally quite conservative given that it is corrected for multiple comparisons).

    However, Vul et al. also include a second, different argument (see pages 18-19 and footnote #17) which accuses an unspecified number of papers of a serious failure of multiple-comparisons testing based on a misreading of a 1995 stats paper…

    However, unless a paper falls prey to this (or a similar) error, there is no reason to believe that a given correlation is entirely voodoo. The magnitude may be inflated, but not from zero.

    I discuss this here.

  4. The response is reasonable. They don't say much about the comment that Martin Lindquist and I wrote, which makes sense, given that we pretty much are in agreement with their original article. The only thing in the rejoinder that I really disagree with is their emphasis on cross-validation as a way of estimating correlations without selection issues. I think a hierarchical Bayesian approach has much more potential here. Cross-validation is crude and does not make use of the hierarchical structure of the problem.
