
fMRI clusterf******

Several people pointed me to this paper by Anders Eklund, Thomas Nichols, and Hans Knutsson, which begins:

Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.

I’m not a big fan of the whole false-positive, false-negative thing. In this particular case it makes sense because they’re actually working with null data, but ultimately what you’ll want to know is what’s happening to the estimates in the more realistic case that there are nonzero differences amidst the noise. The general message is clear, though: don’t trust fMRI p-values. And let me also point out that this is yet another case of a classical (non-Bayesian) method that is fatally assumption-based.
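To make the "we should find 5% false positives" logic concrete, here is a toy sketch of how a false-positive rate can be estimated from null data. It is not the paper's analysis: the paper's inflated rates concern cluster-wise inference on real resting-state scans, which this does not model. The sketch assumes 20 independent tests per simulated study (a made-up number), and `familywise_rate` is a hypothetical helper name. It only shows that a nominal 5% rate holds when the correction's assumptions are met, and fails badly when they are not.

```python
import random
from statistics import NormalDist


def familywise_rate(n_tests, threshold, n_sims=2000, seed=0):
    """Share of null 'studies' in which at least one test crosses the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Under the null, every test statistic is pure noise: z ~ N(0, 1).
        if any(abs(rng.gauss(0.0, 1.0)) > threshold for _ in range(n_tests)):
            hits += 1
    return hits / n_sims


n_tests = 20   # assumed number of independent tests per study (illustrative)
alpha = 0.05   # nominal two-sided significance level

# Per-test threshold with no correction, and with a Bonferroni correction.
z_uncorrected = NormalDist().inv_cdf(1 - alpha / 2)
z_bonferroni = NormalDist().inv_cdf(1 - alpha / (2 * n_tests))

rate_uncorrected = familywise_rate(n_tests, z_uncorrected)
rate_bonferroni = familywise_rate(n_tests, z_bonferroni)

print(f"uncorrected: {rate_uncorrected:.3f}")
print(f"Bonferroni:  {rate_bonferroni:.3f}")
```

With independent tests the uncorrected rate should land near 1 - 0.95^20, roughly 0.64, while the Bonferroni rate should stay near 0.05. The paper makes the same kind of comparison, but with real null data in place of simulated noise.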

Perhaps the most disturbing thing about this study is how unsurprising it all is. In one sense, it’s big, big news: fMRI is a big part of science nowadays, and if it’s all being done wrong, that’s a problem. But, from another perspective, it’s no surprise at all: we’ve been hearing about “voodoo correlations” in fMRI for nearly a decade now, and I didn’t get much sense that the practitioners of this sort of study were doing much of anything to clean up their act. I pretty much don’t believe fMRI studies on the first try, any more than I believe “gay gene” studies or various other headline-of-the-week auto-science results.

What to do? Short-term, one can handle the problem of bad statistics by insisting on preregistered replication, thus treating traditional p-value-based studies as screening exercises. But that’s a seriously inefficient way to go: if you don’t watch out, your screening exercises are mostly noise, and then you’re wasting your effort with the first study, then again with the replication.

On the other hand, if preregistered replication becomes a requirement for an fMRI study to be taken seriously (I’m looking at you, PPNAS; I’m looking at you, Science and Nature and Cell; I’m looking at you, TED and NIH and NPR), then it won’t take long before researchers themselves realize they’re wasting their time.

The next step, once researchers learn to stop bashing their heads against the wall, will be better data collection and statistical analysis. When the motivation for spurious statistical significance goes away, there will be more motivation for serious science.

Something needs to be done, though. Right now the incentives are all wrong. Why not do a big-budget fMRI study? In many fields, this is necessary for you to be taken seriously. And it’s not like you’re spending your own money. Actually, it’s the opposite: at least within the university, when you raise money for a big-budget experiment, you’re loved, because the university makes money on the overhead. And as long as you close your eyes to the statistical problems and move so fast that you never have to see the failed replications, you can feel like a successful scientist.

The other thing that’s interesting is how this paper reflects divisions within PPNAS. On one hand you have editors such as Susan Fiske or Richard Nisbett who are deeply invested in the science-as-routine-discovery-through-p-values paradigm; on the other, you have editors such as Emery Brown (editor of this particular paper; full disclosure, I know Emery from grad school) who as a statistician has a more skeptical take and who has nothing to lose by pulling the house down.

Those guys at Harvard (but not in the statistics department!) will say, “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” But they’re innumerate, and they’re wrong. Time for us to move on, time for the scientists to do more science and for the careerists to find new ways to play the game.

P.S. An economist writes in:

I wanted to provide a bit more context/background for your recent fMRI post. It went from a short comment to something much longer. Unfortunately, this is another time that a sensational headline misrepresents the actual content of the paper. I recently left academia and started a blog (among other things) but still have a few things far enough along that they might be published one day.


  1. Anonymous says:

    It was later discovered that this didn’t matter much at all. That is what you always find out about this p value stuff, it never actually matters after close inspection and all inference was actually made via other means, supposedly.

  2. Dear Andrew

    You might be interested in reading the reply (comment) on this from some of the developers of SPM (one of the most popular packages for fMRI analysis)


  3. Also, the blog entry from one of the authors of the original cluster failure paper (Thomas Nichols) might help give the reader a more complete perspective on this

  4. Garnett says:

    But those pictures of glowing brains are so cool!

  5. Perhaps we should first get the assumptions about the nature of the signal right.
    Multi-scale, multi-resolution dependencies extending over many lags of time should probably not be analysed using off-the-shelf linear additive component models.

    Model-free analysis of brain fMRI data by recurrence quantification.

    Colored noise and computational inference in neurophysiological (fMRI) time series analysis: Resampling methods in time and wavelet domain

    Combining fMRI with EEG and MEG in order to relate patterns of brain activity to cognition

    Network hubs in the human brain

  6. Ibn says:

    fMRI is the reason I say science is heavily overfunded by states, contrary to the constant wailing. Entire lunatic social science departments could be financed from a fraction of the money wasted on one of these projects.

  7. Ariel Rokem says:

    Note that the 40k number is (ironically?) off by more than an order of magnitude (it’s probably closer to 3,500), as explained in the authors’ correction to the original article, and in this blog post.

  8. Angus says:

    Is this any different from the Dead Salmon Study?

  9. Jack Gallant says:

    The core of the problem is, of course, over-reliance on p values and the point-null hypothesis testing framework. This is clearly a problem in conventional fMRI, but it affects most of psychology and biology. Pre-registration of studies would probably reduce the Type I error rate in these studies, but it would leave the point-null hypothesis testing framework intact.
    Some fMRI researchers have made a more radical break with the conventional approach, relying instead on statistical methods that increase reliability and interpretability of results: collecting separate estimation and validation data sets, focusing on prediction accuracy instead of significance, fitting models to each subject individually before conducting further analysis at the group level, and moving away from discriminative models and toward generative models.
    The problem is that there are still many many people who were trained in the point-null hypothesis testing framework, and it will take a long time for the more modern approaches to filter through the community.

    • Anoneuoid says:

      >”collecting separate estimation and validation data sets, focusing on prediction accuracy”

      This would work great with pre-registration, as long as it is done right. The “validation” (I would call it “test”) dataset should be secret from the investigators until the last step. Then they get to run the model on it *once* to assess predictive skill. Basically, Kaggle does it right; that scheme should be mimicked by academia.

      Of course, you will still get issues such as human behaviour being non-stationary, etc., but at least a big chunk of the overconfidence that goes on can be eliminated using a straightforward process. Is there the political will to purposefully limit the amount of hype that can be generated from overfitting to noise, though? What is the current vibe in these fields? When I read a recent paper it is still most often the long-infamous* NHST + dynamite plots.

      *“one wonders whether the function of statistical techniques in the social sciences is not primarily to provide a machinery for producing phoney corroborations and thereby a semblance of ‘scientific progress’ where, in fact, there is nothing but an increase in pseudo-intellectual garbage.” (pg 88-89)
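The one-shot test-set idea in the comment above can be sketched in a few lines. This is a toy illustration under stated assumptions, not anyone's actual pipeline: the labels are coin flips, each hypothetical "model" is just random guessing, and the names `accuracy` and `select_then_test` are invented for the example. It shows how picking the best of many models on a reused validation set can look impressive on pure noise, while a single run on a locked-away test set is expected to sit near chance.

```python
import random


def accuracy(predictions, labels):
    """Fraction of 0/1 predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def select_then_test(n_models=200, n_val=50, n_test=50, seed=0):
    rng = random.Random(seed)
    # Labels are coin flips, so no model can truly beat 50% accuracy.
    val_labels = [rng.randint(0, 1) for _ in range(n_val)]
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    # Each hypothetical "model" is random guessing on both data sets.
    models = [
        ([rng.randint(0, 1) for _ in range(n_val)],
         [rng.randint(0, 1) for _ in range(n_test)])
        for _ in range(n_models)
    ]
    # Choose the model that looks best on the validation set, which is
    # reused for every candidate.
    best_val, best_test = max(models, key=lambda m: accuracy(m[0], val_labels))
    val_acc = accuracy(best_val, val_labels)
    # Score the chosen model on the held-back test set exactly once.
    test_acc = accuracy(best_test, test_labels)
    return val_acc, test_acc


val_acc, test_acc = select_then_test()
print(f"best validation accuracy: {val_acc:.2f}")
print(f"one-shot test accuracy:   {test_acc:.2f}")
```

The gap between the two numbers is the overconfidence that comes from selecting on noise. The test-set number is only honest because it is computed once, after every modeling choice has been frozen.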
