“The Null Hypothesis Screening Fallacy”?

[non-cat picture]

Rick Gerkin writes:

A few months ago you posted your list of blog posts in draft stage and I noticed that “Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Not.” was still on that list. It was about some concerns I had about a paper in Science. After talking it through with them, the authors of that paper eventually added a correction to the article. I think the issues with that paper are a bit deeper (as I published elsewhere), but still it takes courage to acknowledge the merit of the concerns and write a correction.

Meanwhile, two of the principal investigators from that paper produced a new, exciting data set which was used for a Kaggle-like competition. I won that competition and became a co-first author on a *new* paper in Science.

And this is great! I totally respect them as scientists and think their research is really cool. They made an important mistake in their paper and since the research question was something I care a lot about I had to call attention to it. But I always looked forward to moving on from that and working on the other paper with them, and it all worked out.

That is such a great attitude.

Gerkin continues:

Yet another lesson that most scientific disputes are pretty minor, and working together with the people you disagreed with can produce huge returns. The second paper would have been less interesting and important if we hadn’t been working on it together.

What a wonderful story!

Here’s the background. I received the following email from Gerkin a bit over a year ago:

About 3 months ago there was a paper in Science entitled “Humans Can Discriminate More than 1 Trillion Olfactory Stimuli.” You may have heard about it through normal science channels, or NPR, or the news. The press release was everywhere. It was a big deal because the conclusion that humans can discriminate a trillion odors was unexpected, previous estimates having been in the ~10,000 range. Our central concern is the analysis of the data.

The short version:
They use a hypothesis testing framework — not to reject a null hypothesis with type 1 error rate alpha — but essentially to convert raw data (the fraction of subjects discriminating correctly) into a more favorable form (the fraction of subjects discriminating significantly above chance). That converted fraction is then used to estimate an intermediate hypothetical variable, which, when plugged into another equation, produces the final point estimate of the “number of odors humans can discriminate.” However, small changes in the choice of alpha during this data-conversion step (or, equivalently, small changes in the number of subjects, the number of trials, etc.), by virtue of their highly nonlinear impact on that point estimate, undermine any confidence in that estimate. I’m pretty sure this is a misuse of hypothesis testing. Does this have a name? Gelman’s fallacy?
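To make the conversion step concrete, here is a minimal sketch of the thresholding idea. Everything here is hypothetical — the panel size of 26 subjects and the per-pair counts are made up for illustration, not taken from the paper:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def significance_threshold(alpha, n, p=1/3):
    """Smallest number of correct subjects k with P(X >= k) <= alpha/2 under chance."""
    k = 0
    while binom_cdf(k - 1, n, p) < 1 - alpha / 2:
        k += 1
    return k

n_subjects = 26                      # hypothetical panel size
counts = [10, 12, 13, 14, 15, 17]    # hypothetical correct-subject counts for six pairs
for alpha in (0.05, 0.01, 0.001):
    t = significance_threshold(alpha, n_subjects)
    frac = sum(c >= t for c in counts) / len(counts)
    print(f"alpha={alpha}: threshold={t} of {n_subjects}, fraction 'significant'={frac:.2f}")
```

With these made-up counts, the identical raw data converts to a “significant” fraction of 0.50, 0.17, or 0.00 depending only on the choice of alpha — the nonlinearity Gerkin is describing.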

I replied:

People do use hyp testing as a screen. When this is done, it should be evaluated as such. The p-values themselves are not so important, you just have to consider the screening as a data-based rule and evaluate its statistical properties. Personally, I do not like hyp-test-based screening rules: I think it makes more sense to consider screening as a goal and go from there. As you note, the p-value is a highly nonlinear transformation of the data, with the sharp nonlinearity occurring at a somewhat arbitrary place in the scale. So, in general, I think it can lead to inferences that throw away information. I did not go to the trouble of following your link and reading the original paper, but my usual view is that it would be better to just analyze the raw data (taking the proportions for each person as continuous data and going from there, or maybe fitting a logistic regression or some similar model to the individual responses).
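One version of “just analyze the raw data”: fit the per-pair proportions directly against the stimulus difference, with no alpha anywhere. The numbers below are toy values for illustration (a logistic regression on the individual responses would be the fuller treatment):

```python
def least_squares(xs, ys):
    """Ordinary least-squares fit y = b0 + b1*x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

# hypothetical data: number of differing components D vs. raw proportion correct
D = [3, 6, 9]
prop_correct = [0.40, 0.60, 0.80]
b0, b1 = least_squares(D, prop_correct)
print(f"proportion correct ≈ {b0:.2f} + {b1:.3f} * D")
```

Nothing in this fit depends on a significance cutoff, so there is no arbitrary knob for the downstream estimate to be sensitive to.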

Gerkin continued:

The long version:
1) Olfactory stimuli (basically vials of molecular mixtures) differed from each other according to the number of molecules they each had in common (e.g. 7 in common out of 10 total, i.e. 3 differences). All pairs of mixtures for which the stimuli in the pair had D differences were assigned to stimulus group D.
2) For each stimulus pair in a group D, the authors computed the fraction of subjects who could successfully discriminate that pair using smell.
3) For each group D, they then computed the fraction of pairs in D for which that fraction of subjects was “significantly above chance”. By design, chance success had p=1/3, so a pair was “significantly above chance” if the fraction of subjects discriminating it correctly exceeded that given by the binomial inverse CDF with x=(1-alpha/2), p=1/3, N=# of subjects. The choice of alpha (an analysis choice) and N (an experimental design choice) clearly drive the results so far. Let’s denote by F that fraction of pairs exceeding the threshold determined by the inverse CDF.
4) They did a linear regression of F vs. D. They defined something called a “limen” (basically a fancy term for a discrimination threshold) and set it equal to the value of X solving 0.5 = beta_0 + beta_1*X, where the betas are the regression coefficients.
5) They then plugged X into yet another equation with more parameters, and the result was their estimate of the number of discriminable olfactory stimuli.
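Steps 3 and 4 can be sketched in a few lines to show the sensitivity Gerkin describes. All counts here are invented (four pairs per group, 26 subjects; the real study had far more pairs), and step 5’s combinatorial formula is deliberately not reproduced — the point is only that the limen itself moves when nothing but alpha changes:

```python
from math import comb

def significance_threshold(alpha, n, p=1/3):
    """Smallest number of correct subjects k with P(X >= k) <= alpha/2 under chance."""
    cdf, k = 0.0, 0
    while cdf < 1 - alpha / 2:
        cdf += comb(n, k) * p**k * (1 - p)**(n - k)
        k += 1
    return k

def least_squares(xs, ys):
    """Ordinary least-squares fit y = b0 + b1*x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

n_subjects = 26
# hypothetical correct-subject counts for four pairs in each group D
groups = {3: [11, 12, 13, 14], 6: [13, 15, 16, 17], 9: [16, 18, 20, 22]}

for alpha in (0.05, 0.01):
    t = significance_threshold(alpha, n_subjects)
    Ds = sorted(groups)
    F = [sum(c >= t for c in groups[d]) / len(groups[d]) for d in Ds]  # step 3
    b0, b1 = least_squares(Ds, F)                                      # step 4
    limen = (0.5 - b0) / b1     # D at which the fitted line crosses 0.5
    print(f"alpha={alpha}: limen={limen:.2f}")
```

With these toy numbers, moving alpha from 0.05 to 0.01 shifts the limen from about 4.7 to 6.0 — and step 5 then feeds that limen through a steeply nonlinear formula, amplifying the shift into orders of magnitude in the final estimate.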

My reply: I’ve seen a lot of this sort of thing, over the years. My impression is that people are often doing these convoluted steps, not so much out of a desire to cheat but rather because they have not ever stepped back and tried to consider their larger goals. Or perhaps they don’t have the training to set up a model from scratch.

Here’s Gerkin again:

I think it was one of those cases where an experimentalist talked to a mathematician, and the mathematician had some experience with a vaguely similar problem and suggested a corresponding framework that unfortunately didn’t really apply to the current problem. The kinds of stress tests one would apply to the resulting model to make sure it makes sense of the data never got applied.

And then he continued with his main thread:

If you followed this, you’ve already concluded that their method is unsound even before we get to steps 4 and 5 (which I believe are unsound for unrelated reasons). I also generated figures showing that reasonable alternative choices of all of these variables yield estimates of the number of olfactory stimuli ranging from 10^3 to 10^80. I have Python code implementing this reanalysis, and figures, available. But what I am wondering most is: is there a name for what is wrong with that screening procedure? Is there some adage that can be rolled out, or work cited, to illustrate this to the author?

To which I replied:

I don’t have any name for this one, but perhaps one way to frame your point is that the term “discriminate” in this context is not precisely determined. Ultimately the question of whether two odors can be “discriminated” should have some testable definition: that is, not just a data-based procedure that produces an estimate, but some definition of what “discrimination” really means. My guess is that your response is strong enough, but it does seem that if someone estimates “X” as 10^9 or whatever, it would be good to have a definition of what X is.

Gerkin concludes with a plea:

The one thing I would really, really like is for the fallacy I described to have a name—even better if it could be listed on your lexicon page. Maybe “The Null Hypothesis Screening Fallacy” or something. Then I could just refer to that link instead of to some 10,000-word explanation of it, every time this comes up in biology (which is all the time).

P.S. Here’s my earlier post on smell statistics.


  1. Statsgirl says:

    Cool stuff! Great to see collaboration improve the literature.

  2. Garnett says:

    “As you note, the p-value is a highly nonlinear transformation of the data, with the sharp nonlinearity occurring at a somewhat arbitrary place in the scale.”

    Can anyone point me to a good (and accessible) characterization of this phenomenon? I confront this issue with investigators on a daily basis and would like to present this concern to them in a more coherent way.

    • To me, the reason to do a screening is as a tool to make the modeling easier and more robust. For example, you have some process you are studying which has two kinds of things going on: “everyday stuff” and “unusual stuff.” You are interested in how the “unusual stuff” works, so you get a bunch of the “everyday stuff” and you look at its distribution of some measurement. Then, whenever you get a new measurement, you compare it to the range of stuff seen in your batch of “everyday stuff.” If it’s outside the typical range for that process, you feed the data into your model of the unusual stuff.

      Now, you don’t have to create a mixture model and analyze thousands or millions of data points, almost none of which are relevant to your question of interest; you just create a model for the unusual situation and analyze tens or hundreds of data points. This is extremely common in physics experiments; LHC and LIGO both rely on this, I’m sure, otherwise they’d be analyzing enormous quantities of data that isn’t relevant to the questions of interest (things explained by glancing collisions between particles, or earthquakes, or car accidents, or whatever).

      The problem comes when you see the filtering as an end in itself “look we found N unusual data points”. This is often written as “these 38 genes are upregulated during XYZ” or something like that. This is much more typical in biology.

      But in this case, it seems more like they’re actually modeling the process of “nasal discrimination” as if it were a statistical significance filter! I haven’t seen that before.
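The two-stage idea the commenter describes can be sketched in a few lines. All numbers below are synthetic; the point is only the structure — build a reference range from the “everyday stuff,” then pass only the out-of-range measurements on to the model of the unusual process:

```python
import random

random.seed(0)

# stage 1: characterize the "everyday stuff" with a reference range
everyday = sorted(random.gauss(0.0, 1.0) for _ in range(1000))
lo, hi = everyday[5], everyday[994]      # roughly the 0.5th and 99.5th percentiles

# stage 2: screen new measurements; only flagged ones reach the real model
new_measurements = [0.3, 5.2, -0.8, 6.1, 1.1]
unusual = [x for x in new_measurements if not (lo <= x <= hi)]
print(f"reference range ({lo:.2f}, {hi:.2f}); flagged for analysis: {unusual}")
```

The screen here is a means to an end (shrinking the data to what the unusual-process model must explain), not the finding itself — which is exactly the distinction drawn in the following paragraph of the comment.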

      • Garnett says:

        Thanks for your insights. The biology example is the sort of thing I regularly deal with in neuroscience.

        So, taking your suggestion we would need to construct a reference distribution of a measurement, say a correlation coefficient, from correlations measured in lots of ‘everyday stuff’. How do we decide what measurements contribute to everyday stuff? Are they measurements that consensus states are uninteresting? This seems especially challenging in weakly theoretical sciences….

        • Martha (Smith) says:


          For the type of thing you seem to be talking about (figuring out what is unusual), you might find Brad Efron’s book Large Scale Inference helpful. As I recall, he considers a mixture model (everyday and unusual combined — e.g., might look like a normal with a bump toward one end) and uses empirical Bayes to try to distinguish unusual from tail of usual.

        • Garnett:

          It takes some creativity to think up what to do. But I think the first step is to acknowledge what it is you’re *trying* to do. In the example of say gene up and downregulation consider the following:

          We measure counts using RNA-seq for 30,000 genes. We have 3 biological replicates in condition A and 3 in condition B. Now look at a single gene X which you’re interested in. Compare the counts for condition A among the three bio replicates to the counts for condition B; do a t-test or something like that. What a small p-value is telling you is: “the counts in condition B are outside the range of what you might expect if you characterize the distribution of counts in condition A based on a normal sampling assumption and those *3 data points*.”

          Now, how much are you willing to hang your hat on the idea that your sampling distribution *is* normal? I’d basically guarantee you that those count data are *not* even approximately normal. Even the sample average, which is the thing that needs to have a normal distribution, with unknown mean and variance, is probably not very normal for 3 samples from RNA-seq gene counts. Maybe for 20 or 30 biological replicates. But when was the last time someone brought you 20 or 30 biological replicates for *just the controls* let alone an additional 20 or 30 for the experimental condition? Never.
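The claim about the sample average can be checked with a quick simulation. The lognormal below is a generic stand-in for skewed count data, not a model of any real RNA-seq run:

```python
import random

random.seed(2)

def skewness(xs):
    """Sample skewness: third central moment over sd cubed (0 for a normal)."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum((x - m) ** 3 for x in xs) / n / sd ** 3

# 10,000 simulated experiments, each averaging 3 skewed "count" replicates
means_of_3 = [sum(random.lognormvariate(0, 1) for _ in range(3)) / 3
              for _ in range(10_000)]
print(f"skewness of the 3-replicate sample mean: {skewness(means_of_3):.2f}")
```

The averaged-over-3 distribution stays strongly right-skewed, so the normal-sampling assumption behind the t-test is doing a lot of unearned work at n = 3.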

          Right, so you need a way to understand the typical range of changes to expect. One way to go about it might be something like taking counts(B)/counts(A) across *all the genes and all the bio-replicates* and assuming this is composed of three groups: genes for which the ratio is pure noise (“nothing’s going on in that gene”), genes that are “consistently upregulated,” and genes that are “consistently downregulated.”

          So, now you could create a Bayesian mixture model for the three kinds of things, with informed priors, such as a strict ordering of the three locations, giving the “nothing’s going on” distribution a strong prior for the location at “no change” (maybe 0 on a log scale? but don’t forget batch effects and things, you might have biases from one run of the machine to another etc) and maybe describing the up and down regulated distributions as skewed away from zero, perhaps skew-normal or exponentially modified normal or truncated t distributions or something.

          Now, fit this model to the whole *30,000-gene genome* and get inferences for the “nothing’s going on” case. Use just *this* component to filter your genes based on “unusual under the assumption that nothing’s going on,” and then begin your analysis of the genes that come out of that filter…

          Or something like that; there are many ways to attack this issue. The goal is simply to “eliminate the stuff I’m not interested in and *then* begin the analysis of just the remaining stuff.”
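As a crude, non-Bayesian stand-in for that mixture idea (all numbers synthetic, and a robust location/scale estimate replacing the informed-prior mixture fit), one can estimate the “nothing’s going on” component from the bulk of the log-ratios and flag only genes inconsistent with it before any downstream analysis:

```python
import random

random.seed(1)

# hypothetical log2(counts B / counts A): mostly null noise plus a few shifted genes
log_ratios = [random.gauss(0.0, 0.3) for _ in range(990)] + \
             [random.gauss(2.5, 0.4) for _ in range(10)]

# robust location/scale of the null component (median and MAD shrug off the contamination)
med = sorted(log_ratios)[len(log_ratios) // 2]
mad = sorted(abs(x - med) for x in log_ratios)[len(log_ratios) // 2]
sigma = 1.4826 * mad     # MAD -> standard deviation for a normal null

# keep only genes inconsistent with "nothing's going on"; the real analysis starts here
flagged = [i for i, x in enumerate(log_ratios) if abs(x - med) > 3 * sigma]
print(f"null component: med={med:.2f}, sigma={sigma:.2f}; {len(flagged)} genes pass the filter")
```

Unlike the alpha-threshold screen criticized in the main post, the filter here is not the conclusion — it just decides which handful of genes the subsequent model has to explain.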
