The Failure of Null Hypothesis Significance Testing When Studying Incremental Changes, and What to Do About It

Here it is:

A standard mode of inference in social and behavioral science is to establish stylized facts using statistical significance in quantitative studies. However, in a world in which measurements are noisy and effects are small, this will not work: selection on statistical significance leads to effect sizes which are overestimated and often in the wrong direction. After a brief discussion of two examples, one in economics and one in social psychology, we consider the procedural solution of open postpublication review, the design solution of devoting more effort to accurate measurements and within-person comparisons, and the statistical analysis solution of multilevel modeling and reporting all results rather than selection on significance. We argue that the current replication crisis in science arises in part from the ill effects of null hypothesis significance testing being used to study small effects with noisy data. In such settings, apparent success comes easy but truly replicable results require a more serious connection between theory, measurement, and data.
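The claim that selecting on significance yields estimates that are exaggerated, and often wrong in sign, can be checked with a small simulation. The sketch below is only an illustration under assumed numbers, not the article's own analysis. It takes a hypothetical true effect of 0.1 measured with a standard error of 0.5, draws many replications, and keeps only those with |estimate| > 1.96 standard errors. The function name and all parameter values are assumptions made for this example.

```python
import numpy as np

def simulate_significance_filter(true_effect=0.1, se=0.5, n_sims=100_000, seed=0):
    """Simulate replications of a noisy study and keep only the 'significant' ones.

    All default values are hypothetical and chosen only to illustrate a small
    effect measured with a lot of noise.
    """
    rng = np.random.default_rng(seed)
    # Each replication yields an unbiased but noisy estimate of the true effect.
    estimates = rng.normal(true_effect, se, size=n_sims)
    # Conventional two-sided test at the 5% level.
    significant = np.abs(estimates) > 1.96 * se
    sig_estimates = estimates[significant]

    power = significant.mean()
    # Average factor by which significant estimates overstate the true effect.
    exaggeration = np.abs(sig_estimates).mean() / abs(true_effect)
    # Share of significant estimates whose sign is opposite to the true effect.
    wrong_sign = (np.sign(sig_estimates) != np.sign(true_effect)).mean()
    return power, exaggeration, wrong_sign

power, exaggeration, wrong_sign = simulate_significance_filter()
print(f"share significant:       {power:.3f}")
print(f"exaggeration factor:     {exaggeration:.1f}")
print(f"share with wrong sign:   {wrong_sign:.3f}")
```

With settings like these, only a small share of replications reach significance, and the ones that do overstate the effect many times over, with a sizable fraction pointing in the wrong direction. Shrinking `se` relative to `true_effect` (that is, better measurement) makes both problems fade.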

The article was published in 2018 but it remains relevant, all these many years later.

13 thoughts on “The Failure of Null Hypothesis Significance Testing When Studying Incremental Changes, and What to Do About It”

  1. The bigger problem is people not understanding the theory behind statistical maths in the first place:

    “A variety of tests (t-test, F-test, chi-square, goodness of fit, etc.) are used to announce that the difference between two [variations] is either significant or not significant. Unfortunately, such calculations are a mere formality. Significance or lack of it provides no degree of belief – high, moderate, or low – about prediction of performance in the future, which is the only reason to carry out the comparison, test, or experiment in the first place.”

    and then

    “Any symmetric function of a set of numbers almost always throws away a large portion of the information in the data. Thus, interchange of any two numbers in the calculation of the mean of a set of numbers, their variance, or their fourth moment does not change the mean, variance, or fourth moment. A statistical test is a symmetric function of the data.”

    Deming, 1990

  2. “The article was published in 2018 but it remains relevant, all these many years later.”

    Recognizing that time moves on in a nonlinear way, but 2018 was only 4 years ago. On the other hand, I guess not much has happened since then besides the odd pandemic and occasional invasion and insurrection.

  3. We argue that the current replication crisis in science arises in part from the ill effects of null hypothesis significance testing being used to study small effects with noisy data.

    This seems to imply the small effects and noisy data are something independent of NHST. We should also consider whether that situation has resulted from NHST over time.

    When everyone is failing to test their theories a lot of junk gets built up that is used to inform future experiments. At some point the confusion can make it impossible to even discover large effects.

    For example, let’s say we want to know whether firemen should put water on a house fire.

    You randomly select 100 houses and pay the owners, or build model houses, whatever. The point is there should be variability in the size/construction/contents of these houses.

    Now start some fires and let them burn for 20 minutes or however long it usually takes for people to notice.

    Next comes the water. Split into two groups, treatment and control. Each house in the treatment group is allotted n gallons of water that the firemen spray on the house at some standard rate. Control group gets nothing.

    Then we examine the results and see a mix of:

    1) Houses that survived intact because it happened to be raining when the fire started. Or any other reason the fire may fizzle out on its own.

    2) Houses that burned too fast, 20 minutes was too late.

    3) Houses that were too small or unsturdy and even though the fire was put out they got flooded or washed away by the water.

    4) Houses with larger fires (perhaps larger houses), not enough water was used so the firemen had to watch it burn down when they ran out.

    5) Houses that did indeed benefit from the allotted amount of water.

    You can see the problem in the experiment due to lack of theoretical understanding. The amount of water needs to be proportional to the size of the fire. Without that knowledge, the obvious large effect becomes muddled in noise.

    And if the standard amount of water chosen was far too small or large, then you could see near zero benefit or even harm.
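
    The thought experiment can be put into a toy simulation. This is a rough sketch, and every number in it is an assumption made up for illustration: fire sizes (measured as the water needed) are lognormal, 10% of fires fizzle out on their own, a burned house is a total loss, and excess water causes flood loss at a rate of 0.3 per unit. The function `fire_experiment` and its parameters are hypothetical names.

```python
import numpy as np

def fire_experiment(dose_rule, n_houses=100, fizzle_prob=0.1, flood_rate=0.3, seed=0):
    """Return the estimated benefit of water: mean control loss minus mean treated loss.

    dose_rule maps the array of water needed per house to the water actually sprayed.
    All distributions and rates here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    fire_size = rng.lognormal(mean=0.0, sigma=1.0, size=n_houses)
    fizzles = rng.random(n_houses) < fizzle_prob
    # A fire that goes out on its own needs no water at all.
    needed = np.where(fizzles, 0.0, fire_size)

    # Randomize exactly half of the houses to treatment.
    treated = rng.permutation(n_houses) < n_houses // 2
    water = np.where(treated, dose_rule(needed), 0.0)

    # Too little water: the house burns down (loss 1).
    # Otherwise the fire is out, but excess water floods the house.
    burned = water < needed
    flood_loss = np.minimum(1.0, flood_rate * (water - needed))
    loss = np.where(burned, 1.0, flood_loss)
    return loss[~treated].mean() - loss[treated].mean()

rules = {
    "proportional to fire": lambda needed: needed,
    "fixed, moderate (1.0)": lambda needed: np.full_like(needed, 1.0),
    "fixed, too small (0.05)": lambda needed: np.full_like(needed, 0.05),
    "fixed, too large (20)": lambda needed: np.full_like(needed, 20.0),
}
for name, rule in rules.items():
    print(f"{name:26s} estimated benefit: {fire_experiment(rule):+.2f}")
```

    Under these assumptions the proportional rule shows the large effect clearly, a moderate fixed dose shows a diluted one, and a badly chosen fixed dose shows roughly nothing or a net harm, even though water obviously puts out fires.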

    • Also, this classic story from Richard Feynman:

      All experiments in psychology are not of this type, however. For example, there have been many experiments running rats through all kinds of mazes, and so on–with little clear result. But in 1937 a man named Young did a very interesting one. He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was. He wanted to see if he could train the rats to go in at the third door down from wherever he started them off. No. The rats went immediately to the door where the food had been the time before.

      The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before? Obviously there was something about the door that was different from the other doors. So he painted the doors very carefully, arranging the textures on the faces of the doors exactly the same. Still the rats could tell. Then he thought maybe the rats were smelling the food, so he used chemicals to change the smell after each run. Still the rats could tell. Then he realized the rats might be able to tell by seeing the lights and the arrangement in the laboratory like any commonsense person. So he covered the corridor, and still the rats could tell.

      He finally found that they could tell by the way the floor sounded when they ran over it. And he could only fix that by putting his corridor in sand. So he covered one after another of all possible clues and finally was able to fool the rats so that they had to learn to go in the third door. If he relaxed any of his conditions, the rats could tell.

      Now, from a scientific standpoint, that is an A-number-one experiment. That is the experiment that makes rat-running experiments sensible, because it uncovers the clues that the rat is really using– not what you think it’s using. And that is the experiment that tells exactly what conditions you have to use in order to be careful and control everything in an experiment with rat-running.

      I looked up the subsequent history of this research. The next experiment, and the one after that, never referred to Mr. Young. They never used any of his criteria of putting the corridor on sand, or being very careful. They just went right on running the rats in the same old way, and paid no attention to the great discoveries of Mr. Young, and his papers are not referred to, because he didn’t discover anything about the rats. In fact, he discovered all the things you have to do to discover something about rats. But not paying attention to experiments like that is a characteristic example of cargo cult science.

      https://sites.cs.ucsb.edu/~ravenben/cargocult.html

      Apparently no one has ever figured out who “Mr. Young” referred to though. But the point is that some foundational knowledge is usually required to see anything but “small effects with noisy data.”

      • There are papers by Paul Thomas Young about feeding rats that conform in a general sense to Feynman’s story, although the ones I cite below do not mention sand or sound:

        Young, P. T. (1932). Relative food preferences of the white rat. Journal of Comparative Psychology, 14(3), 297–319. https://doi.org/10.1037/h0074945
        Young, P. T. (1938). Preferences and demands of the white rat for food. Journal of Comparative Psychology, 26(3), 545.
        Young, P. T. (1957). Psychologic factors regulating the feeding process. The American Journal of Clinical Nutrition, 5(2), 154-161.

        I chose these in part because Feynman cites 1937. This could be a miscite of 1957, or it could be that the work was published in 1938, although circulated earlier.

        The 1932 and 1938 articles describe in great detail modifying the apparatus to test for rat food preferences in all manner of ways. These are not the ways described by Feynman; in fact, the examples he cites are less sophisticated than what is described in the papers. Perhaps not all details of the work made it into print, or perhaps I simply haven’t found the right papers of Young’s (there were many). However, the process of refining the experiments that the papers describe conforms in spirit to what Feynman relates.

        • It could be. That research roughly matches what he claimed, and I don’t think Feynman ever did any experiments with rats himself so he could easily gloss over details. Most people do not really distinguish between experiments using rats vs mice without having firsthand experience.

    • John:

      Interesting discussion and a great example of the way in which a scientific story becomes more useful with clear sourcing, as this anchors the story to reality (a point discussed by Thomas Basbøll and me here and here).

      • From your first link (second link didn’t show up):

        Like Holub, he invoked Albert Szent-Gyorgyi, the Nobel Prize–winning physiologist, as the original source of the story (though he did not clearly cite Holub as the source for this source).

        That is funny to see Szent-Gyorgyi come up. He is an interesting character:
        https://en.wikipedia.org/wiki/Albert_Szent-Gy%C3%B6rgyi

        As to Paul Thomas Young, maybe. But you would think there would be a paper or dissertation that matches Feynman’s description. It is frustrating that Feynman was never asked who he was referring to.

        • Anon:

          Link fixed; thanks.

          Regarding the Feynman story: yeah, that was my point. If he’d provided a reference, that would anchor his story in reality and make it immutable, thus more useful. My guess is that Feynman was referring to something he’d vaguely remembered, and he garbled the details which allowed him to make whatever point he wanted to make.

        • My guess is that Feynman was referring to something he’d vaguely remembered, and he garbled the details which allowed him to make whatever point he wanted to make.

          I was thinking something like in 30 years the thought experiment I posted above (on your blog) got repeated by someone as “Mr. Gelman’s experiment on putting out fires with water”.

        • Poor Andrew–having Anoneuoid’s crackpot thoughts being attributed to him. Anoneuoid?–I meant user “Nonbel” from ycombinator:

          https://news.ycombinator.com/threads?id=nonbel

          It’s the same fellow: virtually the same posts involving the same crackpot thoughts, and similarly frustrated people that have the misfortune of interacting with him.
