A standard mode of inference in social and behavioral science is to establish stylized facts using statistical significance in quantitative studies. However, in a world in which measurements are noisy and effects are small, this will not work: selection on statistical significance leads to effect sizes which are overestimated and often in the wrong direction. After a brief discussion of two examples, one in economics and one in social psychology, we consider the procedural solution of open postpublication review, the design solution of devoting more effort to accurate measurements and within-person comparisons, and the statistical analysis solution of multilevel modeling and reporting all results rather than selection on significance. We argue that the current replication crisis in science arises in part from the ill effects of null hypothesis significance testing being used to study small effects with noisy data. In such settings, apparent success comes easy but truly replicable results require a more serious connection between theory, measurement, and data.
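The exaggeration-under-selection point (type M and type S errors) can be made concrete with a small simulation. This is not from the article; the numbers (true effect 0.1, sd 1, 30 per group, and the rough |t| > 2 significance cutoff) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect, sd, n, n_sims = 0.1, 1.0, 30, 10_000

estimates, significant = [], []
for _ in range(n_sims):
    treat = rng.normal(true_effect, sd, n)
    ctrl = rng.normal(0.0, sd, n)
    diff = treat.mean() - ctrl.mean()
    se = np.sqrt(treat.var(ddof=1) / n + ctrl.var(ddof=1) / n)
    estimates.append(diff)
    significant.append(abs(diff / se) > 2.0)  # roughly p < 0.05 at df = 58

estimates, sig = np.array(estimates), np.array(significant)
print(f"mean estimate, all studies:      {estimates.mean():.3f}")
print(f"mean estimate, significant only: {estimates[sig].mean():.3f}")
print(f"significant results with wrong sign: {(estimates[sig] < 0).mean():.1%}")
```

Conditioning on significance, the average estimate is several times the true effect of 0.1, and a nontrivial share of the "significant" estimates even point the wrong way.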
The article was published in 2018 but it remains relevant, all these many years later.
The bigger problem is people not understanding the theory behind statistical maths in the first place:
“A variety of tests (t-test, F-test, chi-square, goodness of fit, etc.) are used to announce that the difference between two [variations] is either significant or not significant. Unfortunately, such calculations are a mere formality. Significance or lack of it provides no degree of belief – high, moderate, or low – about prediction of performance in the future, which is the only reason to carry out the comparison, test, or experiment in the first place.”
and then
“Any symmetric function of a set of numbers almost always throws away a large portion of the information in the data. Thus, interchange of any two numbers in the calculation of the mean of a set of numbers, their variance, or their fourth moment does not change the mean, variance, or fourth moment. A statistical test is a symmetric function of the data.”
Deming, 1990
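Deming's symmetric-function point can be checked directly. A small numpy sketch (the example values are mine): interchanging two observations leaves the mean, variance, and fourth moment untouched, but destroys order-dependent information such as a time trend.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x[[4, 1, 2, 3, 0]]  # interchange the first and last observations

# Symmetric summaries are unchanged by the interchange:
assert np.isclose(x.mean(), y.mean())
assert np.isclose(x.var(ddof=1), y.var(ddof=1))
assert np.isclose(((x - x.mean()) ** 4).mean(), ((y - y.mean()) ** 4).mean())

# But order-dependent information (here, a time trend) is lost:
t = np.arange(len(x))
print(np.corrcoef(t, x)[0, 1], np.corrcoef(t, y)[0, 1])  # perfect trend vs. not
```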
“The article was published in 2018 but it remains relevant, all these many years later.”
Recognizing that time moves on in a nonlinear way, but 2018 was only 4 years ago. On the other hand, I guess not much has happened since then besides the odd pandemic and occasional invasion and insurrection.
This seems to imply that the small effects and noisy data are something independent of NHST. We should also consider whether that situation has resulted from NHST over time.
When everyone is failing to test their theories, a lot of junk builds up that is then used to inform future experiments. At some point the confusion can make it impossible to discover even large effects.
For example, let’s say we want to know whether firemen should put water on a house fire.
You randomly select 100 houses and pay the owners, or build model houses, whatever. The point is there should be variability in the size/construction/contents of these houses.
Now start some fires and let it burn for 20 minutes or however long it usually takes for people to notice.
Next comes the water. Split into two groups, treatment and control. Each house in the treatment group is allotted n gallons of water that the firemen spray on the house at some standard rate. Control group gets nothing.
Then we examine the results and see a mix of:
1) Houses that survived intact because it happened to be raining when the fire started. Or any other reason the fire may fizzle out on its own.
2) Houses that burned too fast, so that 20 minutes was already too late.
3) Houses that were too small or unsturdy, so that even though the fire was put out, they were flooded or washed away by the water.
4) Houses with larger fires (perhaps larger houses) where not enough water was allotted, so the firemen had to watch them burn down after the water ran out.
5) Houses that did indeed benefit from the allotted amount of water.
You can see the problem with the experiment, which stems from a lack of theoretical understanding: the amount of water needs to be proportional to the size of the fire. Without that knowledge, an obviously large effect becomes muddled in noise.
And if the standard amount of water chosen was far too small or large, then you could see near zero benefit or even harm.
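A toy simulation of this thought experiment makes the point. The outcome model and every number in it are invented for illustration: a treated house survives only if the water at least matches the fire but is not so much that it floods the house; untreated houses survive only if the fire fizzles out on its own.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_trial(dose_rule, n=100):
    """Survival-rate difference, treatment minus control, under a made-up
    outcome model: water saves a treated house if it at least matches the
    fire but is no more than 5x it (flooding); untreated houses survive
    only if the fire fizzles out on its own (rain, luck, etc.)."""
    fire = rng.lognormal(0.0, 1.0, n)   # fire sizes vary widely across houses
    treated = rng.random(n) < 0.5
    water = dose_rule(fire)
    saved = (water >= fire) & (water <= 5 * fire)
    fizzles = fire < 0.2
    survived = np.where(treated, saved, fizzles)
    return survived[treated].mean() - survived[~treated].mean()

# A fixed "standard" dose, too small for most fires, shows little benefit:
print(f"fixed dose:        {run_trial(lambda fire: np.full_like(fire, 0.3)):+.2f}")
# Dosing in proportion to the fire reveals the large effect:
print(f"proportional dose: {run_trial(lambda fire: 1.5 * fire):+.2f}")
```

The underlying effect of water on fire is as large as effects get, yet the fixed-dose trial estimates something close to zero.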
Also, this classic story from Richard Feynman:
https://sites.cs.ucsb.edu/~ravenben/cargocult.html
Apparently no one has ever figured out who “Mr. Young” referred to though. But the point is that some foundational knowledge is usually required to see anything but “small effects with noisy data.”
There are papers by Paul Thomas Young about feeding rats that conform in a general sense to Feynman’s story, although the ones I cite below do not mention the sand or the sound cues:
Young, P. T. (1932). Relative food preferences of the white rat. Journal of Comparative Psychology, 14(3), 297–319. https://doi.org/10.1037/h0074945
Young, P. T. (1938). Preferences and demands of the white rat for food. Journal of Comparative Psychology, 26(3), 545.
Young, P. T. (1957). Psychologic factors regulating the feeding process. The American Journal of Clinical Nutrition, 5(2), 154-161.
I chose these in part because Feynman cites 1937. This could be a miscite of 1957, or it could be that the work was published in 1938, although circulated earlier.
The 1932 and 1938 articles describe in great detail modifying the apparatus to test rat food preferences in all manner of ways. They are not the ways described by Feynman; in fact, the examples he cites are less sophisticated than what is described in the papers. Perhaps not all details of the work made it into print, or perhaps I simply haven’t found the right papers of Young’s (there were many). However, the process of realizing the experiments that they describe conforms in spirit to what Feynman relates.
It could be. That research roughly matches what he claimed, and I don’t think Feynman ever did any experiments with rats himself, so he could easily have glossed over details. Most people do not really distinguish between experiments using rats vs. mice without firsthand experience.
Some suggestive evidence that Feynman’s Mr. Young might have been rat-runner psychologist Paul Thomas Young:
https://www.lesswrong.com/posts/6vSJe9WXCNvy3Wpoh/the-decline-effect-and-the-scientific-method-link
John:
Interesting discussion and a great example of the way in which a scientific story becomes more useful with clear sourcing, as this anchors the story to reality (a point discussed by Thomas Basbøll and me here and here).
From your first link (second link didn’t show up):
That is funny to see Szent-Gyorgyi come up. He is an interesting character:
https://en.wikipedia.org/wiki/Albert_Szent-Gy%C3%B6rgyi
As to Paul Thomas Young, maybe. But you would think there would be a paper or dissertation that matches Feynman’s description. It is frustrating that Feynman was never asked who he was referring to.
Anon:
Link fixed; thanks.
Regarding the Feynman story: yeah, that was my point. If he’d provided a reference, that would anchor his story in reality and make it immutable, thus more useful. My guess is that Feynman was referring to something he vaguely remembered, and he garbled the details, which allowed him to make whatever point he wanted to make.
I was thinking of something like this: in 30 years, the thought experiment I posted above (on your blog) gets repeated by someone as “Mr. Gelman’s experiment on putting out fires with water.”
Poor Andrew, having Anoneuoid’s crackpot thoughts attributed to him. Anoneuoid? I meant user “Nonbel” from ycombinator:
https://news.ycombinator.com/threads?id=nonbel
It’s the same fellow: virtually the same posts involving the same crackpot thoughts, and similarly frustrated people who have the misfortune of interacting with him.