I prepared the above image for this talk. The calculations come from the second column of page 6 of this article, and the psychology study that we’re referring to is discussed here.
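
For readers who want to reproduce numbers of this kind, here is a minimal sketch of a design calculation for a normally distributed estimate. The true effect of 2 percentage points is the hypothesized value discussed in the thread; the standard error of 8.1 is an assumption of this sketch, not a number stated in the text, and the function name `design_analysis` is made up for illustration.

```python
from statistics import NormalDist

def design_analysis(true_effect, se, alpha=0.05):
    """Power, Type S rate, and exaggeration ratio for estimate ~ N(true_effect, se).

    Assumes true_effect > 0 and a two-sided test of the null effect = 0.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2) * se   # |estimate| must exceed this for p < alpha
    a = (crit - true_effect) / se          # standardized upper cutoff
    b = (-crit - true_effect) / se         # standardized lower cutoff
    p_right = 1 - z.cdf(a)                 # significant, correct sign
    p_wrong = z.cdf(b)                     # significant, wrong sign
    power = p_right + p_wrong
    type_s = p_wrong / power               # wrong sign, given significance
    # E[|estimate|; significant], using the truncated-normal mean formulas
    mean_abs_sig = true_effect * (p_right - p_wrong) + se * (z.pdf(a) + z.pdf(b))
    exaggeration = mean_abs_sig / power / true_effect
    return power, type_s, exaggeration

# Assumed inputs: hypothesized effect of 2 points, standard error of 8.1 points.
power, type_s, exaggeration = design_analysis(true_effect=2.0, se=8.1)
print(f"power = {power:.3f}, Type S = {type_s:.3f}, exaggeration = {exaggeration:.1f}")
```

Under these assumed inputs the power comes out near .06, the chance that a significant estimate has the wrong sign is a bit under one in four, and a significant estimate overstates the true effect by a factor of roughly 9 to 10.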

Scary indeed. Extreme, mind you, because everyone thinks they have 80% power, and usually they do have more than 6%. But yes, it is scary. Graphs like these should more often be part of the discussion on power and ethics, where the usual emphasis is on wasting resources and exposing human subjects to unnecessary risks through poorly thought-out designs. All valid concerns, but what about the distorting effects of low power on scientific reporting? The sampling distribution for effect sizes here, conditional on p < .05, has nothing whatsoever to do with the true distribution. Well, I guess ideally a meta-analysis (and access to file drawers) could eventually trace out the shape of the sampling distribution. But that will only happen after a lot of confusion and misdirection, and many, many tabloid headlines.

If we have no idea what effect to expect, so effect=0 is the null hypothesis used, what exactly can one learn from the p-value? According to that chart, if we observe a p<0.05 result with effect size of +/-20 we can't say much at all. Perhaps that is near the "true" effect size, or maybe the real one is near zero, or perhaps it's in the totally wrong direction.

If we do expect a certain effect size (or there is one of "practical significance") then shouldn't that be the null hypothesis?

Question,

I think you want to consider this graph as a thought experiment. Suppose the real effect is small, but we also have small N and high variability in the outcome. Then, if you do find some difference in means that is statistically significant, it IS an over-estimate of the true effect. I remember realizing this last year when I did a lecture on power calculations and thought: wow, that is an interesting perspective to think about, particularly in relation to these N=20 type experiments Andrew has been talking about so much.

jrc,

Right, but in real life we only have the observed effect. That chart suggests an observed effect of ~20 is consistent with very small true effect. Of course we are more likely to observe an effect of 20 if the true effect is near that value. However, if we do not know how much “filtering” has gone on then it is not really possible to distinguish between the two (small vs large effect) scenarios. It seems that under conditions where that chart is relevant to research practice, then the p-value calculated using effect=0 cannot be meaningful.

This reminds me of the following discussion:

“As I hope is clear from our example, NHST as a method depends upon a faith in the perfection of our fellow researchers that will easily fall victim to any mixture of incompetence or malice on their part. Unlike a descriptive statistic such as a mean, a p-value purports to tell us something that it cannot do without perfect information about the exact scientific methods used by every researcher in our community. An individual researcher will necessarily have this sort of perfect information about their own work, but a community will typically not. The imperfect information available to the community implies that reasoning about the community’s ideal standards for measuring evidence based on the ideal standards for a hypothetical individual will be systematically misleading.”

http://www.johnmyleswhite.com/notebook/2012/05/10/criticism-1-of-nhst-good-tools-for-individual-researchers-are-not-good-tools-for-research-communities/

Question:

You write, “in real life we only have the observed effect.” No! In real life we typically have a lot more information. That’s the point of my paper with Carlin, and indeed of the above example. The hypothesized effect size of 2 percentage points (which really is more of a hypothesized upper bound on the effect size) comes from substantive information on public opinion and voting, external to (in statistical terms, “prior to”) the observed effect from that particular study.

Andrew,

In that case, what justification is there for calculating this p-value in the first place? If you have a point prediction, why not test that (here: the hypothesis that mean(a)-mean(b)=2)? If the p-value is low we would say “our effect size was such and such, however this data does not appear consistent with the previous literature”. If the p-value is high we could say “our effect size was such and such, which is consistent with what we would expect from the literature”.
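
A rough sketch of the comparison being proposed: the same estimate tested against a null of 0 and against the point prediction of 2. The estimate (17) and standard error (8.1) are hypothetical numbers chosen for illustration, and `two_sided_p` is a made-up helper name.

```python
from statistics import NormalDist

def two_sided_p(estimate, se, null_value=0.0):
    """Two-sided p-value for a normal estimate against an arbitrary point null."""
    z = (estimate - null_value) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical study result (not taken from the post): estimate 17, standard error 8.1.
estimate, se = 17.0, 8.1
p_vs_zero = two_sided_p(estimate, se, null_value=0.0)  # the usual strawman null
p_vs_two = two_sided_p(estimate, se, null_value=2.0)   # null set to the literature's prediction
print(f"p against 0: {p_vs_zero:.3f}; p against 2: {p_vs_two:.3f}")
```

With these made-up numbers the estimate clears the .05 threshold against 0 but not against 2, so the data would be reported as not clearly inconsistent with the prior literature.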

Question:

You ask, “In that case, what justification is there for calculating this p-value in the first place?” I think there is no good justification! What I’m doing here is commenting on people who do compute these p-values, as in the paper discussed in the references. My whole point is that the study in question is pointless!

Ha. Ok then. As far as I can tell, it is _always_ misguided to filter results by rejecting a strawman. Your chart seems to suggest there is some other case for which that procedure can be useful. I am interested in examples of those.

Question:

When effect size is large and bias and variance are low, I think the p-value can be a useful summary, if interpreted carefully. When effect size is low and bias and variance are high, I think the p-value can be super misleading, which was the point of my graph.

The graph doesn’t really stand alone; it’s a response to all those “Psychological Science”-style studies we’ve been talking about here for the past few years.

“When effect size is large and bias and variance are low, I think the p-value can be a useful summary, if interpreted carefully.”

Yes, I agree. Michael Lew’s paper here convinced me of that: http://arxiv.org/abs/1311.0081

It is the combination of a strawman with the concept of “statistical significance” (i.e., the filtering step) that seems to be the problem, not the p-value per se.

Question:

Yes, this comes up a lot. I agree that the problem is with null hypothesis significance testing, not with p-values. If you do null hypothesis significance testing in other ways (for example, using Bayes factors), the same problems arise.

This really explains part of the scale-up problem, i.e., that some pilot study finds an effect with p<.05 on a smallish sample, then the intervention is implemented at other sites and it turns out to have a much smaller effect. Usually I attribute that to less investment in the intervention, less compliance with the original protocol, and inevitable compromises on the ground. But really, maybe it is just an artifact of overestimation.
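
The scale-up story can be sketched as a small simulation: noisy pilots are run, only the significant positive ones get rolled out, and the rollout is then measured more precisely. Every number here (true effect 2, pilot standard error 8.1, rollout standard error 2) is an assumption for illustration, as is the function name `simulate_scale_up`; the intervention itself is identical in pilot and rollout.

```python
import random
from statistics import NormalDist, mean

def simulate_scale_up(true_effect=2.0, se_pilot=8.1, se_rollout=2.0,
                      n_sims=200_000, alpha=0.05, seed=1):
    """Average pilot and rollout estimates among pilots that reached p < alpha."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2) * se_pilot
    pilots, rollouts = [], []
    for _ in range(n_sims):
        pilot = rng.gauss(true_effect, se_pilot)
        if pilot > crit:  # only significant, positive pilots are scaled up
            pilots.append(pilot)
            # Same true effect at rollout: no loss of fidelity or compliance.
            rollouts.append(rng.gauss(true_effect, se_rollout))
    return mean(pilots), mean(rollouts), len(pilots)

pilot_mean, rollout_mean, n_scaled = simulate_scale_up()
print(f"{n_scaled} pilots scaled up; mean pilot estimate {pilot_mean:.1f}, "
      f"mean rollout estimate {rollout_mean:.1f}")
```

In this sketch the apparent shrinkage from pilot to rollout comes entirely from the significance filter, since the true effect never changes.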

[…] a really nice figure from Andrew Gelman, illustrating the expected distribution of estimated effect sizes for a […]

[…] “This is what “power = .06” looks like. Get used to it” http://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/ … […]

[…] sort of collage of all the “power = .06” studies we talk about here: himmicanes and hurricanes, beauty and sex ratio, ovulation and just […]

[…] the “yeah, right” corner—and, if you’re lucky, you’ll understand the “power = .06” point and not get so excited about the noise you’ve been staring at. Maybe not, maybe […]

[…] me: – I wonder if you’d be ok to help me to understanding this Gelman’s graph. I struggle to understand what is the plotted distribution and the exact meaning of the red area. […]

[…] subject. I need to sit down at some point and figure out how her argument relates to the apparently-opposing argument of Andrew Gelman. I think that they’re asking subtly but importantly different questions, […]

[…] who deny the value of replications. They talk about science and they don’t always want to hear my statistical arguments, but then if you ask them why we “have no choice but to accept” claims about embodied cognition […]

[…] with small-N studies, and the resulting imprecision of the results. That one has been hashed out elsewhere, at length, so as I said, I’m not mentioning […]

Personality psychology is anything but precise and though they claim to care about effect sizes, they don’t actually care about effect sizes:

http://www.pbarrett.net/publications/Rethinking_reliability_and_validity_of_Psychological%20Measurements_Barrett_Prinsloo_2013.pdf

Though some do and they are trying to change the way in which the field approaches the problem.

Everyone else seems to be defending the intellectually indefensible.

[…] embarrassing non-replication (which shouldn’t’ve been embarrassing at all given the low low power and many researcher degrees of freedom of the original study which had gotten them on the wrong […]

[…] demonstrated this with an extreme case a couple years ago in a post entitled, “This is what “power = .06” looks like. Get used to it.” We were talking […]

[…] much noise. Going for statistical significance won’t work in those “power = .06” studies, cos if you do get lucky and find statistical significance, it tells you just about nothing anyway. […]

[…] found it on Andrew Gelman’s blog here. The red areas are based on an alpha (probability of rejecting a null hypothesis) of 0.05. A […]

[…] And, no, this ain’t happening. We don’t have 80% power. Heck, we’re lucky if we have 6% power. […]

[…] the familiar “power = .06” disaster: take an overestimated effect size from a previous noisy study, then design a new study under these […]

[…] of the original effect sizes on average (presumably an illustration of Andrew Gelman’s “exaggeration ratio“). More interesting to me was that prediction markets and surveys of social scientists did an […]

[…] use and interpretation of null hypothesis tests. Especially in the context of low-powered studies, where type M and type S errors are likely. Personally, I think the problems in how we use and teach frequentist statistics are more to do […]

John Tukey used to set alpha to 0.10 because if he was wrong to make a decision he would still get the direction correct half of the time by chance. Was he wrong? The graph shows that Type S error risk is not half of a two-tailed alpha. Should we be making directional decisions based on a lifelong risk of Type S errors instead of p values?

I get it now. Half of alpha is the worst case scenario.
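
A quick check of the "half of alpha" point, as a sketch under a normal model with an assumed standard error of 8.1 (the function name is made up for illustration). It computes the unconditional probability of getting a significant result with the wrong sign, for several hypothetical true effects.

```python
from statistics import NormalDist

def prob_significant_wrong_sign(true_effect, se, alpha=0.05):
    """P(p < alpha AND estimate has the wrong sign), assuming true_effect >= 0."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2) * se
    return z.cdf((-crit - true_effect) / se)

se = 8.1  # assumed standard error, for illustration only
effects = (0.0, 0.5, 2.0, 8.0, 16.0)
wrong_sign_probs = [prob_significant_wrong_sign(d, se) for d in effects]
for d, p in zip(effects, wrong_sign_probs):
    print(f"true effect {d:5.1f}: P(significant, wrong sign) = {p:.4f}")
```

The probability equals alpha/2 only in the limit of a zero true effect and falls as the effect grows. Note that this is a different quantity from the Type S rate conditional on significance, which approaches one half as the true effect goes to zero.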