Everyone else seems to be defending the intellectually indefensible.

]]>Yes, this comes up a lot. I agree that the problem is with null hypothesis significance testing, not with p-values. If you do null hypothesis significance testing in other ways (for example, using Bayes factors), the same problems arise.

]]>Yes, I agree. Michael Lew’s paper here convinced me of that: http://arxiv.org/abs/1311.0081

It is the combination of a strawman null hypothesis with the concept of “statistical significance” (i.e., the filtering step) that seems to be the problem, not the p-value per se.

]]>When effect size is large and bias and variance are low, I think the p-value can be a useful summary, if interpreted carefully. When effect size is low and bias and variance are high, I think the p-value can be super misleading, which was the point of my graph.

The graph doesn’t really stand alone; it’s a response to all those “Psychological Science”-style studies we’ve been talking about here for the past few years.

]]>You ask, “In that case, what justification is there for calculating this p-value in the first place?” I think there is no good justification! What I’m doing here is commenting on people who *do* compute these p-values, as in the paper discussed in the references. My whole point is that the study in question is pointless!

In that case, what justification is there for calculating this p-value in the first place? If you have a point prediction, why not test that (here: the hypothesis that mean(a)-mean(b)=2)? If the p-value is low we would say “our effect size was such and such; however, these data do not appear consistent with the previous literature”. If the p-value is high we could say “our effect size was such and such, which is consistent with what we would expect from the literature”.
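To make the suggestion concrete, here is a rough sketch of a two-sided z-test against an arbitrary null value rather than zero. The observed difference of 20 echoes the "~20" mentioned elsewhere in this thread; the standard error of 8 is purely an assumption for illustration, as are the function and parameter names.

```python
from statistics import NormalDist


def two_sided_p(estimate, std_error, null_value=0.0):
    """Two-sided p-value for a normal estimate against a point null.

    null_value=0.0 gives the usual "no effect" test; any other value
    tests a point prediction, such as mean(a) - mean(b) = 2.
    """
    z = (estimate - null_value) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: observed difference 20, assumed standard error 8.
p_against_zero = two_sided_p(20.0, 8.0, null_value=0.0)
p_against_two = two_sided_p(20.0, 8.0, null_value=2.0)
print(f"p vs. null of 0: {p_against_zero:.3f}")
print(f"p vs. null of 2: {p_against_two:.3f}")
```

Under these assumed numbers the estimate sits slightly closer to 2 than to 0, so the p-value against the point prediction is somewhat larger than the one against zero. Whether either number is worth reporting is exactly what is being debated here.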

]]>You write, “in real life we only have the observed effect.” No! In real life we typically have a lot more information. That’s the point of my paper with Carlin, and indeed of the above example. The hypothesized effect size of 2 percentage points (which really is more of a hypothesized upper bound on the effect size) comes from substantive information on public opinion and voting, external to (in statistical terms, “prior to”) the observed effect from that particular study.

]]>Right, but in real life we only have the observed effect. That chart suggests an observed effect of ~20 is consistent with a very small true effect. Of course, we are more likely to observe an effect of 20 if the true effect is near that value. However, if we do not know how much “filtering” has gone on, it is not really possible to distinguish between the two (small vs. large effect) scenarios. It seems that under conditions where that chart is relevant to research practice, the p-value calculated using effect=0 cannot be meaningful.

This reminds me of the following discussion:

“As I hope is clear from our example, NHST as a method depends upon a faith in the perfection of our fellow researchers that will easily fall victim to any mixture of incompetence or malice on their part. Unlike a descriptive statistic such as a mean, a p-value purports to tell us something that it cannot do without perfect information about the exact scientific methods used by every researcher in our community. An individual researcher will necessarily have this sort of perfect information about their own work, but a community will typically not. The imperfect information available to the community implies that reasoning about the community’s ideal standards for measuring evidence based on the ideal standards for a hypothetical individual will be systematically misleading.”

http://www.johnmyleswhite.com/notebook/2012/05/10/criticism-1-of-nhst-good-tools-for-individual-researchers-are-not-good-tools-for-research-communities/

I think you want to consider this graph as a thought experiment. Suppose the real effect is small, but we also have small N and high variability in the outcome. Then, if you do find some difference in means that is statistically significant, it IS an over-estimate (in magnitude) of the true effect. I remember realizing this last year when I did a lecture on power calculations and thought: wow, that is an interesting perspective to think about, particularly in relation to these N=20-type experiments Andrew has been talking about so much.
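The thought experiment can be run as a small simulation. This is only a sketch: the true effect of 2 comes from the hypothesized effect size discussed in this thread, while the standard error of 8, the 5% two-sided threshold, and all function and variable names are assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean


def simulate_significance_filter(true_effect=2.0, std_error=8.0,
                                 n_sims=100_000, seed=0):
    """Draw noisy estimates of a small true effect and keep only the
    ones that reach two-sided significance at the 5% level."""
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(0.975) * std_error  # about 1.96 * SE
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > threshold:
            significant.append(estimate)
    # Share of simulated studies that pass the filter (the power).
    power = len(significant) / n_sims
    # Average magnitude of the surviving estimates relative to the truth.
    exaggeration = fmean([abs(e) for e in significant]) / true_effect
    # Share of surviving estimates that point in the wrong direction.
    wrong_sign = sum(e < 0 for e in significant) / len(significant)
    return power, exaggeration, wrong_sign


power, exaggeration, wrong_sign = simulate_significance_filter()
print(f"power: {power:.3f}")
print(f"average exaggeration among significant results: {exaggeration:.1f}x")
print(f"share of significant results with the wrong sign: {wrong_sign:.2f}")
```

With these assumed numbers, every estimate that survives the filter must exceed about 1.96 × 8 ≈ 15.7 in magnitude, so it overstates a true effect of 2 by a factor of at least roughly 7.8, and some of the survivors have the wrong sign.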

]]>If we do expect a certain effect size (or there is one of "practical significance") then shouldn't that be the null hypothesis?

]]>