Chris, translate Andrew’s statement from p-value talk into plain English, and that may help you construct examples. A low p-value means the result is rare or surprising under the null. For that not to be strong evidence against the null, it should be true that the result is also rare under reasonable alternatives. Construct an example with that property.

An example I’m partial to can be found in this review article, in the discussion of Fig 6:

Probabilistic Record Linkage in Astronomy: Directional Cross-Identification and Beyond

http://www.annualreviews.org/doi/10.1146/annurev-statistics-010814-020231

The problem is one of coincidence assessment, in this case, in directional statistics: You measure, with uncertainty, the directions to two objects on the sky, and the point estimates are near each other. Is this evidence for them being associated (sharing a common true direction), or is it merely a coincidence? Suppose the point estimates are close enough to have a small p-value under the null of a uniform distribution on the sphere, i.e., they are surprisingly close in great-circle distance. If the measurements are very precise, then under the alternative hypothesis that they are associated, it’s not surprising that they are close (indeed, it’s to be expected). But if the measurements have large uncertainty, then even under the alternative it would be surprising to have the point estimates be close to each other. As a result, the Bayes factor can only weakly favor association, even when the p-value under the null is quite small.
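The contrast can be seen in a toy one-dimensional analog of the directional problem (directions on a circle rather than the sphere from the paper, with all numbers invented for illustration):

```python
import math

# Toy 1D analog of the coincidence-assessment problem: two directions are
# measured on a 360-degree circle with Gaussian noise of known sigma.
# H0: the true directions are independent and uniform on the circle, so the
#     (wrapped) separation d is uniform on (-180, 180].
# H1: the objects share a common true direction, so d ~ Normal(0, 2*sigma^2).

C = 360.0  # circumference in degrees

def p_value_null(d_obs):
    """P(|separation| <= d_obs) under the uniform null."""
    return 2.0 * d_obs / C

def bayes_factor(d_obs, sigma):
    """Likelihood ratio p(d_obs | H1) / p(d_obs | H0)."""
    var = 2.0 * sigma**2
    density_h1 = math.exp(-d_obs**2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    density_h0 = 1.0 / C
    return density_h1 / density_h0

d_obs = 0.05  # observed separation in degrees
print(p_value_null(d_obs))        # ~0.00028, regardless of sigma
print(bayes_factor(d_obs, 0.1))   # precise measurements: BF strongly favors association
print(bayes_factor(d_obs, 30.0))  # noisy measurements: same small p, much weaker BF
```

The p-value depends only on the observed separation, while the Bayes factor also asks how likely that separation would be under the alternative; with large measurement uncertainty the two answers come apart.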

]]>Anon, yes that is an unfortunate misinterpretation of p-values that is sometimes made, although I am sure not by the people who wrote the Higgs discovery papers.

]]>“Anon, well I agree that the headline ‘chance ruled out with 5-sigma confidence!’ is not strictly correct, as the 5 sigma threshold does not account for garden of forking paths or systematics below a level of about 1 sigma.”

While these issues also exist, that is not what I was talking about above. Instead, it was that there are always multiple explanations for a given observation, and the scientist’s job is to winnow down the possibilities. Ruling out “chance” is only one minor part of this but gets all the attention. This seems to be because 1) it is the easiest thing to rule out, and 2) people are confused/wishful about what the p-value means (i.e., they think it somehow corresponds to the probability their theory is correct).

]]>Anon, well I agree that the headline “chance ruled out with 5-sigma confidence!” is not strictly correct, as the 5 sigma threshold does not account for garden of forking paths or systematics below a level of about 1 sigma. I think the main point is that it is a threshold which has proven reliable, in a vast number of examples, for detecting discoveries in particle physics. As we agreed earlier, it does require a confirming experiment as well, since sometimes there are systematics larger than the 1 sigma level that have incorrectly gone unaccounted for.

]]>Anon:

Pages 61-62 of this classic discussion by I. J. Good are relevant to this point.

]]>It isn’t wrong to rule out chance, but there is no rational reason to focus on ruling that out versus any other explanation. In fact, it is the least interesting thing to rule out.

Instead what we see is headlines about “chance ruled out with 5-sigma confidence!”. This is a sign of misunderstanding and that the people involved are likely to come to incorrect conclusions.

]]>I agree testing for other explanations is important but I don’t think that implies it is also wrong to test for chance.

]]>Right, rule out chance, rule out systematic detector errors, rule out problems with the filtering algos, rule out other proposed particles. These are all needed. Why should ruling out chance have a privileged position over any other explanation besides the Higgs? It makes no sense unless you misunderstand what the p-value is telling you.

]]>Anoneuoid, I agree it was important that the masses were consistent within the errors. I think that, in addition, a significant deviation from the null was needed.

]]>Sure, I agree 100% on the need for independent replication. These huge physics experiments aren’t even really independent enough for my taste, but practical issues really do get in the way there. But imagine if they both saw “5-sigma” signals, but at different masses… clearly it is not the statistically significant deviation from the null model that is crucial.

]]>“So for example in the Higgs discovery, it was crucial that both the Atlas and CMS experiment had seen a 5 sigma detection.”

The only way it could be crucial is if every other possible explanation for any deviation from the null model had been ruled out. If that is not the case, the p-value has surely been misinterpreted.

Also please remember the problem most people have is not really with the p-value. It is with choosing a null hypothesis that nobody believes.

]]>I didn’t know Webern’s three pieces for cello and piano–listened to them twice just now, along with Lynn Harrell’s introduction (“Concentrate ferociously, ’cause it’s like a black hole”). Well worth the listening; I look forward to many more!

]]>“I can’t think of a particularly good alternative to p-values”

Bayesian probability distributions over parameters in a well-developed mechanistic model for the result.

Suppose you have some coupled climate/ecological model for desertification or forestation or some such process. Which would be more convincing to you that the process is actually occurring at location X:

1) At location X, changes since last year are 5 sigma away from changes seen at 100 random locations on the face of the earth, but you have no mechanistic model?

2) When fitting your mechanistic model to 15 years of historical measurements in the vicinity of point X, a parameter that implies desertification is the asymptotic result whenever it is greater than 1 has a 95% highest posterior density region of 1.13 to 2.27?
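For concreteness, option 2 could be sketched with a toy grid-approximation posterior; the model, the synthetic 15-year record, and the parameter here are entirely made up for illustration:

```python
import numpy as np

# Hypothetical sketch: grid approximation to the posterior of one model
# parameter theta, where theta > 1 implies desertification in the long run.
# Invented sampling model: 15 yearly measurements y_t ~ Normal(theta, 0.8).

rng = np.random.default_rng(0)
y = rng.normal(1.7, 0.8, size=15)  # synthetic 15-year record

theta = np.linspace(-1, 5, 2001)   # grid over the parameter
log_prior = np.zeros_like(theta)   # flat prior on the grid
log_lik = -0.5 * ((y[:, None] - theta[None, :]) / 0.8) ** 2
log_post = log_prior + log_lik.sum(axis=0)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# 95% highest-posterior-density set: keep the most probable grid points
# until they account for 95% of the posterior mass.
order = np.argsort(post)[::-1]
keep = order[np.cumsum(post[order]) <= 0.95]
hpd = (theta[keep].min(), theta[keep].max())
print("95% HPD region:", hpd)
print("P(theta > 1 | data):", post[theta > 1].sum())
```

The output is a full probability statement about the scientifically meaningful quantity (is theta above 1?), rather than a tail probability under a null nobody believes.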

]]>jrc and Anoneuoid, I think, at least in physics and other similar sciences, one does need some kind of convention for what is a discovery. So I can’t think of a particularly good alternative to p-values or the often-used equivalent of seeing how many sigmas one is from the null value. But one other convention which is also generally followed is that the discovery has to be made by more than one experiment. So for example in the Higgs discovery, it was crucial that both the Atlas and CMS experiment had seen a 5 sigma detection.

]]>“tl;dr: 5 sigma thresholds make it more difficult (read: expensive) to detect differences between groups. This, though, does not solve the most important issues in scientific inference, and the false promise that it does is itself part of the problem. The statistical null is a stupid object, and the rejection of it via p-values tells us very little.”

Yes, LIGO is a great example. The p-value they report does not help at all in distinguishing between some kind of atmospheric effect, something from the sun, national power grid fluctuations, etc., and a gravitational wave.

They claim they ruled out all other plausible reasons for such a signal (and an insane amount of effort was put towards doing this), and that may be true. But let’s not give undue credit to the “rejecting chance” step in the process.

]]>That doesn’t solve the issue of when the statistical null hypothesis might be false (a nonzero effect) even if the theoretical claim predicting the nonzero effect is not true.

Suppose you say that under theory A, light shouldn’t be affected by gravity. Then you observe that light behaves differently around large gravitational fields than it does in their absence (the two groups of observations – light moving with and without a large gravity source nearby – were unlikely to be generated from the same data-generating process). What has this experiment “proved” (even in the sense of probability statements)? That General Relativity is (probably) right? That the previous models missed something important? That gravity-inducing objects also induce changes in the behavior of light waves/particles? That the methods we use to measure the movement of light through space are affected by gravity?

How about an older example: Suppose I believe that the planets are on a different layer of revolving spheres around the earth than the stars. I predict that, if they were on the same sphere, Jupiter would be in location X on January 1st of the year 1186. Under my preferred model, I predict it will be in location Y. I then confirm it is in location Y, and can rule out that it is in location X. Have I proved my theory of celestial-sphere nesting?

I’m sticking with physics-y examples because they are useful ways of pointing out that “rejecting the null” does not ever really tell us much about the world, even under “ideal” conditions. Sure, it can help us realize certain features of the world are not consistent with certain models of it. But the usefulness of the exercise comes from the quality of the research design, the strength of the novel theoretical prediction made by one (theoretical, not statistical) model relative to another potential (theoretical) model, and the relationship between the predictions themselves and the tests of them in the world. Even in physics these steps often fail – in social science the failures are even more obvious: samples and treatments are convenience-based; theoretical models don’t make precise predictions that can be tested against each other; and we can almost never measure the actual thing we are interested in (intrinsic motivations, feelings, attitudes, preferences, behavioral trade-offs, thoughts).

So in general – even if researchers (in physical or social sciences) do everything “perfectly” according to statistical methods, a low p-value rarely tells us much about the world. It often just tells us a lot about how one researcher was able to interpret their results according to their preferred (and pre-approved by Science) metaphysics of existence.

Now, I understand that this doesn’t relate to rejection rates in purely idealized experimental settings. And you are right in general that p-values can, under certain conditions that are unrelated to sample size or variance (given some min N and max V), give us appropriate rejection rates.

But that means very little epistemologically. And when you say things like “the null hypothesis should be rejected when true only in about 1 in 3.3 million cases,” then I think that statement is either intentionally or unintentionally conflating rejecting the statistical null hypothesis (some two groups of observations come from the same DGP) with some theoretically-motivated alternative hypothesis (something that could be “true” in the sense of scientifically meaningful and interpretable). I mean, sure, we could say it is “true” that the alternative statistical hypothesis may be favored over the statistical null, but then all we are saying is “there is probably a difference between these two groups.” We don’t get to say anything about why unless we can show that under no other theoretical model this thing could happen. But how many explanations exist for, say, the black-white earnings gap; or the differences in educational attainment across countries; or the cause of schizophrenia; or the nature of the transmission of knowledge across people? And how many of them are actually identifiable from each other in terms of concrete predictions about the world? I mean, you could literally give the same sets of results from some social science study and have 10 groups of researchers give 10 interpretations, each within their own metaphysical frameworks and each one equally consistent with the data.

tl;dr: 5 sigma thresholds make it more difficult (read: expensive) to detect differences between groups. This, though, does not solve the most important issues in scientific inference, and the false promise that it does is itself part of the problem. The statistical null is a stupid object, and the rejection of it via p-values tells us very little.

]]>Ha, I figured you’d know Schnittke’s sonata, but maybe someone didn’t know and they might be inspired to check it out since we talked about it. Webern is my favourite composer, right there with Schnittke, and he was a cellist too. You know the three pieces for cello and piano, opus 11?

Also relating to cello and Webern… I used to work in a library and they had a cat-alogue of Webern’s works and for some reason they’d included cello in the instrumentation of Webern’s concerto (op. 24). Being a music nerd I obviously noticed that as a mistake and notified the persons responsible for the catalogue. But even though it is in reality a cello-less piece, it’s still a wonderful piece of music and I’d recommend the 2nd movement to anyone!

]]>Psyoskeptic, I am not quite following your point. I was referring to a 5 sigma cutoff where, without forking paths, the null hypothesis should be rejected when true only in about 1 in 3.3 million cases.

]]>Sad that it is this way. I hope it will change eventually — the sooner the better.

]]>Chris, that’s correct, but consider that 1 in 20 times (assuming no forking paths at all) most people would reject the null when it’s true. The p-value is still uniformly distributed in that <.05 range. It’s equally probable that you’ll obtain any individual p-value in the range <.05 (assuming continuous data). And, therefore, very low values don’t have any more meaning than ones close to .05. Only the cutoff mattered.
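The uniformity claim is easy to check by simulation; here is a minimal sketch using a two-sample z-test with known unit variance (sample size and number of replications are arbitrary):

```python
import math
import numpy as np

# Under a true null the p-value is uniform on (0, 1), so conditional on
# p < .05 it is uniform on (0, .05). Two-sample z-test (known sigma = 1)
# on groups drawn from the same normal distribution.

rng = np.random.default_rng(1)
n, reps = 30, 20000
pvals = np.empty(reps)
for i in range(reps):
    a, b = rng.normal(size=n), rng.normal(size=n)
    z = (a.mean() - b.mean()) / math.sqrt(2.0 / n)
    pvals[i] = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value

print("fraction with p < .05:", (pvals < 0.05).mean())          # ~0.05
small = pvals[pvals < 0.05]
print("fraction of those below .025:", (small < 0.025).mean())  # ~0.5, i.e. uniform
```

Roughly half of the significant p-values fall below .025 and half above, consistent with the conditional distribution being flat.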

]]>Sure, I agree with you. What I’m saying is that embracing uncertainty will be a career killer. What I now do is to embrace uncertainty but don’t let the reader realise that. I rely on the power of pragmatics to lead the reader into thinking that I delivered closure. Oops, I just revealed my secret. Luckily, not many psycholinguists read this blog.

]]>Martha, thanks for letting me know. I think that p-values are OK in subjects like physics where you have large sample sizes and high signal to noise. But they don’t work so well in subjects like sociology where the sample sizes are usually small and you usually have low signal to noise. In areas like particle physics they have a 5 sigma threshold for discovery, and so I think that generally gets around the problems Andrew is concerned about. But I understand that Andrew is interested more in the social sciences case, and I also agree Bayesian methods are very useful in physics, particularly in cases when you don’t have large sample sizes and high signal to noise.

]]>Chris,

The reply I intended to your comment here ended up above (below my original comment). Apologies for the double mix-up.

]]>I guess I messed up in what I said. What I should have said was: With larger sample sizes, you will be rejecting the null hypothesis with smaller effect size estimates than you would get with smaller sample sizes. (Related to “larger sample sizes give higher power,” and to the “winner’s curse” phenomenon that falsely rejecting the null tends to give inflated effect size estimates.)
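A quick simulation of that point; the true effect (0.1 sd) and the normal sampling model are invented for illustration:

```python
import math
import numpy as np

# "Winner's curse" sketch: when the true effect is small, the studies that
# happen to reject the null report inflated effect-size estimates, and the
# inflation is worse at smaller sample sizes.

rng = np.random.default_rng(2)
true_effect, reps = 0.1, 20000
mean_sig = {}

for n in (20, 200, 2000):
    se = math.sqrt(2.0 / n)                       # s.e. of a two-group mean difference
    est = rng.normal(true_effect, se, size=reps)  # effect estimates across many studies
    significant = est[est / se > 1.96]            # the studies that reject the null
    mean_sig[n] = significant.mean()              # shrinks toward 0.1 as n grows
    print(n, "mean estimate among significant results:", round(mean_sig[n], 3))
```

At n = 20 the significant studies overestimate the true effect several-fold; at n = 2000 the significant estimates sit close to the true 0.1.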

]]>Andrew, thanks, I will take a look at that paper.

]]>Chris: With a large enough sample size, you can get tiny p-values when the null hypothesis is true. (Try a simulation yourself if you don’t believe me.)

]]>Closure, shmosure. Closure shuts out uncertainty, so has no place in most scientific work. Science is inherently open-ended.

]]>A few years back I would’ve said that if you read, say, 10 or 20 papers, probably at least one of them offers up a low p-value that provides no evidence against the null.

Then after a couple years of reading the blog, I would’ve said that if you read, say, 10 or 20 papers, probably most of them offer up a low p-value that provides no evidence against the null.

But now after like 4 or 5 years reading the blog, I’d say that actually most studies convince me that the null is not true. But only because the null is never true and we don’t even need any evidence – almost everything people study has lots of effects, and these effects vary in relative magnitude both across people and within people over time.

…that said, I’ve also learned about how easy it is to get a low p-value even in simulated environments with absolutely no treatment effect when you have a) multiple noisy measures of a set of outcomes; b) researcher freedom to explore specifications; c) high powered incentives for researchers to find a low p-value; and d) a shocking and perhaps shameful ignorance of what statistics is and isn’t (or can and can’t do) among empirical social science researchers. So you could take Andrew’s statement that way too, which is probably closer to how he meant it in this context.
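Points (a) and (b) together are easy to demonstrate in simulation; a small sketch (sample size, number of outcomes, and test all invented) where the researcher reports whichever of several null comparisons comes out significant:

```python
import math
import numpy as np

# With k noisy outcome measures and the freedom to report whichever
# comparison is "significant," the chance of at least one p < .05 grows
# quickly even with zero true effect on every outcome.

rng = np.random.default_rng(3)
n, k, reps = 30, 10, 5000
hits = 0
for _ in range(reps):
    pvals = []
    for _ in range(k):
        # treatment and control drawn from the same distribution (no effect)
        a, b = rng.normal(size=n), rng.normal(size=n)
        z = (a.mean() - b.mean()) / math.sqrt(2.0 / n)
        pvals.append(math.erfc(abs(z) / math.sqrt(2.0)))
    if min(pvals) < 0.05:
        hits += 1
print("share of null studies with at least one p < .05:", hits / reps)
# expected around 1 - 0.95**10, i.e. about 0.40
```

With ten independent outcomes, roughly 40% of purely null studies yield a reportable "significant" result before any specification search even begins.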

]]>Chris:

Yes, but if effect size is small and estimates are noisy, then “p less than .05” provides very little information even for a preregistered study with no forking paths. Carlin and I discuss this in our 2014 paper. That said, forking paths play an important role here in facilitating the production and publication of such noisy results.

]]>Chris:

You could start with the collected works of Satoshi Kanazawa and Brian Wansink, and at least one paper by Daryl Bem, and some back issues of PPNAS.

]]>Thank you, A carrot pancake (and Andrew too). I cracked up when I saw the photo with this post; it works.

Thank you for bringing up the Schnittke cello sonata. It is beyond me technically, but I love the Gutman/Lobanov recording.

]]>Sorry, I’m dictating this text, and the software wrote “embrace the uncertainty” when I said “embrace uncertainty.”

]]>And of course thanks to Andrew for a good post. I got a bit derailed. “Ribet”, says the frog-like (froggy) creature.
