Uh oh, this is getting kinda embarrassing.

The Garden of Forking Paths paper, by Eric Loken and myself, just appeared in American Scientist. Here’s our manuscript version (“The garden of forking paths: Why multiple comparisons can be a problem, even when there is no ‘fishing expedition’ or ‘p-hacking’ and the research hypothesis was posited ahead of time”), and here’s the final, trimmed and edited version (“The Statistical Crisis in Science”) that came out in the magazine.

Russ Lyons read the published version and noticed the following sentence, actually the second sentence of the article:

Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation.

How horrible! Russ correctly noted that the above statement is completely wrong, on two counts:

1. To the extent the p-value measures “confidence” at all, it would be confidence in the null hypothesis, not confidence in the data.

2. In any case, the p-value is not not not not not “the probability that a perceived result is actually the result of random variation.” The p-value is the probability of seeing something at least as extreme as the data, if the model (in statistics jargon, the “null hypothesis”) were true.
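The correct definition is easy to check by simulation. Here is a minimal sketch in Python (the sample size and observed mean are illustrative numbers of my own, not from the article): under the null hypothesis the data are pure random variation, and the p-value is the probability of a summary at least as extreme as the one observed.

```python
import math

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical study: n = 25 observations, null hypothesis says each is
# a standard-normal draw (mean 0). Suppose the observed sample mean is 0.4.
n = 25
observed_mean = 0.4  # illustrative value, not from the article

# p-value: the probability, *under the null*, of a sample mean at least
# as extreme (two-sided) as the one actually observed.
sims = rng.standard_normal((200_000, n)).mean(axis=1)
p_value = np.mean(np.abs(sims) >= observed_mean)

# Exact normal tail area for comparison: 2 * (1 - Phi(0.4 * sqrt(25)))
exact = math.erfc(observed_mean * math.sqrt(n) / math.sqrt(2))
```

Both numbers come out around 0.046: the chance of seeing a mean this far from zero if the null model were true, which is not the same thing as "the probability that the result is due to random variation."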

**How did this happen?**

The editors at American Scientist liked our manuscript, but it was too long, and parts of it needed explaining for a nontechnical audience. So they cleaned up our article and added bits here and there. This is standard practice at magazines. It’s not just Raymond Carver and Gordon Lish.

Then they sent us the revised version and asked us to take a look. They didn’t give us much time. That too is standard with magazines. They have production schedules.

We went through the revised manuscript but not carefully enough. *Really* not carefully enough, given that we missed a glaring mistake—*two* glaring mistakes—in the very first paragraph of the article.

This is ultimately not the fault of the editors. The paper is our responsibility and it’s our fault for not checking the paper line by line. If it was worth writing and worth publishing, it was worth checking.

**P.S.** Russ also points out that the examples in our paper all are pretty silly and not of great practical importance, and he wouldn’t want readers of our article to get the impression that “the garden of forking paths” is only an issue in silly studies.

That’s a good point. The problems of nonreplication, etc., affect all sorts of science involving human variation. For example, there is a lot of controversy about something called “stereotype threat,” a phenomenon that is important if real. For another example, these problems have arisen in studies of early childhood intervention and the effects of air pollution. I’ve mentioned all these examples in talks I’ve given on this general subject; they just didn’t happen to make it into this particular paper. I agree that our paper would’ve been stronger had we mentioned some of these unquestionably important examples.

But Stereotype Threat is one of the sacred cows of contemporary social science. It’s one of the half dozen or so most popular ways for Nice People to reconcile in their heads why most data they observe in the world don’t look like their dogmas demand.

So, why get people very, very angry at you for calling it into question in a paper about theory?

Steve:

Nobody got angry at me for calling stereotype threat into question. I don’t really know anything about stereotype threat, I’ve just mentioned it on the blog a couple times.

Oh! But now you just upset Steve because the world didn’t get angry at you like *his* dogma demands! Maybe he was expecting you to get fired for it too.

Rahul:

It’s been nearly 20 years since I got fired because of someone’s dogma.

What exactly happened?

Buncha doctrinaire frequentists in Berkeley’s stats department “mugged” him — I’m assuming, up to and including the withholding of tenure.

“It couldn’t happen to me, because I’m not a Bad Person.”

“If it happened once it must be happening always.”

Are you happy with how the rest of the article turned out?

“….of seeing something at least as extreme as the data, if the model were true.”

Naive question: Model were true or model were *false*? Isn’t it “…if the null hypothesis were true”?

Rahul, I had the same confusion. In any case, models are simplifications, and so, strictly speaking, always false.* Hypotheses are claims about unknown and generally unknowable parameters and their relationships.

Maynard Smith’s point:

“I think it would be a mistake, however, to stick too rigidly to the criterion of falsifiability when judging theories in population biology. For example, Volterra’s equations for the dynamics of predator prey species are hardly falsifiable. In a sense they are manifestly false, since they make no allowance for age structure, for spatial distribution, or for the many other necessary features of real situations. Their merit is to show us that even a simple model for such an interaction leads to a sustained oscillation, a conclusion that would be hard to reach from purely verbal reasoning.”

(Maynard Smith, Evolution and the Theory of Games, 1982, p9.)

Yes, p-values are clearly useless for nearly all scientific purposes using that definition. Either the model is not expected to be literally true, or it is accepted by pretty much everyone to be false (the most common strawman-type NHST). Attempts to explain why they sometimes seem to agree with intuition have led to many myths and much confusion. Apparently p-value + sample size indexes a likelihood:

http://arxiv.org/abs/1311.0081

So p-values can be OK if interpreted correctly (as shorthand for a likelihood function), but significance testing is not.
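To illustrate the idea in that comment with a rough sketch (my own simplification, not the construction in the linked paper): under a normal approximation, a two-sided p-value pins down |z|, and together with the sample size that pins down the point estimate in data units, and hence approximately the whole normal likelihood for the mean.

```python
import math

def z_from_p(p_two_sided):
    """Recover |z| from a two-sided p-value under a normal approximation.

    Inverts p = erfc(|z| / sqrt(2)) by bisection, using only the stdlib.
    """
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if math.erfc(mid / math.sqrt(2)) > p_two_sided:
            lo = mid  # tail area still too big: |z| must be larger
        else:
            hi = mid
    return (lo + hi) / 2

# With n unit-variance observations, the standard error of the mean is
# 1/sqrt(n), so p and n together recover the point estimate.
p, n = 0.05, 100
z = z_from_p(p)               # roughly 1.96
estimate = z / math.sqrt(n)   # mean estimate in data units, roughly 0.196
```

The point is only that (p, n) carries the same information as the estimate and its standard error, which is the sense in which a p-value plus sample size can index a likelihood.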

Rahul,

I think Andrew means “model under the null were true”

Thank you, Rahul. Andrew, would you mind please editing your post accordingly, if your phrase “if the model were true” is indeed a slip? It’s definitely confusing and not helpful (sorry to say), given your intention to correctly explain the p-value concept.

Done.

Thanks for having the courage to admit this (pretty minor mistake) so directly. People are rarely willing to admit mistakes and I think science would progress a lot faster if they did — really inspiring to see a high-profile and successful researcher who is willing to do so.

The early childhood intervention article is an example, but not really in the way you imply in that post, since, as I pointed out in the comments, they actually report the same results as in the preprint, but just emphasize the more conservative results in the published article. Yes, that research is an example of how you can have two different ways to operationalize the dependent variable: “all jobs” (smaller effect) versus “full-time non-temporary work” (larger effect). As I mentioned, I’d suspect that given that 22% of those in treatment are still in school, they may be more likely to have temporary part-time jobs than the 4% in the comparison group who were still in school. However, I think it is reasonable and cautious of the authors to focus on the smaller effect size result, not to mention that their reviewers may have required that. To me the still-in-school outcome, even if it was never considered as part of the original study design, is the most interesting result. Twenty-two years of age is way too early to be characterizing adult earnings for students who attend college.

Elin:

The issue I was raising with the early childhood intervention study was not that they changed their decision of which outcome measure to focus on, but rather the statistical-significance-filter and play-the-winner biases that arose in their estimates and in their statements of statistical significance.

Publish an erratum. I know that American Scientist will do that, Jim Berger and I did it in a paper we published many years ago.

Errata are about the best you can do as a CYA option. But of course most interested readers will never see them…

The core problem is the very superficial nature of many manuscript reviews — as well as superficial knowledge of the historical statistical literature by many statisticians.

You seem to have company in creating confusion. This is the lead for David Leonhardt’s piece in the New York Times:

“Now that The Upshot puts odds of a Republican takeover of the Senate at 74 percent, we realize that many people will assume we’re predicting a Republican victory. We’re not.”

http://www.nytimes.com/2014/10/16/upshot/how-not-to-be-fooled-by-odds.html?_r=0&abt=0002&abg=0

Worse, it’s evidently not an editing mistake. We have a whole column here clarifying that a 74% probability is not a prediction, because somehow a prediction is supposed to mean “something will almost certainly happen.”

Huh? If the National Weather Service forecast is for a 74% chance of rain, then haven’t they predicted a 74% chance of rain happening? And if The Upshot is stating the odds the Republicans will take over the Senate are 74%, isn’t that a similar prediction that something has a 74% chance of happening?

Since when is a probabilistic forecast not a prediction?

Zbicyclist:

I’d prefer if they were to say “70%” rather than “74%.” The National Weather Service does not, to my knowledge, say there’s a 74% chance of rain because such a number would be meaninglessly precise. We discussed this point a couple years ago in the context of Nate Silver’s statement that Obama had a “65.7%” chance of winning.

Saying “74%” is better than saying “73.8%,” but I think it would be even better to follow the lead of the National Weather Service and say “70%.”

– “To the extent the p-value measures “confidence” at all, it would be confidence in the null hypothesis, not confidence in the data.”

On the other hand, you often say that a p-value is a measure of sample size. Then, couldn’t a measure of sample size say something about “confidence in the data”, in a way? Of course ‘confidence’ is a loaded term in statistics, due to the idea of level of confidence. But in non-technical writings, words can be stretched.

– “not not not ‘the probability that a perceived result is actually the result of random variation.’”

Again, “perceived result” can be interpreted as the deterministic trend of the fitted model; and “random variation” can be understood as random variation under the null model. Ok that’s a bit more of a stretch because in the end the probability is only about the tail area of the distribution of the test statistic, but at least in this way not all the concepts are tormented.

A quick fix would be to say: “the probability that a perceived result at least this extreme is the result of random variation.”



I want to cite your article, and having found the American Scientist version, thought that would be better than the unpublished pdf, but, given the errors introduced by editors, it seems the pdf is a better choice.

I assume the editors also cut out the literary allusion to the Borges story of the Garden of Forking Paths, which is such a nice one and something I want to develop, so I am staying with the pdf.
