Hey, good news! Your p-value just passed the 0.05 threshold!

E. J. Wagenmakers writes:

Here’s a link for you. The first sentences tell it all:

Climate warming since 1995 is now statistically significant, according to Phil Jones, the UK scientist targeted in the “ClimateGate” affair. Last year, he told BBC News that post-1995 warming was not significant–a statement still seen on blogs critical of the idea of man-made climate change. But another year of data has pushed the trend past the threshold usually used to assess whether trends are “real.”

Now I [Wagenmakers] don’t like p-values one bit, but even people who do like them must cringe when they read this. First, this apparently is a sequential design, so I’m not sure what sampling plan leads to these p-values. Secondly, comparing significance values suggests that the data have suddenly crossed some invisible line that divided nonsignificant from significant effects (as you pointed out in your paper with Hal Stern). Ugh!

I share Wagenmakers’s reaction. There seems to be some confusion here between inferential thresholds and decision thresholds. Which reminds me how much I hate the old 1950s literature (both classical and Bayesian) on inference as decision, loss functions for estimators, and all the rest. I think the p-value serves a role in summarizing certain aspects of a model’s fit to data but I certainly don’t think it makes sense as any kind of decision threshold (despite that it is nearly universally used as such to decide on acceptance of research in scientific journals).
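To make the "invisible line" point concrete, here is a minimal sketch with made-up numbers (the estimates and standard errors below are illustrative assumptions, not climate data). One estimate clears the 0.05 threshold and the other does not, yet the difference between the two is itself nowhere near significant.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical estimates and standard errors (assumed for illustration only).
est_a, se_a = 25.0, 10.0   # z = 2.5
est_b, se_b = 10.0, 10.0   # z = 1.0

p_a = two_sided_p(est_a / se_a)
p_b = two_sided_p(est_b / se_b)

# Compare the two estimates directly, treating them as independent.
diff = est_a - est_b
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
p_diff = two_sided_p(diff / se_diff)

print(f"A: p = {p_a:.3f}")
print(f"B: p = {p_b:.3f}")
print(f"A minus B: p = {p_diff:.3f}")
```

The independence assumption is a simplification; successive years of an overlapping time series are far from independent, which only strengthens the point that a change in significance status between them says very little.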

21 thoughts on “Hey, good news! Your p-value just passed the 0.05 threshold!”

  1. Andy, I share your skepticism about using p-values as inferential thresholds. But I'm curious why that rules out treating inference as a decision theoretic problem. When statistics is playing a descriptive role, just presenting (e.g.) posterior probability distributions makes sense. But when empirical evidence is being used for theory testing, we do have to make some kind of decision about whether the evidence is supportive of the theory or not (e.g., whether predicted hypotheses are confirmed). We could put the decision on a sliding scale, but it would still be nice to have that scale made clear.

    My take is that decision theory makes clear and transparent that which would ordinarily be murky and subjective, viz., how supportive evidence is of a theoretical prediction. This is also why I don't understand why people rule out decision theory because of disagreement over a loss function; the alternative, it strikes me, is just to have the same disagreement but over vague and unspecified loss functions.

    Full disclosure: I have skin in this game. (http://userwww.service.emory.edu/~jesarey/riskstats.pdf)

  2. The discussion of the news article (which was written by a journalist) aside, perhaps Jones's point was simply that the "not statistically significant at the 5% level" argument no longer holds, not that the inference is somehow more believable this year than it was last.

  3. 1) The general point is certainly right. When I took my first statistics course 45 years ago, I always thought the sudden jump from not significant to significant at X% made no sense at all in the real world for making decisions.

    2) But there is more to this particular story. Phil Jones is a very straightforward guy who only recently is getting accustomed to the rough-and-tumble of dealing with the press.

    3) The original interview included a trap question, having purposefully picked the longest interval for which the result was not significant at the 95% level, and framed as shown below.

    4) Some good analyses from "tamino" (whose actual identity is obvious from his website; he does a lot of time-series work) include:

    Phil Jones was wrong.

    "During an interview for the BBC he was asked, “Do you agree that from 1995 to the present there has been no statistically-significant global warming?” Jones replied, “Yes, but only just.”"

    Tamino followed up with:
    Loaded questions.
    People discuss better ways to handle the question.

    5) But how did the BBC person happen to ask this exact question?
    Deep Climate has a good analysis. Obviously we don't know for sure, but the connections seem likely.

    Basically, Phil was asked a carefully-loaded question in an interview, and he could have responded better, but I don't think he had yet gained the experience of dealing with such things. Most scientists tend to be straightforward with the press, and relatively few are really experienced at such. (For example, this is why we all miss Steve Schneider.)

  4. CLIMATEGATE 101: "Don't leave stuff lying around on ftp sites – you never know who is trawling them. The two MMs have been after the CRU station data for years. If they ever hear there is a Freedom of Information Act now in the UK, I think I'll delete the file rather than send to anyone….Tom Wigley has sent me a worried email when he heard about it – thought people could ask him for his model code. He has retired officially from UEA so he can hide behind that." – Phil Jones

  5. Justin:

    I think decision analysis is great. We have a chapter on decision analysis in Bayesian Data Analysis (chapter 22, I believe). I just think decision analysis should be tied to actual costs and benefits rather than to abstract concepts such as squared error loss, minimax regret, and other bad ideas (in my opinion) from the 1950s.

  6. Careful cherry-picking of intervals shows up all the time, as people move the start date back and forth to get the results they want, generally "no warming" or "no statistical significance."

    This doesn't do significance, just plots N-year linear regressions by ending year (yes, I know one can argue for centered). The result for N = 5, 10, 15, 30 is here.
    Short intervals are spectacularly dependent on choice of end year. Of course, at 30 years (the usual interval climate scientists mention, given the year-to-year noise), it's been 30 years since there was a 30-year-period with a non-positive SLOPE.
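For readers who want to try the trailing-window idea from the comment above, here is a rough sketch. It uses a synthetic series (an assumed steady trend plus random noise, not real temperature data) and a hypothetical helper, `trailing_slopes`, that fits an ordinary least-squares line to each N-year window and indexes the slope by ending year.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def trailing_slopes(years, values, n):
    """Slope of the n-year window ending at each year, keyed by ending year."""
    slopes = {}
    for end in range(n - 1, len(years)):
        start = end - n + 1
        slopes[years[end]] = ols_slope(years[start:end + 1], values[start:end + 1])
    return slopes


# Synthetic data: assumed trend of 0.015 units/year plus noise (sd 0.1).
rng = random.Random(0)
years = list(range(1950, 2011))
values = [0.015 * (y - 1950) + rng.gauss(0, 0.1) for y in years]

spread = {}
for n in (5, 10, 15, 30):
    slopes = trailing_slopes(years, values, n)
    spread[n] = max(slopes.values()) - min(slopes.values())
    print(f"N={n:2d}: slopes range from {min(slopes.values()):+.4f} "
          f"to {max(slopes.values()):+.4f}")
```

Even with a constant underlying trend, the short windows swing widely with the choice of end year while the 30-year windows barely move, which is the pattern the comment describes.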

  7. This whole thing is ridiculous…by which I mean the idea that it makes sense to compare the data to a model in which there is no trend. It's like people think the field of physics is worthless and all we can trust is empirical observation.

    I am a big fan of empirical observation! It's very important! Nobody should blindly believe a (necessarily) simplified physics model of a complicated system (such as the climate) and it is very important to look at the data. But it is nonsense to compare the data to a null result. It's like weighing your baby every day and waiting until the gain is "statistically significant" before you conclude that s/he is growing. If you are going to do any kind of significance test, compare the data to your model.

  8. Elaine: Competent clinical trialists are well aware that it's misleading to simply keep going until a naive p-value dips below (say) 0.05. For what they do instead, see the Sequential Clinical Trials literature.
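A small simulation shows why the naive approach misleads. This is only a sketch under simplifying assumptions (normal data with known unit variance, a z-test recomputed after every new observation, and a true effect of exactly zero); it is not how sequential trials are actually analyzed, and the function names are made up for the example.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2))


def peeking_rejects(rng, max_n=100, start_n=10, alpha=0.05):
    """Test after every observation from start_n on; True if any test rejects."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0, 1)          # the null is true: mean 0, sd 1
        if n >= start_n and two_sided_p(total / math.sqrt(n)) < alpha:
            return True
    return False


def fixed_n_rejects(rng, n=100, alpha=0.05):
    """Single test at a sample size fixed in advance."""
    total = sum(rng.gauss(0, 1) for _ in range(n))
    return two_sided_p(total / math.sqrt(n)) < alpha


rng = random.Random(1)
n_sims = 2000
rate_peek = sum(peeking_rejects(rng) for _ in range(n_sims)) / n_sims
rate_fixed = sum(fixed_n_rejects(rng) for _ in range(n_sims)) / n_sims

print(f"false-positive rate with peeking: {rate_peek:.3f}")
print(f"false-positive rate with fixed n: {rate_fixed:.3f}")
```

The fixed-sample test rejects at roughly the nominal rate, while stopping at the first p < 0.05 rejects far more often even though there is no effect at all. Formal sequential designs adjust the stopping boundaries to account for this.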

  9. I really hate this line of debate, and I don't think it helps the community at all. In certain circles, the word "p-value" is taboo. There are problems with p-values just like there are problems with most methodologies. It is really a low-ball tactic to tie this debate to the global warming issue. Most scientifically-minded people believe there is global warming, and our belief is not due to statistical significance, and certainly not significance alone, or even significance primarily. It's sad that Wagenmakers tries to advance his campaign against "p-value" by sowing doubt on the science of climate change.

    Where can I see a coherent argument for how to make a practical decision using probability distributions or confidence intervals for that matter? By practical, I mean a binary decision has to be made and a consensus must be achieved between multiple decision makers with conflicting objectives.

    How do we deal with imprecision as it relates to cost and benefit estimates? How do we deal with wide intervals? How do we deal with varying tolerance for errors amongst the decision makers? How do we deal with biased decision makers who set their error tolerance based on what decision they would like to see made?

    Most importantly, in these real-world settings, when statisticians walk into the room with probabilities and (presumably) a recommended decision, how often do we walk out of the room having achieved consensus on that recommendation? Can we really influence decisions without some kind of thresholds or rules? How would we otherwise support the recommended decision?

  10. Kaiser:

    I think p-values can be a useful way to summarize misfit of model to data (see chapter 6 of Bayesian Data Analysis), and I agree that decisions often need a threshold, but I don't think the p-value is the right way to determine the threshold.

    Also, I don't know Wagenmakers but I have no reason to think that he is sowing doubt on the science of climate change. As Phil notes, the science of climate change is based on a lot more than this particular short time series.

  11. Andrew: I agree that p-value is often not the way to go about it, and confidence intervals are much more informative. I do worry that our community needs to have some consistent methodology to derive point decisions from distributions.

  12. It's funny seeing all this sort of criticism aimed at global warming stuff. The global warming games of this nature barely begin to TOUCH what's been going on in the Great Antismoking Crusade for decades.

    The EPA Report that launched so many of the smoking bans by stoking public fear in the US not only cherry-picked the studies to include, but then, when the meta-analysis didn't meet the 95% criterion, they stated they could use a one-tailed analysis since they "expected" the results would lie in a particular direction. Thus they got their "significant threat" of an extra one lung cancer in a thousand after 40 years of constant daily exposure to secondhand smoke.

    Of course all the public heard was that "If you are exposed to someone else's smoke you have a 20% increased chance of dying from lung cancer."

    Also: the trick referred to in this statement: "Careful cherry-picking of intervals shows up all the time, as people move the start date back and forth to get the results they want, generally 'no warming' or 'no statistical significance.'" was used quite nicely in the study making the claim that Big Tobacco was steadily increasing the amount of nicotine in cigarettes to "addict new smokers." I don't have it in front of me at the moment, but basically they graphed something like 1998 through 2006 and plotted a line with a small but clearly increasing slope.

    It wasn't till later on when some of us more skeptical folks did some investigating that it turned out they seemed to have deliberately lopped off the figures for the year before and the year after their graph. If those years were included the rate of increase in nicotine was… Zero!

    And finally, folks here might be interested in some analysis I did of a study that made headlines all over the world claiming "Smoking bans do not hurt bar and restaurant employment." The trick there was that the researchers took two discrete data sets, bar & restaurant, saw the bars lost 11% after a ban, and then combined them with restaurants to show a "non-significant" loss and came up with that carefully worded "and" in their headline and statements. Full story on that in the comments at:


    Sooo… you can see why I find it a bit funny that the climate stat crits get so much attention.

    Michael J. McFadden
    Author of "Dissecting Antismokers' Brains"

  13. Mr. McFadden, I'm no fan of making decisions based on p-values, but I will say that if you are going to pay attention to a p-value for cancer risk from secondhand smoke then a one-tailed test makes a lot more sense than a two-tailed test. And, as with first-hand smoking, lung cancer is not the only risk from secondhand smoke: it also causes increased bronchitis and other lung problems. All of the above reached levels of "statistical significance" at the 5% level sometime back in the mid- to late-90s. I believe, but I am not sure, that the same is true of heart disease deaths.

    By the way, in case readers are wondering, some of the best early work in this area looked at the disease or death rates of nonsmoking women living with smoking husbands, and compared them to matched women living with nonsmoking husbands. There are the usual sorts of problems with how to do the matching, plus of course even nonsmoking women with nonsmoking husbands were exposed outside the home. I am not an expert in this area, but people I trust say that the studies were very convincing by the mid-90s and are completely so now.

    The current estimate of lung cancer risk suggests that long-term exposure to secondhand smoke (such as living with a smoker, or working at a full-time job in a smoky environment) increases risk by something like 10-30%. Since about 1.2% of never-smokers get lung cancer, this means that long-term exposure to secondhand smoke is estimated to cause something like 1 to 3 lung cancers per 1000 people; McFadden has chosen the lower end of that estimate, which is a plausible number but somewhat lower than the current best guess.
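The arithmetic behind that range is easy to check. The sketch below simply multiplies the figures quoted in the comment above (a baseline of about 1.2% and a relative increase of 10-30%); the function name is made up for the example, and nothing here is an independent estimate.

```python
def extra_cases_per_1000(base_rate, relative_increase):
    """Excess cases per 1000 people implied by a relative risk increase."""
    return base_rate * relative_increase * 1000


base_rate = 0.012  # about 1.2% of never-smokers, as quoted in the comment

low = extra_cases_per_1000(base_rate, 0.10)
high = extra_cases_per_1000(base_rate, 0.30)

print(f"10% increase: about {low:.1f} extra cases per 1000")
print(f"30% increase: about {high:.1f} extra cases per 1000")
```

This gives roughly 1.2 to 3.6 extra cases per 1000, the same order as the "1 to 3" quoted. It also shows how much the answer depends on the assumed baseline rate.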

  14. Thank you for a reasonable response Phil! :) That's the sort of thing I was hoping for here: you should see some of the attacks I get! :>

    May I ask where you got the 1.2% figure from? My own estimate is based on the EPA figure of a 19% increase over a base rate of .4% for nonexposed nonsmokers.

    Re the one-tailed vs. two-tailed test: It wasn't *totally* unreasonable, but it's not something that epidemiologists usually apply elsewhere under similar circumstances. There have been a number of studies (about 10% I believe) showing a nonsignificant finding in the "protective" direction, and even a couple showing it significantly at the 95% level. And we know that there have been other unexpected findings for low levels of exposures to problem substances — I believe the best wisdom at the moment is that if you raise your children while trying to keep them in a "bubble environment," you end up with adults prone to all sorts of problems that their bodies never adapted to.

    My post wasn't so much to argue that particular point as it was to point out how statistics and statistical language has been so commonly misused in that area for social engineering purposes — I believe far more so than in the global warming debate (though I claim no expertise at all in the latter!)

    – MJM

  15. Just to be clear, Prof. Jones is not performing any sequential design, and if the context is made clear, there is no reason to cringe. Prof. Jones, like most climatologists, uses e.g. 30-year trends when looking at climate, as the internal variability of the climate (i.e. weather) is too large for secular variation to be reliably detectable over a much shorter timescale. Prof. Jones was asked a loaded question (based on a cherry-picked start date) in a BBC interview and gave a perfectly correct, honest and straightforward answer (that the trend in question was not significant). To thank him for that, the media then misinterpreted his answer to imply that there was no warming (a classic misinterpretation of a hypothesis test). In the more recent interview, he is merely pointing out that the trend is now significant (as he predicted in the first interview that it would be), to try to counter some of that misreporting.

    I very much doubt he is basing any scientific conclusion on this trend going from insignificant to significant. If you read the interview, you will find he doesn't draw any conclusion about climate from it; he just reiterates that you need even longer trends to reliably detect the trend.

    I think Wagenmakers ought to have read more than just the first few sentences and owes Prof. Jones an apology.

  16. The relationship between smoking and cancer had been established in the 1950s by the work of Richard Doll and others. Doll initially suspected that tarmac or car fumes were the culprit, but his study had a surprise.
