“Are all significant p-values created equal?”

The answer is no, as explained in this classic article by Warren Browner and Thomas Newman from 1987. If I were to rewrite this article today, I would frame things slightly differently—referring to Type S and Type M errors rather than speaking of “the probability that the research hypothesis is true”—but overall they make good points, and I like their analogy to medical diagnostic testing.
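The diagnostic-testing analogy can be sketched in a few lines. This is only an illustration under the discrete true/false setup that the analogy assumes (and that several commenters below question): the function name and all the numbers are hypothetical, not taken from Browner and Newman. Power plays the role of sensitivity, and alpha plays the role of one minus specificity.

```python
def prob_hypothesis_true(prior, power, alpha):
    """P(research hypothesis true | significant result), by Bayes' rule.

    prior: assumed probability the research hypothesis is true before the study
    power: P(significant | hypothesis true), the analogue of sensitivity
    alpha: P(significant | null true), the analogue of 1 - specificity
    """
    true_pos = prior * power
    false_pos = (1.0 - prior) * alpha
    return true_pos / (true_pos + false_pos)


# Hypothetical scenarios (illustrative only): (label, prior, power, alpha)
scenarios = [
    ("plausible hypothesis, well powered", 0.5, 0.8, 0.05),
    ("long-shot hypothesis, well powered", 0.1, 0.8, 0.05),
    ("long-shot hypothesis, low power", 0.1, 0.2, 0.05),
    ("long-shot hypothesis, stricter alpha", 0.1, 0.8, 0.005),
]
for label, prior, power, alpha in scenarios:
    print(f"{label}: {prob_hypothesis_true(prior, power, alpha):.2f}")
```

Under these made-up inputs, the same "p < 0.05" corresponds to a probability of about 0.94 in the first scenario and about 0.31 in the third, which is the sense in which significant p-values are not created equal.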

19 thoughts on “Are all significant p-values created equal?”

    • I confess that I have only skimmed this paper so far, but I was disturbed by some of the statements in the Discussion, e.g.

      Although it is difficult to assess the proportion of all tested null hypotheses that are actually true, if one assumes that this proportion is approximately one-half … The P values displayed in Fig. 3 presumably arise from two types of experiments: experiments in which a true effect was present and the alternative hypothesis was true, and experiments in which there was no effect present and the null hypothesis was true …


      Maybe these statements make sense if the database being used contained only tests between discrete alternatives and not tests of point null hypotheses, but I would be really, really surprised if that were the case: the data set comprised “855 t tests reported in articles from the 2007 issues of Psychonomic Bulletin & Review (PBR) and Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP:LMC).” (from Wetzels R, et al. (2011) Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspect Psychol Sci 6(3):291-298 http://pcl.missouri.edu/sites/default/files/Wetzels:etal:2011.pdf …)

      (I also think pigs will fly shortly before ecologists (at least) decide to move from “significant/highly significant” thresholds of 0.05/0.01 to thresholds of 0.005/0.001 …)

      So now I have to sit down and read the paper to see if I can now override my skepticism and find out if there’s something really valuable there …

      • (Don’t bother with the Johnson paper, read the Browner & Newman.)

        There can be no doubt that requiring P<0.005 for claiming significance will reduce the rate of false positives, but that doesn't address the more important problem that scientists too readily give responsibility for inference to algorithms.

      • Hi Ben,

        FWIW, Deborah Mayo shares your concern with this kind of calculation: http://errorstatistics.com/2013/11/09/beware-of-questionable-front-page-articles-warning-you-to-beware-of-questionable-front-page-articles-i/

        Calculations along these lines have been cropping up in many venues recently (and apparently not so recently, in light of Browner & Newman 1987). I confess I’d noted them with interest without really thinking about them critically. Like you and Deborah Mayo, now that I’m thinking about it I don’t quite see how these calculations make sense. It seems like the analogy to diagnostic testing of discrete alternatives just doesn’t hold in the context of point null hypotheses, except in a very loose sense (probably too loose to be very helpful). Which is a problem insofar as people are trying to use this kind of problematic calculation as a guide to reforming statistical practice.

        Perhaps I need to think more about it, and there are versions of this sort of calculation out there that don’t suffer from the issues you and Deborah Mayo identify…

        • Ben and Jeremy:

          As we all know

          The best should not become the enemy of the good.

          All models are wrong or, more poetically, “there is the word butterfly and the butterfly, if you confuse the two people have the right to laugh at you.” (L Cohen)

          For those who have yet to grasp the finer points of measure theory these models of generating false positive claims can effectively reduce the wrongness in their thinking about these things.

          I think Andrew’s point about the S and M errors is not so much that it avoids point nulls (which adds to face validity) but that it adds additional help in getting less wrong. If you are going to take this as evidence, how often do you get the direction wrong, and how far off the truth are you likely to be?
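The two questions at the end of that comment (how often is the direction wrong, and how far off is the estimate) can be explored by simulation. The sketch below assumes a normally distributed estimate with known standard error and a hypothetical true effect; the function name `type_s_m` and all inputs are illustrative assumptions, not a published implementation.

```python
import random


def type_s_m(true_effect, std_error, z_crit=1.96, n_sims=100_000, seed=1):
    """Simulate Type S (sign) and Type M (magnitude) errors.

    Returns (type_s, exaggeration), both computed only over the
    simulated estimates that reach statistical significance.
    """
    if true_effect == 0:
        raise ValueError("true_effect must be nonzero for these definitions")
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > z_crit * std_error:
            significant.append(estimate)
    if not significant:
        raise ValueError("no significant results; increase n_sims")
    # Type S: share of significant estimates with the wrong sign.
    wrong_sign = sum(1 for e in significant if e * true_effect < 0)
    type_s = wrong_sign / len(significant)
    # Type M: average factor by which significant estimates overstate the effect.
    mean_abs = sum(abs(e) for e in significant) / len(significant)
    exaggeration = mean_abs / abs(true_effect)
    return type_s, exaggeration


# Hypothetical inputs: true effects of 1.0 and 0.2 standard errors.
for effect in (1.0, 0.2):
    s, m = type_s_m(true_effect=effect, std_error=1.0)
    print(f"true effect {effect}: type S = {s:.3f}, exaggeration = {m:.1f}x")
```

When the assumed true effect is small relative to the standard error, a significant estimate is both more likely to have the wrong sign and bound to overstate the magnitude, since it must exceed 1.96 standard errors to be declared significant at all.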

        • K? O’Rourke:

          Sorry, you kind of lost me. Afraid I don’t see why critiquing a problematic calculation amounts to making the best the enemy of the good. I think there are many good reasons to be concerned about the usual statistical practice in many fields. I just question whether this particular sort of calculation is among those reasons.

          I respectfully disagree that a problematic calculation is helpful because it will alter the behavior of researchers in positive ways. I don’t think it’s a good idea to try to manipulate or trick researchers into behaving as you want them to behave! There is no shortage of compelling, valid ways to call attention to problematic statistical practices, which will resonate with researchers who aren’t familiar with the finer points of measure theory. I’m sure you aren’t suggesting that the only alternatives here are appeals to the finer points of measure theory and appeals to oversimplified, problematic calculations!

          Afraid I’m unclear what Andrew’s notion of type S and M errors has to do with the calculation Ben and I (and Deborah Mayo) are critiquing. I actually like the notion of type S and M errors–but I don’t see why this problematic calculation is an argument for adopting type S and M errors.

        • I just see it more as a half-full glass of water, in that it helps folks realise how selective publication and other research practices can be problematic.

          So agreeing with Andrew “but overall they make good points, and I like their analogy to medical diagnostic testing.” while suggesting technical _distractions_ can be left aside.

          p.s. really liked your video clip on reactions to criticism on your blog!

  1. That’s a wonderful paper. Thank you for pointing it out.

    It seems to me that a way around the ‘problem of p’ and the ‘insignificance of statistical significance’ may be to stop talking about hypotheses and instead concentrate more on estimation. (I don’t mean to just give up on p-values in favour of confidence intervals, so keep reading!)

    In many situations (most, I’d guess) scientists are interested in “how much” questions rather than questions about the truth and falsity of point or interval hypotheses. The p-value from a significance test points to a likelihood function (http://arxiv.org/abs/1311.0081) which displays the evidence in the data in a manner that supports estimation, and is more informative than the p-value alone. As the likelihood function is directly related to the post-data power curve of the test and can be fed into a Bayesian analysis (if an appropriate prior function is available), it fits perfectly into the inferential framework that Browner & Newman are suggesting we adopt.

    The hypothesis focus of conventional statistical analysis and commentary is probably a consequence of the all-or-none outcomes of Neyman-Pearsonian hypothesis tests and their type I and type II errors, and it is difficult to argue clearly and cogently for changes to common inferential practices using those terms. Thus Andrew’s suggested change to the framing by talking exclusively of type S and type M errors would be an excellent start towards changing from a dysfunctional system of significant/not significant to a much more useful system where the supported parameter values are more or less. Browner & Newman’s call for careful consideration of biological plausibility, prior likelihood (a reference to Edwards’s preferred form of prior?), previous experience and knowledge of alternative scientific explanations before test results are interpreted is important. We need to acknowledge that such consideration is impossible where type I and II errors are the statistical end-points.

  2. Pingback: Friday links: the history of “Big Data” in ecology, inside an NSF panel, funny Fake Science, and more (UPDATEDx2) | Dynamic Ecology

  3. “What then is a P value? It is the likelihood of observing the study results under the assumption that the null hypothesis of no difference is true.”

    I have a real problem with an article purporting to explain p-values that gets the definition wrong in two non-overlapping ways in the second freaking paragraph.

    • What likelihood functions (equivalence classes of functions) actually are, and how slight variations of them lose many of their good properties, seems not to be widely known.

      And it is a very technical and obscure subject. (My external examiner was actually yelling at me about my choice of definition for marginal likelihood even though it was just choice among a few in the published literature.)

      Also think it would be best just to ban its use and stick with posterior/prior, i.e. the prior conditioned on data divided by the prior. If it’s just a calculation device to get the posterior from the prior, don’t say anything more about it.

      • I was actually giving them a pass on the term “likelihood” since the term in common parlance is a synonym of probability, and it’s clear that they intend that meaning rather than the statistical jargon term. I was referring to “observing the study results” (nope: should be “observing results as or more extreme than the study results”) and “assumption that the null hypothesis of no difference is true” (nope: “null hypothesis” is statistical jargon again; the substantive part is “no difference”, and this is not part of the definition of a p-value).
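The first correction, "as or more extreme" rather than "the study results", can be made concrete. The sketch below assumes, purely for illustration, a test statistic that is standard normal under the model being tested; the function names are hypothetical. It contrasts the density at the observed value with the two-sided tail area, which is what a p-value actually is.

```python
import math


def normal_density(z):
    """Density of a standard normal at z: the 'observed result' itself."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def two_sided_p(z):
    """P(|Z| >= |z|) for standard normal Z: results as or more extreme."""
    return math.erfc(abs(z) / math.sqrt(2.0))


z_obs = 1.96  # hypothetical observed test statistic
print(f"density at z_obs:   {normal_density(z_obs):.4f}")
print(f"two-sided p-value:  {two_sided_p(z_obs):.4f}")
```

The two numbers differ, and only the tail area is a probability; for a continuous statistic the probability of observing exactly the study result is zero.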

  4. Are all priors created equal?

    I found the analogy to clinical tests strange. Here sensitivity and specificity are validated with reference to a population. So if we know the age of the patient we should _condition_ on it. A prior seems an odd way of doing the calibration.

    Model test result as function of age and unobserved disease status. Now do belief propagation and explaining away. I.e. if you tell me the result is positive, I raise probability of disease. But if you then tell me she is 30 I lower it.
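A much-simplified version of this updating can be sketched as follows. It assumes, hypothetically, that only disease prevalence depends on age while sensitivity and specificity do not, and it uses two made-up age groups with made-up numbers; the commenter's fuller model, in which the test result itself depends on age, is not attempted here.

```python
SENSITIVITY = 0.9   # hypothetical P(positive | disease)
SPECIFICITY = 0.9   # hypothetical P(negative | no disease)

# Hypothetical prevalence by age and population share of each age group.
PREVALENCE = {30: 0.01, 70: 0.20}
AGE_SHARE = {30: 0.5, 70: 0.5}


def posterior_disease(prevalence):
    """P(disease | positive test) for a given pre-test prevalence."""
    true_pos = SENSITIVITY * prevalence
    false_pos = (1.0 - SPECIFICITY) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Age unknown: use the prevalence averaged over the age mix.
marginal_prev = sum(AGE_SHARE[a] * PREVALENCE[a] for a in PREVALENCE)
p_positive_only = posterior_disease(marginal_prev)
# Age known to be 30: condition on the age-specific prevalence.
p_positive_age30 = posterior_disease(PREVALENCE[30])

print(f"before any test:        {marginal_prev:.3f}")
print(f"after positive result:  {p_positive_only:.3f}")
print(f"positive and age 30:    {p_positive_age30:.3f}")
```

With these invented inputs the positive result raises the probability of disease, and learning that the patient is 30 lowers it again, matching the pattern the comment describes.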

  5. Pingback: Bayesian linear regression analysis without tears (R) | Statistical Reflections of a Medical Doctor

  6. Here’s a real story in which relying on a p-value would not have been the best course of action. Decades ago I arranged for the calibration of a slow neutron test facility. Basically the calibration lab sends you gold foils which you expose to the neutrons for a measured time. The foils become slightly radioactive; you ship the foils back to the lab and it determines the degree of activation by a series of counting measurements. The statistical errors of counting measurements are well known.

    I repeated the calibration procedure a year and a half later. The two results disagreed by a little more than two standard deviations. The later one was larger. What to think? Well, two sigma or so is improbable but not unduly so, and there could well have been other sources of error than just the counting statistics. We could average the two results, but still … two sigma.

    I did a literature search and found a paper that reported that this kind of neutron source would increase in strength over time, due to a buildup of radioactive daughter products from initial impurities in the source. The derivation of the equation for source growth seemed correct. I was able to track down the original impurity assay for our source.

    Plugging into the published equation, I found an expected significant amount of growth. Fitting our results onto the calculated curve, I found that the best fit had the earlier result below the curve and the later one above, but each was a little closer than one standard deviation. That suggested that the growth had been more than predicted, but possibly not since each value was within one sigma.

    There was no more information I could find to support any other growth rate. There was no error information about the initial impurity assay, the most likely source of error. So I fit the points to the predicted growth equation using weighted least squares. For the future, we used the fitted growth curve. There was really nothing else to do.

    Had I relied on some p-value for the two calibrations, it would have been in the vicinity of 0.05, and I might have just used the average of the two. But that would not have been correct, given the later finding that source growth had to occur. The final solution was probably not really right either, but it certainly was much closer, and there was no way to improve the results – although following the source readings over time would have helped.
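The arithmetic behind this story can be sketched with invented numbers. Everything below is hypothetical: the comment gives no measurements, no growth equation, and does not say which parameter was fitted, so the sketch assumes a single overall scale factor applied to a predicted relative growth curve, with equal counting errors on both calibrations.

```python
import math


def fit_scale_wls(predicted, measured, sigma):
    """Weighted least-squares scale a minimizing sum(((y - a*f) / s)**2)."""
    weights = [1.0 / s**2 for s in sigma]
    num = sum(w * f * y for w, f, y in zip(weights, predicted, measured))
    den = sum(w * f * f for w, f in zip(weights, predicted))
    return num / den


# Hypothetical values, chosen only to mimic the pattern described above.
predicted = [1.00, 1.04]     # relative source strength from a growth equation
measured = [100.0, 106.0]    # the two calibration results (arbitrary units)
sigma = [2.0, 2.0]           # counting standard deviations

scale = fit_scale_wls(predicted, measured, sigma)
residuals = [(y - scale * f) / s for f, y, s in zip(predicted, measured, sigma)]

# Naive comparison of the two calibrations, ignoring source growth.
z_diff = (measured[1] - measured[0]) / math.hypot(sigma[0], sigma[1])
p_diff = math.erfc(abs(z_diff) / math.sqrt(2.0))

print(f"fitted scale: {scale:.2f}")
print(f"residuals in sigma units: {residuals[0]:+.2f}, {residuals[1]:+.2f}")
print(f"naive difference: {z_diff:.2f} sigma, two-sided p = {p_diff:.3f}")
```

In this toy version the naive difference is a little over two sigma, while relative to the fitted curve the earlier point sits below and the later point above, each within one sigma.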

    I think that this experience represents a great many real-life cases. But many times one does not discover the equivalent of the growth equation. So it behooves us to be demanding yet humble about our conclusions.
