
Revised statistical standards for evidence (comments to Val Johnson’s comments on our comments on Val’s comments on p-values)

As regular readers of this blog are aware, a few months ago Val Johnson published an article, “Revised standards for statistical evidence,” making a Bayesian argument that researchers and journals should use a p=0.005 publication threshold rather than the usual p=0.05.

Christian Robert and I were unconvinced by Val’s reasoning and wrote a response, “Revised evidence for statistical standards,” in which we wrote:

Johnson’s minimax prior is not intended to correspond to any distribution of effect sizes; rather, it represents a worst case scenario under some mathematical assumptions. Minimax and tradeoffs do not play well together, and it is hard for us to see how any worst case procedure can supply much guidance on how to balance between two different losses. . . .

We would argue that the appropriate significance level depends on the scenario and that what worked well for agricultural experiments in the 1920s might not be so appropriate for many applications in modern biosciences . . .

PNAS also published comments from Jean Gaudart, Laetitia Huiart, Paul Milligan, Rodolphe Thiebaut, and Roch Giorgi (“Reproducibility issues in science, is P value really the only answer?”) and Luis Pericchi, Carlos Pereira, and María-Eglée Pérez (“Adaptive revised standards for statistical evidence”), along with Johnson’s reply to all of us.

Val Johnson and I agree

Before getting to my disagreements with what Val wrote, I’d like to emphasize the important area where we agree. We both feel strongly dissatisfied with the existing default approach of scientific publication in which (a) statistical significance at the p=0.05 level is required for publication, and (b) results which are published and achieve p=0.05 are considered to be correct.

Val’s approach is to apply a minimax argument leading to a more stringent p-value threshold, whereas I’d be more interested in not using p-values (or related quantities such as Bayes factors) as publication thresholds at all. But we agree that the current system is broken. And I think we also agree that thresholds for evidence should depend on scientific context. For example, Val proposes a general cutoff of p=0.005 but he also writes approvingly (I think) that “P value thresholds of 3 × 10⁻⁷ are now standard in particle physics.” Again, I don’t like using any p-value threshold but I agree with Val that the current p=0.05 thing is causing problems. (Indeed, in some settings, I think it’s fine to report evidence that does not even reach the 0.05 level, if the problem is important enough. We discussed this in the context of the flawed paper on the effects of coal heating in China, where I argued that (a) their claim of p=0.05 statistical significance was a joke, but (b) maybe their claims should still be published, despite their inconclusive nature, because of the importance of the topic.)

In short, Val and I agree with the Bayesian arguments made by Berger and others that p=0.05 provides weaker evidence than is typically believed. Where we disagree is in what to do about this.
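
To give a rough sense of the numbers behind that kind of argument, here is a minimal sketch of one well-known calibration, the Sellke, Bayarri, and Berger lower bound on the Bayes factor in favor of the null, -e p ln(p), which holds for p < 1/e. This is only an illustration and not necessarily the calculation anyone in this exchange has in mind. The function names are hypothetical, and the 50/50 prior on the null is an assumption chosen purely for convenience.

```python
import math


def min_bayes_factor_for_null(p):
    """Lower bound -e * p * ln(p) on the Bayes factor for the null (valid for 0 < p < 1/e)."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def min_posterior_prob_null(p, prior_null=0.5):
    """Smallest posterior probability of the null implied by the bound.

    prior_null is an assumed prior probability of the null (0.5 here for illustration).
    """
    bf = min_bayes_factor_for_null(p)
    prior_odds = prior_null / (1.0 - prior_null)
    post_odds = bf * prior_odds
    return post_odds / (1.0 + post_odds)


for p in (0.05, 0.005):
    bf = min_bayes_factor_for_null(p)
    print(f"p = {p}: odds against the null are at most {1.0 / bf:.1f} to 1; "
          f"P(null | data) is at least {min_posterior_prob_null(p):.3f}")
```

Under these assumptions, p=0.05 corresponds to odds against the null of at most about 2.5 to 1, while p=0.005 corresponds to at most about 14 to 1. The bound is a best case for the alternative, so any particular prior on effect sizes would give weaker evidence still.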

Val Johnson and I disagree

In his reply to Christian and me, Val writes:

Gelman and Robert’s letter characterizes subjective Bayesian objections to the use of more stringent statistical standards, arguing that significance levels and evidence thresholds should be based on “costs, benefits, and probabilities of all outcomes.” In principle, this is a wonderful goal, but in practice, it is impossible to achieve. In most hypothesis tests, unique and well-defined loss functions and prior densities do not exist. Instead, a plethora of vaguely defined loss functions and prior densities exist. . . . Thousands of scientific manuscripts are written each year, and eliciting these distinct loss functions and priors on a case-by-case basis, and determining how to combine them, is simply not feasible. . . .

Just to be clear here: Christian and I nowhere used the term “subjective” in our letter, and indeed I do not consider our reference to decision analysis to be subjective, at least not any more subjective than the choice of a probability of 1/20 that drives Val’s calculations. The 1/20 level is objective only in the sociological sense that it represents a scientific tradition.

Val’s second point is that well-defined loss functions are difficult to achieve. I agree, and indeed in my own work I have rarely worked with formal loss functions or performed formal decision analyses. I am happy to report posterior inferences along with the models on which they are based. But Val is wanting to do more than this. He is trying to set a universal threshold for statistical significance. I don’t think this makes sense for the reasons Christian and I discussed in our letter. Finally, Val writes of the difficulty of eliciting loss functions and priors for the “thousands of scientific manuscripts [that] are written each year.” Sure, but one could make the same argument regarding other aspects of a scientific experiment, such as the design of the experiment, rules for data exclusion and data analysis, and choice of what information to include in the analyses. In some settings, it will be difficult to elicit a data model too, but the statistical profession seems to have no problem requiring researchers to do it.

Val also replies to one of our specific comments in this way:

The characterization of uniformly most powerful Bayesian tests (UMPBTs) as minimax procedures is inaccurate. Minimax procedures are defined by minimizing the maximum loss that a decision maker can suffer. In contrast, UMPBTs are defined to maximize the probability that the Bayes factor in favor of the alternative hypothesis exceeds a specified threshold.

I don’t really understand what Val is saying here, but I will accept that the term “minimax” has a technical meaning which does not correspond to his procedure. In any case, I stand by what Christian and I wrote earlier (setting aside the particular word “minimax”) that we can’t see it making sense to work with a worst-case probability that, in this case, does not correspond to any sensible prior distribution.

In short, I respect that Val is working on an important problem, but (a) I don’t really think we can do anything with the numbers that come out of his worst-case approach, and (b) I don’t like the general approach of seeking a universal p-value threshold.


  1. West says:

    Tom Loredo effectively summarizes the problems with p-values and FWER/FDR from an astronomer’s perspective far better than I can.
    –> “For astronomers, a catalog is not just a report of final classifications of candidate sources. Rather, it is a starting point for further analysis and discovery, perhaps the most common goal being estimating population distributions. Catalogs produced by FDR control are ill-suited to this.”

    And while the effort to find a gold-plated generic significance threshold highlights the problems with current practice, I can’t help but feel it’s a bit of a wild goose chase. The tolerance for false detections can vary so much due to so many factors that any rule of thumb is questionable.

  2. Nadia Hassan says:

    In terms of the question of tradeoffs, the reproducibility project has featured an intriguing finding. Hal Pashler and others have noted that few studies with core results in the .05 > p > .025 range have been replicated. By nature, we are talking about a small sample, so it would likely be unwise to leap to conclusions until more data come in. Still, that would seem to be in line with a potential payoff for raising the statistical significance threshold.

    The question of costs still seems to be there, though. For example, extraversion is associated with relationship satisfaction, but the weighted average correlation is only r=.06.

    Malouff, J. M., Thorsteinsson, E. B., Schutte, N. S., Bhullar, N., & Rooke, S. E. (2010). The five-factor model of personality and relationship satisfaction of intimate partners: A meta-analysis. Journal of Research in Personality, 44(1), 124-127.

    Some individual studies fail to find effects at the .005, .025, or even the current threshold, but the effect exists.
