Statistical evidence for revised standards

In response to the discussion that X and I had of his recent paper, Val Johnson writes:

I would like to thank Andrew for forwarding his comments on uniformly most powerful Bayesian tests (UMPBTs) to me and his invitation to respond to them. I think he (and also Christian Robert) raise a number of interesting points concerning this new class of Bayesian tests, but I think that they may have confounded several issues that might more usefully be examined separately.

The first issue involves the choice of the Bayesian evidence threshold, gamma, used in rejecting a null hypothesis in favor of an alternative hypothesis. Andrew objects to the higher values of gamma proposed in my recent PNAS article on the grounds that too many important scientific effects would be missed if thresholds of 25-50 were routinely used. These evidence thresholds correspond roughly to p-values of 0.005; Andrew suggests that evidence thresholds around 5 should continue to be used (gamma=5 corresponds approximately to a 5% significance test).
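For the one-sided z-test with known variance, the UMPBT rejection region works out to z > sqrt(2 ln gamma), which makes the gamma-to-p correspondence quoted above easy to check numerically. This is a sketch for that one special case; the exact correspondence in the PNAS article depends on the test under consideration:

```python
from math import log, sqrt
from statistics import NormalDist

def umpbt_p_value(gamma):
    """One-sided p-value at the UMPBT rejection boundary z > sqrt(2 ln gamma)
    (z-test for a normal mean with known variance)."""
    z_boundary = sqrt(2 * log(gamma))
    return 1 - NormalDist().cdf(z_boundary)

for gamma in (5, 25, 50):
    # gamma=5 lands near a 5% one-sided test; gamma=25-50 lands near p=0.005
    print(gamma, round(sqrt(2 * log(gamma)), 3), round(umpbt_p_value(gamma), 4))
```

The boundary for gamma=5 is z ≈ 1.79 (p ≈ 0.036, roughly the conventional 5% test), while gamma=25 gives z ≈ 2.54 (p ≈ 0.006) and gamma=50 gives z ≈ 2.80 (p ≈ 0.003), bracketing the 0.005 figure mentioned above.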

My proposal for raising the bar for the declaration of a significant finding stems from widespread concern over the reproducibility of scientific research and the fact that the use of an evidence threshold of 5 leads to false positive rates of approximately 20% when one-half of tested null hypotheses are true. My personal experience at a large cancer center suggests that many more than one-half of tested null hypotheses are true (i.e., far fewer than one-half of novel therapies prove effective). My experience seems typical of surveys of p-values reported in the scientific literature (e.g., Wetzels et al. (2011)). And if more than 50% of tested null hypotheses are true, the use of an evidence threshold of 5 is guaranteed to lead to false positive rates that are even higher than 20%.
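The flavor of this arithmetic can be sketched directly. Treating rejection as occurring at a Bayes factor of exactly gamma against the null, the posterior probability that a rejected null was in fact true is pi0/(pi0 + (1-pi0)*gamma), where pi0 is the prior fraction of true nulls. This is a simplification of the calculation behind the figures above (which also involve power assumptions), but it shows how the false positive fraction climbs as pi0 grows:

```python
def false_positive_fraction(pi0, gamma):
    """Posterior probability that the null is true when the Bayes factor
    against it equals the evidence threshold gamma, with prior null
    probability pi0 (a simplified version of the argument in the text)."""
    return pi0 / (pi0 + (1 - pi0) * gamma)

print(false_positive_fraction(0.5, 5))   # ~0.17, the ballpark of the ~20% figure
print(false_positive_fraction(0.7, 5))   # ~0.32 when more than half of nulls are true
print(false_positive_fraction(0.5, 25))  # below 4% at the higher threshold
```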

Of course, the evidence threshold that one chooses for declaring a scientific discovery is an inherently subjective choice, and arguments that higher false positive rates should be accepted in order to avoid missing true effects can certainly be made. However, evidence against a false null hypothesis accrues exponentially fast, so I suspect that most decision-theoretic analyses of the optimal threshold to use for declaring a significant finding would lead to evidence thresholds that are substantially greater than 5 (and probably also greater than 25). But the “optimal” choice of threshold certainly varies across disciplines and applications.

Regardless of the evidence threshold that one chooses, I think it is important to separate the choice of this threshold from the method that is used to determine the form of the alternative hypothesis that is specified. Concerning this decision, Andrew draws an analogy between UMPBTs and minimax methodology. I think this analogy is inappropriate for several reasons. Minimax methods are defined by minimizing the maximum loss that a decision maker can suffer. If a (single) loss function has been specified, and if a (single) subjective prior distribution on unknown parameters is available, then I agree that a Bayesian decision rule should be used. In most applications, however, a unique loss function/prior distribution combination does not exist. Instead, a plethora of loss functions and priors exist. In the trial of a new cancer treatment, for example, separate loss functions and prior distributions apply to each individual patient, to each of the patients’ families, to each of the patients’ treating physicians, to each participating medical center, to each of the pharmaceutical sponsors, to each of the patients’ insurance companies, to the regulatory agency(s) overseeing the trial, and even to the biostatistician(s) who designs the trial. For this reason, clinical trials and many other hypothesis tests are instead posed as significance tests.

UMPBTs are clearly not based on the specification of loss functions and so do not involve minimization of maximum loss. Instead of classifying these tests as minimax procedures, I think it is more appropriate to regard them as Bayesian analogs of most powerful tests (hence their name). Simply put, UMPBTs are the class of objective Bayesian hypothesis tests that maximize the probability that the Bayes factor in favor of the alternative hypothesis exceeds a specified threshold. Given that an evidence threshold has been agreed upon, the objection to UMPBTs on the basis that they lead to high rates of false negatives therefore seems misguided. If one decides to use an evidence threshold of, say, 5, then the UMPBT based on gamma=5 is exactly the Bayesian test that maximizes the probability that an effect is detected. No other test, based on either a subjectively or objectively specified alternative hypothesis, is as likely to produce a Bayes factor that exceeds the specified evidence threshold. So if you’re worried about false negatives, then you should use the UMPBT. The only other way to increase the posterior probability of the alternative hypothesis is to adjust the prior probability assigned to it.

Since I am probably getting close to overstaying my welcome on Andrew’s blog, I would also like to make a short comment on the Jeffreys-Lindley paradox and the assertion that UMPBTs are subject to it. A paradox arises when a situation is viewed from the wrong perspective. The Jeffreys-Lindley paradox arises in the present context when frequentist asymptotic methods are naively used to analyze Bayesian tests involving large samples. That is, it occurs when a Bayesian testing procedure is examined as a sample size is increased, without considering the scientific rationale for collecting a larger sample. When a scientist designs an experiment in which a very large number of items are sampled, she is attempting to either (a) detect a very small effect size, or (b) gain overwhelming evidence in favor of one of the tested hypotheses. In case (a), it is reasonable to regard the evidence threshold as being fixed at a commonly used value (i.e., 5 or 50). Doing so implies that the difference between the parameter value that pertains under the null and alternative hypotheses defined using a UMPBT is small [i.e., O(n^(-1/2))]. But this presents no paradox because the scientist is, by assumption, intending to detect a small effect size. On the other hand, the assumptions inherent to case (b) imply that a higher evidence threshold has been specified and that the sample size was increased to accommodate this higher threshold. When the sample size is increased to obtain a higher degree of certainty regarding the validity of competing hypotheses, then the apparent paradox again disappears because the alternative hypothesis specified under the UMPBT does not collapse onto the null value. The behavior of UMPBTs in large sample settings thus seems quite consistent with scientific practice.
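The large-sample behavior described here can be made concrete for the one-sided z-test, where the UMPBT alternative has the closed form delta* = sigma*sqrt(2 ln gamma / n). A small sketch (assuming known variance; the exponential growth rate for gamma in case (b) is chosen purely for illustration):

```python
from math import exp, log, sqrt

def umpbt_alternative(gamma, n, sigma=1.0):
    """UMPBT alternative for a one-sided z-test of a normal mean:
    delta* = sigma * sqrt(2 ln gamma / n)."""
    return sigma * sqrt(2 * log(gamma) / n)

# Case (a): fixed threshold, growing n -- the alternative shrinks like n^(-1/2),
# consistent with a design aimed at detecting a small effect
for n in (100, 10_000, 1_000_000):
    print(n, umpbt_alternative(5, n))

# Case (b): let the threshold grow with the sample size (here gamma_n = exp(n*d^2/2),
# matching the exponential accrual of evidence) -- the alternative stays at d
# instead of collapsing onto the null
d = 0.2
for n in (100, 10_000):
    gamma_n = exp(n * d**2 / 2)
    print(n, umpbt_alternative(gamma_n, n))  # stays at d = 0.2
```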

I hope these comments provide a useful alternative perspective on the issues discussed in Andrew’s commentary, and I would again like to thank him for his graciousness in both calling his commentary to my attention and his invitation to respond to it.

Here is my quick reply:

1. I am not “suggesting that evidence thresholds around 5 should continue to be used.” I don’t really believe in “evidence thresholds,” for two reasons. First, basic decision analysis tells us that decisions should be based on costs and benefits as well as probabilities. Second, evidence thresholds based on conventional priors do not represent posterior distributions that I’d want to use for decisions. I’ve already discussed in various places my problems with flat priors. But Val’s prior, which is a mixture of point masses, is even less like a representation of any actual distribution of effect sizes.

2. I do not think it is so helpful in most scientific settings to label null hypotheses as “true” or “false.” As we’ve discussed often enough on this blog, I’d much prefer to talk about Type S and Type M errors—that is, getting the sign of a comparison wrong, or overstating the magnitude of a comparison. Mistakes get published all the time—I’m with Val on that point—but I think it is helpful to go beyond the false-negative, false-positive thing.

3. Regarding the details of Val’s model, let me just say again that I’m not so interested in minimax or UMPBT probability calculations, because these probabilities come from a model that is so far from anything that could be plausible. Indeed, the models that Val is using are not intended to be plausible; they’re intended to give a rigorous bound. That’s fine, but I don’t see much relevance in a one-sided bound when considering a tradeoff.
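The Type S / Type M distinction in point 2 can be illustrated with a small simulation, in the spirit of the design-analysis framing Gelman and Carlin use. The numbers here are purely illustrative (a true effect of 0.1 with standard error 1, i.e., a very underpowered study); the point is only that, conditional on statistical significance, the sign is often wrong and the magnitude is badly overstated:

```python
import random

def type_s_m(true_effect, se, n_sims=100_000, z_crit=1.96, seed=0):
    """Monte Carlo Type S rate (wrong sign) and Type M factor (exaggeration
    ratio) among estimates that reach two-sided significance."""
    rng = random.Random(seed)
    draws = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    sig = [est for est in draws if abs(est) > z_crit * se]
    type_s = sum(est * true_effect < 0 for est in sig) / len(sig)
    type_m = sum(abs(est) for est in sig) / len(sig) / abs(true_effect)
    return type_s, type_m

s, m = type_s_m(0.1, 1.0)
# With power this low, significant estimates have the wrong sign a large
# fraction of the time and overstate the true effect many times over.
print(s, m)
```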

In summary, I agree with Val that there seems to be a real problem that p=0.05 empowers a lot of researchers to act as if they have scientific proof when what they really have is noise. And going to a 3-standard-error rule would make it harder. I have no doubt that the Satoshi Kanazawas and Daryl Bems of the world would continue to find statistical significance at a higher threshold, but it would be more work for them to do so, and maybe, at least, their papers would come out a bit less frequently, giving the rest of us more time to spend on more important topics. As Val notes in his original paper, changing the threshold makes it only a little bit more difficult to identify large real effects, while putting a much greater burden on those researchers who are shuffling noise around. And I am also supportive of Val’s desire to put all this in a Bayesian framework. But I still think that, to make this really work, we’d want priors that could plausibly represent distributions of actual effects; I don’t think minimax does the trick. In any case, though, now that Val’s ideas are out there, they provide a useful comparison point to anything that comes next, and I appreciate Val getting his ideas out there to move the conversation forward.

P.S. Further comments (which I agree with) here from X.

10 thoughts on “Statistical evidence for revised standards”

  2. In the case of cancer research, most likely every null hypothesis of the form “two groups are exactly the same for this parameter” is false. There should be some justification for choosing that as the null hypothesis. I would agree that many of the effects observed could just as well be attributed to many other factors rather than the one under study, even if no one has noticed what they could be.

  3. A second point would be that it makes no sense that there is a significance threshold that prevents publishing results. If a study was worth funding, it should be worth telling others what happened. A single study should not be the basis for declaring a scientific discovery, although it may point in that direction. That idea goes all the way back to Fisher.

      • Shravan,

        For me (medical research), I started by seeing the huge amount of individual variability and questioning why this was treated as noise, since it seemed to be the most interesting aspect of the data.

        I eventually began reading computational biology papers. While many authors are far too credulous regarding the experimental evidence, there is an obvious lack of focus on p-values and group comparisons. Instead, the researchers look for patterns in the data, compare them to patterns in data from previous studies (sometimes from different fields), and guess at the process that could explain them.

        They then do their best to create models that are consistent with the data and some theory they have guessed. I expect that over time advances in these fields (along with the ease of information flow over the internet) will allow researchers to learn these tools with less effort. I hope the result will be that the idea of designing experiments to overcome an arbitrary threshold when comparing the averages of two groups will be naturally selected against due to its ineffectiveness.

        My impression, after investigating the methods used by researchers before 1940 or so, is that they also fit this mold, although they had much poorer tools than are available now.

        • Thanks for the clarification, question. Can you point me (and other readers on this blog) to one or two representative papers in computational biology that one can read to learn about published work that looks at computational models and goes beyond p-values? I work in linguistics, closely related to cognitive psychology. I also build computational models and test predicted patterns against data. In our field, even evaluation of models’ predictions against data patterns cannot escape the 0.05 threshold.

  4. Shravan,

    I think the key change in thought is getting away from minimizing the error of a model fit and instead worrying more about the generative process.

    See the discussion in these papers and the comments:

    Beyond curve fitting: a dynamical systems account of exponential learning in a discrete timing task.
    Liu YT, Mayer-Kress G, Newell KM. J Mot Behav. 2003 Jun;35(2):197-207.

    Beyond Curve Fitting to Inferences About Learning.
    Yeou-Teh Liu, Gottfried Mayer-Kress & Karl M. Newell. Journal of Motor Behavior, Volume 36, Issue 2, 2004.

    Here is an example of a paper I was thinking of. I am not claiming their model is correct, only that the way they go about the research differs from the p-value threshold approach:

    A multiplicative reinforcement learning model capturing learning dynamics and interindividual variability in mice.
    Bathellier B, Tee SP, Hrovat C, Rumpel S. Proc Natl Acad Sci U S A. 2013 Dec 3;110(49):19950-5.

    Here is another good one:
    Cuntz H, Forstner F, Borst A, Häusser M (2010) One Rule to Grow Them All: A General Theory of Neuronal Branching and Its Practical Application. PLoS Comput Biol 6(8): e1000877. doi:10.1371/journal.pcbi.1000877
