I would like to thank Andrew for forwarding his comments on uniformly most powerful Bayesian tests (UMPBTs) to me and his invitation to respond to them. I think he (and also Christian Robert) raise a number of interesting points concerning this new class of Bayesian tests, but I think that they may have confounded several issues that might more usefully be examined separately.
The first issue involves the choice of the Bayesian evidence threshold, gamma, used in rejecting a null hypothesis in favor of an alternative hypothesis. Andrew objects to the higher values of gamma proposed in my recent PNAS article on grounds that too many important scientific effects would be missed if thresholds of 25-50 were routinely used. These evidence thresholds correspond roughly to p-values of 0.005; Andrew suggests that evidence thresholds around 5 should continue to be used (gamma=5 corresponds approximately to a 5% significance test).
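The claimed correspondence between evidence thresholds and p-values can be checked in the simplest setting Johnson's paper treats: a one-sided test of a normal mean with known variance, where the UMPBT(gamma) rejects when the z-statistic exceeds sqrt(2·ln(gamma)). A quick sketch (assuming that rejection-region formula):

```python
from math import log, sqrt
from statistics import NormalDist

# For a one-sided test of a normal mean with known variance, the UMPBT(gamma)
# rejects when the z-statistic exceeds sqrt(2 * ln(gamma)).
# Translate a few evidence thresholds into the one-sided p-value at which
# the Bayes factor first crosses gamma.
for gamma in (5, 25, 50):
    z = sqrt(2 * log(gamma))
    p = 1 - NormalDist().cdf(z)
    print(f"gamma = {gamma:2d}: z threshold = {z:.3f}, one-sided p = {p:.4f}")
```

The thresholds 25 and 50 land near p = 0.005, and gamma = 5 lands near the conventional 5% level, matching the correspondence described above.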
My proposal for raising the bar for the declaration of a significant finding stems from widespread concern over the reproducibility of scientific research and the fact that the use of an evidence threshold of 5 leads to false positive rates of approximately 20% when one-half of tested null hypotheses are true. My personal experience at a large cancer center suggests that many more than one-half of tested null hypotheses are true (i.e., far fewer than one-half of novel therapies prove effective). My experience seems typical of surveys of p-values reported in the scientific literature (e.g., Wetzels et al. (2011)). And if more than 50% of tested null hypotheses are true, the use of an evidence threshold of 5 is guaranteed to lead to false positive rates that are even higher than 20%.
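One way to see how a false positive rate of roughly 20% can arise is simple arithmetic on the mix of true and false nulls among declared discoveries. The sketch below assumes a 5% significance level and an average power of 0.2 among real effects; those two values are illustrative assumptions, not figures from the article:

```python
def false_positive_rate(pi0, alpha, power):
    """Share of declared discoveries that are false:
    pi0*alpha / (pi0*alpha + (1 - pi0)*power)."""
    return pi0 * alpha / (pi0 * alpha + (1 - pi0) * power)

# pi0 = 0.5 (half of tested nulls true), alpha = 0.05, power = 0.2
# are illustrative assumptions.
print(false_positive_rate(0.5, 0.05, 0.2))   # 0.2
print(false_positive_rate(0.8, 0.05, 0.2))   # 0.5 -- rises sharply when pi0 > 0.5
```

As the text notes, when the share of true nulls climbs above one-half, the false positive rate among discoveries climbs well above 20%.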
Of course, the evidence threshold that one chooses for declaring a scientific discovery is an inherently subjective choice, and arguments that higher false positive rates should be accepted in order to avoid missing true effects can certainly be made. However, evidence against a false null hypothesis accrues exponentially fast, so I suspect that most decision-theoretic analyses of the optimal threshold to use for declaring a significant finding would lead to evidence thresholds that are substantially greater than 5 (and probably also greater than 25). But the “optimal” choice of threshold certainly varies across disciplines and applications.
Regardless of the evidence threshold that one chooses, I think it is important to separate the choice of this threshold from the method that is used to determine the form of the alternative hypothesis that is specified. Concerning this decision, Andrew draws an analogy between UMPBTs and minimax methodology. I think this analogy is inappropriate for several reasons. Minimax methods are defined by minimizing the maximum loss that a decision maker can suffer. If a (single) loss function has been specified, and if a (single) subjective prior distribution on unknown parameters is available, then I agree that a Bayesian decision rule should be used. In most applications, however, a unique loss function/prior distribution combination does not exist. Instead, a plethora of loss functions and priors exist. In the trial of a new cancer treatment, for example, separate loss functions and prior distributions apply to each individual patient, to each of the patients’ families, to each of the patients’ treating physicians, to each participating medical center, to each of the pharmaceutical sponsors, to each of the patients’ insurance companies, to the regulatory agency(s) overseeing the trial, and even to the biostatistician(s) who designs the trial. For this reason, clinical trials and many other hypothesis tests are instead posed as significance tests.
UMPBTs are clearly not based on the specification of loss functions and so do not involve minimization of maximum loss. Instead of classifying these tests as minimax procedures, I think it is more appropriate to regard them as Bayesian analogs of most powerful tests (hence their name). Simply put, UMPBTs are the class of objective Bayesian hypothesis tests that maximize the probability that the Bayes factor in favor of the alternative hypothesis exceeds a specified threshold. Given that an evidence threshold has been agreed upon, the objection to UMPBTs on the basis that they lead to high rates of false negatives therefore seems misguided. If one decides to use an evidence threshold of, say, 5, then the UMPBT based on gamma=5 is exactly the Bayesian test that maximizes the probability that an effect is detected. No other test, based on either a subjectively or objectively specified alternative hypothesis, is as likely to produce a Bayes factor that exceeds the specified evidence threshold. So if you’re worried about false negatives, then you should use the UMPBT. The only other way to increase the posterior probability of the alternative hypothesis is to adjust the prior probability assigned to it.
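The maximization property can be verified numerically in the point-null versus point-alternative normal-mean test with known variance. There, the Bayes factor exceeds gamma exactly when the sample mean exceeds a threshold t(mu1) = mu1/2 + sigma²·ln(gamma)/(n·mu1), so the alternative that maximizes P(BF > gamma), whatever the true effect, is the mu1 minimizing t(mu1). A sketch (the particular n, sigma, and gamma are illustrative choices):

```python
from math import log, sqrt

# One-sample normal test, H0: mu = 0 vs point alternative H1: mu = mu1,
# known sigma. BF10 = exp(n*(mu1*xbar - mu1**2/2)/sigma**2) exceeds gamma
# exactly when xbar > t(mu1) = mu1/2 + sigma**2*ln(gamma)/(n*mu1).
# P(BF10 > gamma) is maximized by the mu1 that minimizes t(mu1).
n, sigma, gamma = 100, 1.0, 5.0

def threshold(mu1):
    return mu1 / 2 + sigma**2 * log(gamma) / (n * mu1)

grid = [0.001 * k for k in range(1, 2000)]       # candidate alternatives
best = min(grid, key=threshold)                  # numerical minimizer of t(mu1)
closed_form = sigma * sqrt(2 * log(gamma) / n)   # the UMPBT alternative
print(best, closed_form)                         # both approx 0.179
```

The grid search recovers the closed-form UMPBT alternative sigma·sqrt(2·ln(gamma)/n): among all point alternatives, this one is the most likely to push the Bayes factor past the chosen threshold.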
Since I am probably getting close to overstaying my welcome on Andrew’s blog, I would also like to make a short comment on the Jeffreys-Lindley paradox and the assertion that UMPBTs are subject to it. A paradox arises when a situation is viewed from the wrong perspective. The Jeffreys-Lindley paradox arises in the present context when frequentist asymptotic methods are naively used to analyze Bayesian tests involving large samples. That is, it occurs when a Bayesian testing procedure is examined as a sample size is increased, without considering the scientific rationale for collecting a larger sample. When a scientist designs an experiment in which a very large number of items are sampled, she is attempting to either (a) detect a very small effect size, or (b) gain overwhelming evidence in favor of one of the tested hypotheses. In case (a), it is reasonable to regard the evidence threshold as being fixed at a commonly used value (i.e., 5 or 50). Doing so implies that the difference between the parameter value that pertains under the null and alternative hypotheses defined using a UMPBT is small [i.e., O(n^(-1/2))]. But this presents no paradox because the scientist is, by assumption, intending to detect a small effect size. On the other hand, the assumptions inherent to case (b) imply that a higher evidence threshold has been specified and that the sample size was increased to accommodate this higher threshold. When the sample size is increased to obtain a higher degree of certainty regarding the validity of competing hypotheses, then the apparent paradox again disappears because the alternative hypothesis specified under the UMPBT does not collapse onto the null value. The behavior of UMPBTs in large sample settings thus seems quite consistent with scientific practice.
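The two cases can be made concrete with the same normal-mean formula used above, mu* = sigma·sqrt(2·ln(gamma)/n) for the UMPBT alternative. In case (a), gamma is held fixed and mu* shrinks at rate n^(-1/2); in case (b), ln(gamma) grows with n and mu* does not collapse onto the null. The constant c below is an illustrative choice, not a recommended scaling:

```python
from math import log, sqrt

sigma = 1.0

def umpbt_alternative(n, log_gamma):
    # UMPBT alternative for the one-sided normal-mean test,
    # mu* = sigma * sqrt(2*ln(gamma)/n), written in terms of ln(gamma)
    # so that very large thresholds do not overflow.
    return sigma * sqrt(2 * log_gamma / n)

# Case (a): evidence threshold fixed at gamma = 5; the alternative shrinks
# toward the null at rate n^(-1/2) -- by design, to detect a small effect.
for n in (100, 10_000, 1_000_000):
    print(n, round(umpbt_alternative(n, log(5)), 5))

# Case (b): the threshold grows with the sample size (ln(gamma) = c*n,
# c = 0.016 an illustrative constant); the alternative stays fixed.
c = 0.016
for n in (100, 10_000, 1_000_000):
    print(n, round(umpbt_alternative(n, c * n), 5))
```

In case (b) the alternative is constant in n, which is the sense in which the apparent paradox disappears when larger samples are collected to meet a higher evidence bar.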
I hope these comments provide a useful alternative perspective on the issues discussed in Andrew’s commentary, and I would again like to thank him for his graciousness in both calling his commentary to my attention and his invitation to respond to it.
Here is my quick reply:
1. I am not “suggesting that evidence thresholds around 5 should continue to be used.” I don’t really believe in “evidence thresholds,” for two reasons. First, basic decision analysis tells us that decisions should be based on costs and benefits as well as probabilities. Second, evidence thresholds based on conventional priors do not represent posterior distributions that I’d want to use for decisions. I’ve already discussed in various places my problems with flat priors. But Val’s prior, which is a mixture of point masses, is even less like a representation of any actual distribution of effect sizes.
2. I do not think it is so helpful in most scientific settings to label null hypotheses as “true” or “false.” As we’ve discussed often enough on this blog, I’d much prefer to talk about Type S and Type M errors—that is, getting the sign of a comparison wrong, or overstating the magnitude of a comparison. Mistakes get published all the time—I’m with Val on that point—but I think it is helpful to go beyond the false-negative, false-positive thing.
3. Regarding the details of Val’s model, let me just say again that I’m not so interested in minimax or UMPBT probability calculations, because these probabilities come from a model that is so far from anything that could be plausible. Indeed, the models that Val is using are not intended to be plausible; they’re intended to give a rigorous bound. That’s fine, but I don’t see much relevance in a one-sided bound when considering a tradeoff.
In summary, I agree with Val that there seems to be a real problem that p=0.05 empowers a lot of researchers to act as if they have scientific proof when what they really have is noise. And going to a 3-standard-error rule would make it harder. I have no doubt that the Satoshi Kanazawas and Daryl Bems of the world would continue to find statistical significance at a higher threshold, but it would be more work for them to do so, and maybe, at least their papers would come out a bit less frequently, giving the rest of us more time to spend on more important topics. As Val notes in his original paper, changing the threshold makes it only a little bit more difficult to identify large real effects, while putting a much greater burden on those researchers who are shuffling noise around. And I am also supportive of Val’s desire to put all this in a Bayesian framework. But I still think that, to make this really work, we’d want priors that possibly could represent distributions of actual effects; I don’t think minimax does the trick. In any case, though, now that Val’s ideas are out there, they provide a useful comparison point to anything that comes next, and I appreciate Val getting his ideas out there to move the conversation forward.
P.S. Further comments (which I agree with) here from X.