Revised evidence for statistical standards

X and I heard about this much-publicized recent paper by Val Johnson, who suggests changing the default level of statistical significance from z=2 to z=3 (or, as he puts it, going from p=.05 to p=.005 or .001). Val argues that you need to go out to 3 standard errors to get a Bayes factor of 25 or 50 in favor of the alternative hypothesis. I don’t really buy this, first because Val’s model is a weird (to me) mixture of two point masses, which he creates in order to make a minimax argument, and second because I don’t see why you need a Bayes factor of 25 to 50 in order to make a claim. I’d think that a factor of 5:1, say, provides strong information already—if you really believe those odds. The real issue, as I see it, is that we’re getting Bayes factors and posterior probabilities we don’t believe, because we’re assuming flat priors that don’t really make sense. This is a topic that’s come up over and over in recent months on this blog, for example in this discussion of why I don’t believe that early childhood stimulation really raised earnings by 42%—and not because I think the study in question was horribly flawed (sure, it suffers from selection issues and more could be done in the analysis, but the same could be said of just about any observational study, including many if not all of mine) but because, fundamentally, a point estimate of 42% is a Bayes estimate of 42% if you have a flat prior, and I don’t have a flat prior, I think effects are typically much closer to zero.

Anyway, that’s all background. Val’s paper got enough attention that X and I thought it would be worth trying to clear the air about a couple of points, most notably where his 0.005 came from and how it could be interpreted.

Here’s what X and I wrote:

In his article, “Revised standards for statistical evidence,” Valen Johnson proposes replacing the usual p = 0.05 standard for significance with the more stringent p = 0.005. This might be good advice in practice but we remain troubled by Johnson’s logic because it seems to dodge the essential nature of any such rule, that it expresses a tradeoff between the risks of publishing misleading results and of important results being left unpublished. Ultimately such decisions should depend on costs, benefits, and probabilities of all outcomes.

Johnson’s minimax prior is not intended to correspond to any distribution of effect sizes; rather it represents a worst-case scenario under some mathematical assumptions. Minimax and tradeoffs do not play well together (Berger, 1985), and it is hard for us to see how any worst-case procedure can supply much guidance on how to balance between two different losses.

Johnson’s evidence threshold is chosen relative to a conventional value, namely Jeffreys’ target Bayes factor of 1/25 or 1/50, for which we do not see any particular justification except with reference to the tail-area probability of 0.025, traditionally associated with statistical significance.

To understand the difficulty of this approach, consider the hypothetical scenario in which R. A. Fisher had chosen p = 0.005 rather than p = 0.05 as a significance threshold. In this alternative history, the discrepancy between p-values and Bayes factors remains and Johnson could have written a paper noting that the accepted 0.005 standard fails to correspond to 200-to-1 evidence against the null. Indeed, 200:1 evidence in a minimax sense gets processed by his fixed-point equation γ = exp[z*sqrt(2 log(γ)) − log(γ)] at the value γ = 200 (that is, 1/γ = 0.005), into z = sqrt(2 log(200)) = sqrt(−2 log(0.005)) = 3.26, which corresponds to a (one-sided) tail probability of Φ(−3.26), approximately 0.0005. Moreover, the proposition approximately divides any small initial p-level by a factor of sqrt(−4π log p), roughly equal to 10 for the p’s of interest. Thus, Johnson’s recommended threshold p = 0.005 stems from taking 1/20 as a starting point; p = 0.005 has no justification on its own (any more than does the p = 0.0005 threshold derived from the alternative default standard of 1/200).
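
The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch: it assumes natural logarithms and takes z = sqrt(2 log γ) as the solution of the fixed-point equation as quoted, and the function names are made up for illustration rather than taken from Johnson's paper.

```python
import math

def threshold_z(gamma):
    """z solving gamma = exp(z*sqrt(2 log gamma) - log gamma), i.e. z = sqrt(2 log gamma)."""
    return math.sqrt(2.0 * math.log(gamma))

def upper_tail(z):
    """One-sided tail probability P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def shrink_factor(p):
    """The approximate divisor sqrt(-4*pi*log p) mentioned in the text."""
    return math.sqrt(-4.0 * math.pi * math.log(p))

for gamma in (20, 200):
    z = threshold_z(gamma)
    p0 = 1.0 / gamma
    print(f"evidence {gamma}:1  z = {z:.2f}  exact tail = {upper_tail(z):.5f}  "
          f"approximation p/sqrt(-4 pi log p) = {p0 / shrink_factor(p0):.5f}")
```

Running it for 20:1 and 200:1 gives a quick way to compare the exact tail area with the divide-by-sqrt(−4π log p) shortcut.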

One might then ask, was Fisher foolish to settle for the p = 0.05 rule that has caused so many problems in later decades? We would argue that the appropriate significance level depends on the scenario, and that what worked well for agricultural experiments in the 1920s might not be so appropriate for many applications in modern biosciences. Thus, Johnson’s recommendation to rethink significance thresholds seems like a good idea that needs to include assessments of actual costs, benefits, and probabilities, rather than being based on an abstract calculation.

X and I seem to be getting into a habit of writing “soft” papers (in particular, this little article, this book review, and our discussion and rejoinder on Feller), but in our defense let me point out that the above analysis does involve some algebra (yes, it’s pretty simple but we did a bunch of other calculations too, as usual these things look simple at the end only after some careful thinking went on earlier), also we are trying to do some real research as well (including some work on Bayes factors and posterior probabilities motivated by our conversations about Val’s paper).

X presents our discussion on his blog here.

1. John says:

Maybe we need to have separate standards for publishability and believability. If we restrict scientists to only publishing results that are very likely to be correct, some fields would have to practically stop publishing papers.

Publications serve two purposes — an announcement of scientific results, and a record of professional activity — and these are in tension. Academics are not rewarded for being right, they’re rewarded for publishing. (Nobel prizes may be an exception. Peter Higgs got his prize because it appears he was correct. But he also almost lost his job for not publishing enough.)

• *Sigh* and this is exactly the kind of thing that makes me want to stay out of academia. I have such a love-hate relationship with academia, but I know that while I’m not necessarily likely to publish things as important as Peter Higgs, I’m sure as hell not going to be able to sleep at night if I am pumping out paper after paper about blather whose only real purpose is to up my publication score. My wife has had the same problem: she’s a very careful bio-scientist who takes the time and effort to get things really understood in a fundamental and correct way. This leads to the kind of career stress you are talking about.

The hedge funds that are large universities are also not really about “academic” performance as organizations. They’re chasing donations and bloating their administration as fast as possible to game a financial system in much the same way that banks were 5 to 10 years ago. (just think of “student loans” as the new “predatory mortgage”)

• Rahul says:

+1

• +2

… or you can go with it and publish things as a record of your activity even if you’re not as sure as you would like about being right?

• There are at least a couple problems with this:

1) you pollute the world with things like “lower salt diets reduce the risk of heart disease” followed by “lower salt diets do not have clear benefit for heart disease” followed by “lower salt diets increase risk of overall mortality” or…

2) You waste society’s time and money looking at things that have very little risk of causing harm when you’re wrong because they have very little importance for the world at large… such as “new dynamic model of iron transport in the ocean shows that previous dynamic model of iron transport in the ocean may have been off by a factor of 2 in small regions near the coast of Alaska”. Sure it’s of interest to a few people, and sure it might someday be relevant to some actual decision someone has to make, but in point of fact, both models are probably very wrong anyway especially since it takes enormous amounts of money to collect enough data to calibrate them, and they’re poorly specified to begin with.

The third option is to do good careful science that means something important for the world, like studying the actual physiological mechanisms and feedbacks associated with blood pressure regulation and the effect of diet on that process and things like that… publish 2 or 3 extremely valuable and correct papers a decade… if you make it past your 3 to 5 year mid-tenure review with no publications and no grants. This seems to appeal to people who like playing high-stakes lotteries or dealing drugs:

• Andrew says:

Daniel:

Different people have different styles. It would kill me to only publish 2 or 3 papers a decade!

• Rahul says:

Problem is at the reader end. We’ve given a reader no easy way to distinguish between papers. An “extremely valuable & correct paper” looks the same superficially as crap churned out in a month.

What we sorely need is some flavor of reputational or quality validation metric.

• I think it’s not just different people, but different fields. I have experience in engineering/physics and biology/medicine. Particularly in biology/medicine it can take a big part of a decade to do a laboratory study or a series of clinical trials. Suppose it takes 5 years to carry out some studies in mice, and you get two good papers out of that, and then another 5 years to follow up with a new line of research using different techniques… with 2 good papers. In 10 years you have 4 good papers.

In social sciences on the other hand, there’s all this data collected by others just waiting to be looked at: voting, census, bureau of labor stats, world bank… it would be crazy to publish 3 or 4 papers a decade when there’s no-cost data to be had.

Perhaps engineering/physics is in-between. There’s plenty of room for theoretical studies, computational followups, and some experiments to confirm. There are no animal use restrictions or human clinical trials to be organized. Maybe a reasonable rate for civil engineering is 1 good paper a year, two if you are pure theory with no experimental/computational component.

• Also note, the biology work will have had a PI, a postdoc, a grad student, and 2 technicians involved in each paper.

The engineering paper will have had a PI and a grad student or a postdoc, usually not both. So, if a PI can supervise 1-4 grad students or postdocs in biology the PI can get maybe a factor of 2 over this base rate, in engineering maybe up to a factor of 4. Typically though that’s going to be in steady-state (i.e., after the tenure period at maybe 10 yrs into the career).

But then depending on the style of interaction, putting the PI on some of these papers can be questionable. Particularly say when post-docs get some initial suggestion from a PI, go off and do all the work and can never get a meeting with the PI… sometimes it’s just rubber stamping the PI’s name on the author list, perhaps largely because they got the grant to do the work. Situations can vary a lot both in different fields and from person to person. There are notorious examples of this sort of thing in Physics that I’ve heard about, where the person who did the actual work wasn’t even put on the paper.

2. Ian says:

I am also trying to envision how we will generate subject-specific prior distributions that would be accepted by most of our peers (and our field-specific journals). We could use meta-analyses to help generate this (at least up to a point). This would be very useful for computing odds ratios and posterior probabilities initially. However, would our updated posteriors then be used for the prior distribution by the field? How often would we need to update it (as a field)? I could also see people doing some silly things by combining the old prior distribution with the new posterior to generate a prior (which would effectively incorporate the information from the old prior twice).

Maybe you were envisioning something different?
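
The double-counting worry in this comment can be made concrete with a small sketch. It assumes a normal prior and a normal likelihood, so that information combines by adding precisions; the numbers and the `combine` helper are hypothetical and chosen only for illustration.

```python
def combine(mean_a, prec_a, mean_b, prec_b):
    """Precision-weighted combination of two normal sources of information."""
    prec = prec_a + prec_b
    mean = (prec_a * mean_a + prec_b * mean_b) / prec
    return mean, prec

# Hypothetical field prior: mean 0, sd 0.5 (precision 4).
prior = (0.0, 4.0)
# Hypothetical new study: estimate 1.0, standard error 1 (precision 1).
study = (1.0, 1.0)

# Correct update: prior and data each enter once.
posterior = combine(*prior, *study)

# The "silly" update: old prior merged with the new posterior,
# so the prior's precision is counted twice.
double_counted = combine(*prior, *posterior)

print("posterior      (mean, precision):", posterior)
print("double counted (mean, precision):", double_counted)
```

In this toy setup the double-counted version claims more precision than the prior and the study together can justify, and it pulls the estimate further toward the prior mean.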

• Andrew says:

Ian:

Yes, it’s tough. One answer (appropriate in that we’ve been thinking a lot about Lindley lately) is to say that it is good discipline to be forced to state and work with a subject-matter-specific prior distribution for effect sizes. Even if the prior is wrong (as it certainly will be) in the sense of not actually capturing the population distribution of true effects, it can be a start, and it points the discussion on the article toward a discussion of base rates and existing evidence for effect sizes.

• K? O'Rourke says:

Sander Greenland and likely others have written on this – getting a credible reference set of past studies and then flattening the (correctly calculated) posterior to get a _safer_ prior seemed to be more critical considerations.

But it is always compared to the alternative (e.g., as W. C. Fields once said, the alternative to aging is not that attractive).

• charles says:

The multiple testing corrections in genomics are a roundabout way of introducing a domain-specific prior. As sample sizes get larger and larger, issues with correcting for base rates using multiple comparison adjustments will probably lead to some confusion.

3. Andrew, in reading Entsophy’s crusades against frequentism it has occurred to me several times that the normalization of probability distributions to integrate to 1 is more or less a very very convenient convention, and that other types of normalization can also be relevant. I’m thinking in particular of normalizing a density so that its peak density is 1, and then choosing ‘confidence intervals’ based on relative probability density. So for example instead of choosing the region which contains 95% of the probability mass, why not choose the region which contains density as a fraction of max density which is greater than say 0.01 or some other convention. I’ve seen this similar idea before in frequentist contexts where “likelihood ratio” based confidence intervals are occasionally encouraged.

(note I am fully aware and agree with your point about the best method for evaluating claims being a fully decision theoretic one, including costs, benefits, and probabilities. I’m just pointing out here that the standard p value construction using tail probability is not necessarily very Bayesian)

To get a sense of how this works relative to standard methods, consider two cases: the standard Normal and the Cauchy. Under this recommendation (relative density of at least 0.01) we have roughly normal ~ [-3,3] and cauchy ~ [-10,10].

the associated core probabilities are: 99.7% and 93.7%

I think the point of all this is to emphasize two things. First, the density means how credible a given region is, and a relative density of 0.01 means 100 times less credible than the max. This is a local measure, unlike the tail probability, which emphasizes the probability that something might exceed a certain value (or more generally be ANYWHERE outside a certain region). Second, long-tailed distributions can require us to go pretty far from actually credible values in order to integrate enough total probability to make up say 95% of the total. The central 95% interval for the Cauchy is [-12.7, 12.7], but the density at 12.7 is 0.62% of the max density: 162 times less credible than the value at 0.
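
The numbers in this comment are easy to reproduce. The sketch below assumes the standard normal and standard Cauchy densities and uses only closed-form expressions; the function names are invented for this illustration.

```python
import math

def normal_halfwidth(ratio):
    """x where the normal density falls to `ratio` of its peak: exp(-x^2/2) = ratio."""
    return math.sqrt(-2.0 * math.log(ratio))

def cauchy_halfwidth(ratio):
    """x where the Cauchy density falls to `ratio` of its peak: 1/(1+x^2) = ratio."""
    return math.sqrt(1.0 / ratio - 1.0)

def normal_mass(x):
    """Probability that a standard normal lies in [-x, x]."""
    return math.erf(x / math.sqrt(2.0))

def cauchy_mass(x):
    """Probability that a standard Cauchy lies in [-x, x]."""
    return 2.0 / math.pi * math.atan(x)

def cauchy_central_halfwidth(mass):
    """Half-width of the central interval holding `mass` of a standard Cauchy."""
    return math.tan(math.pi * mass / 2.0)

def cauchy_relative_density(x):
    """Cauchy density at x as a fraction of the density at 0."""
    return 1.0 / (1.0 + x * x)

r = 0.01
print("normal region:", normal_halfwidth(r), "mass:", normal_mass(normal_halfwidth(r)))
print("cauchy region:", cauchy_halfwidth(r), "mass:", cauchy_mass(cauchy_halfwidth(r)))
w = cauchy_central_halfwidth(0.95)
print("cauchy 95% half-width:", w, "relative density there:", cauchy_relative_density(w))
```

Changing `r` shows how the relative-density convention and the tail-area convention drift apart as the tails get heavier.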

• K? O'Rourke says:

Doesn’t the “emphasize two things” bring us back to here http://statmodeling.stat.columbia.edu/2013/11/21/hidden-dangers-noninformative-priors/#comment-151654 ?

Also your first paragraph does seem like a description of relative belief intervals that Mike Evans is working on e.g. http://ba.stat.cmu.edu/journal/forthcoming/evans.pdf

• jrc says:

Man – you are all on fire today. I’ll throw an option I’ve been toying with just so I get to play too:

How about an empirical idea, and I’ll stick with the social sciences but I think it could work for lab rats (literally) too.

We think some new teaching method will improve test scores. We get scores from everyone, get a point estimate of the effect of treatment using a regression/comparison, and then do a permutation test. This gives us a kind of empirical p-value. But we want to know something about precision – so let’s say our point estimate is 1sd (student scores under new teaching method are 1sd higher than under old method, and that result is unlikely given the variation in the data). So now we subtract .1sd from the scores of all the students in treated schools, and do another permutation test. We reject that at a rate below some threshold, so now we subtract .2sd and run it again. Eventually, we’ll fail to reject based on our threshold, and we’ve found a lower bound (the last amount we could subtract and still reject).
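
A rough sketch of the procedure described above, under several assumptions that are not in the comment itself: a one-sided test on the difference in means, a Monte Carlo permutation p-value with a fixed seed, a 0.05 threshold, and made-up scores. The step is in the units of the scores, so with standardized scores a step of 0.1 corresponds to 0.1 sd.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def perm_pvalue(treated, control, n_perm=2000, rng=None):
    """One-sided Monte Carlo permutation p-value for mean(treated) - mean(control) > 0."""
    if rng is None:
        rng = random.Random(0)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of treated vs control
        if mean(pooled[:n_t]) - mean(pooled[n_t:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def lower_bound(treated, control, step=0.1, alpha=0.05, max_steps=100, seed=0):
    """Largest multiple of `step` that can be subtracted from the treated
    scores while the permutation test still rejects; None if it never rejects."""
    last_rejected = None
    for k in range(max_steps + 1):
        shift = k * step
        shifted = [t - shift for t in treated]
        if perm_pvalue(shifted, control, rng=random.Random(seed)) >= alpha:
            break
        last_rejected = shift
    return last_rejected

# Made-up scores: treated scores are the control scores plus one unit.
control = [i / 10 for i in range(-20, 21)]
treated = [c + 1.0 for c in control]
print("lower bound on the effect:", lower_bound(treated, control))
```

The loop is in effect inverting the test to get a one-sided confidence bound, so the answer depends on the step size and on the threshold chosen.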

I guess my point is that from my empiricist standpoint, a permutation test gives us a really reliable p-value given the sample. That doesn’t necessarily translate into a p-value about the population (or super-population, or whatever), but it is clean and clear and appropriately relative (relative to the other permutations of the data). I guess for me the question of “what is the probability that this result reveals something real in the world” is just asking for too much, and sort of asking an awkward question, because, as Andrew hammers over and over, there probably isn’t one particular real world parameter out there in the first place.

I come back to the discussion of what statistics should do – data analysis should help us learn about the world. The first step: “is the result I get likely given the data?” (that’s where my permutation test/lower bound come in). The second step is “how much does this change my view of the world or impact how I should act in the future?”. Here is where Andrew’s cost/benefit/probability decision making comes in. The only part I remain unconvinced about is where in the process we should model the cost/benefit/probability tradeoffs – the Gelman Bayesian response seems to be “in the estimation process,” whereas my skeptical view and desire for super-clean-and-clear statistical analysis leans (at the moment) more to “after the estimation process.”

• Permutation tests and similar random number generator strategies answer the question “Is this data consistent with a model in which they come from some particular purely random number generator”. I guess there are situations where this makes sense. For example, you’re trying to use your “knowledge” to pick important genes out of the genome and you want to determine if the ones you picked have more of some specific feature than a set selected by a random number generator, or if you want to determine whether a particular economic outcome is consistent with pure noise. Typically though, if you have an actual model of some process, the Bayesian machinery is going to address the question of interest more directly: “what are the plausible values of the parameters in the model of the process?” You can then determine whether sign or magnitude errors in those parameters are important.

So, I guess what I’m saying is it all depends on what question you’re asking and whether “random number generation” is a plausible explanation for your data.

4. Rahul says:

“because we’re assuming flat priors that don’t really make sense.”

One thing I’d love to do is go to ten random researchers and ask them to draw the prior that they’d use. And then compare how similar they are.

Flat priors may not make sense, but is there a consensus prior at all?

• Corey says:

It’s been done, more-or-less. (Key word: “elicitation”.) Subject matter experts tend to be overconfident, leading to a lack of intersubjective coherence, e.g., in a collection of 90% subjective intervals for some unknown quantity, far fewer than 90% of the intervals would contain the unknown value, for *any* possible value.

• Hence the tendency to want to broaden expert priors before using them as a kind of “conservatism”. But broadening expert priors might be a much better way to go than starting with infinitely broad “defaults”.

• In particular it might be interesting to do Bayesian inference on “broadening factors” for expert opinion in various contexts. Meta-Bayes!

• Rahul says:

But if “expert” opinion seems so broad & no expert essentially agrees with the others what’s the point in pushing people away from flat priors?

Dump flat priors and then go to _what_? No one seems to even remotely agree on a good prior choice for any one problem……

• Andrew says:

Rahul:

It depends on the example. But in lots of cases reasonable researchers can agree at least on weak priors. For example if you’re studying changes of vote preferences during presidential election campaigns, it’s highly unlikely there will be true effects of 5 percentage points or more.

The idea that 5:1 odds represents strong evidence came up recently on this blog. As I pointed out then, it’s the sort of odds that routinely gets beaten at the poker table – doesn’t that make it a stretch to call it strong evidence?

I prefer to think of 5:1 odds primarily as an expression of serious uncertainty.

• Andrew says:

It depends on the context. But if I think there’s a 5:1 chance that Treatment A will help me more than Treatment B (and assuming that the effect size is not tiny, that the distribution of effects is symmetrical, and there are no huge cost differences between the treatments), then, yes, I’d think that’s a strong reason to go with Treatment A. Again, I think that one reason people don’t think of 5:1 as strong odds is that the odds that get calculated are typically calculated with respect to priors that we don’t believe.

Setting aside the issue of the priors for now, I’d say we need vocabulary to distinguish between (weak) evidence that’s strong enough to give us a clear preference for one action over another without removing our sense of uncertainty, and (strong) evidence that all but removes uncertainty.

• Andrew says:

That’s where the decision analysis comes in. If we’re worried about a major earthquake hitting NYC tomorrow, then 5:1 odds or even 5000:1 odds are not enough for anything close to certainty! I think that with scientific claims, it’s rare that one paper will give certainty, especially in the part of science that needs statistics to demonstrate that effects could not have just come by chance. So, there, I think we need to generally move away from the hope/expectation/norm that a single study gives near-certainty. Maybe that’s why I’m ok with 5:1 odds, because I’m typically not prepared to take a single study as definitive evidence in any case.

6. Steve Sailer says:

Bayesianism is logically superior to Fisherianism, but “priors” sound a lot like “prejudices,” which we all know are the worst things in the world, so Fisher’s dopey 0.05 system at least isn’t prejudiced.

• Andrew says:

Steve:

I wouldn’t say that Bayesianism is logically superior to Fisherianism—it all depends on how good the assumptions are, and what information is available—but, in any case, the 0.05 system can indeed be prejudiced, because there is a lot of choice in what tests to make, and what test to focus on.

7. EJ Wagenmakers says:

Jeffreys argued that Bayes factors lower than 3 are “not worth more than a bare mention”. Jeffreys also described a Bayes factor as high as 5.33 as “odds that would interest a gambler, but would be hardly worth more than a passing mention in a scientific paper” (Jeffreys, 1961, pp. 256-257). I guess the Bayes factor is what it is, and there should be no need for arbitrary thresholds, but I personally do like the rule of 3 for being “not worth more than a bare mention”; this prevents researchers from making strong claims based on flimsy evidence.

• Andrew says:

EJ:

Again, I think this comes from Jeffreys (and others) using priors they don’t really believe. 3:1 odds based on a flat prior you don’t believe, sure, that’s not much to go on. But an actual 75% probability (for example, the consensus 75% probability, as of October 2012, that Obama would win the election), that’s worth a mention for sure.

As we’ve discussed, if you start with a flat prior on theta and then observe data y ~ N(theta,1) with y=1, then your posterior probability is 84% that theta>0. That’s 5:1 odds. But it’s not real 5:1 odds. It’s 5:1 odds based on a model we don’t believe.
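
Here is a small sketch of that calculation, with one addition that is purely an assumption for illustration: a normal prior centered at zero with standard deviation 0.5 (in standard-error units), standing in for the belief that effects are typically close to zero. The function names are invented.

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_positive(y, prior_sd=None):
    """P(theta > 0 | y) when y ~ N(theta, 1).

    prior_sd=None means a flat prior on theta; otherwise the prior is
    N(0, prior_sd^2), which is a hypothetical choice, not a recommendation.
    """
    if prior_sd is None:
        post_mean, post_var = y, 1.0
    else:
        shrink = prior_sd ** 2 / (prior_sd ** 2 + 1.0)
        post_mean, post_var = shrink * y, shrink
    return std_normal_cdf(post_mean / math.sqrt(post_var))

for label, sd in (("flat prior", None), ("N(0, 0.5^2) prior", 0.5)):
    p = prob_positive(1.0, sd)
    print(f"{label}: P(theta > 0 | y = 1) = {p:.3f}, odds = {p / (1 - p):.2f}")
```

With the flat prior this reproduces the 84% and roughly 5:1 figures; with the tighter prior the odds drop considerably, which is the sense in which the 5:1 is not "real" if the flat prior is not believed.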

• Mayo says:

Andrew: This accordance with 1-sided tests, however, differs from Johnson who appears to put .5 on the null and on the max likely point alternative.

8. Nick Menzies says:

It seems to me that another consequence of this change would be to increase the bias in reported effect sizes, unless researchers quickly respond by increasing the effective sample size of their studies. Even in those cases where our prior is that an effect exists, with the same sign as shown in our analysis, what we get from the study will be too large. This would be especially true in fields where the power of individual studies is limited for some reason, so strong conclusions have historically required the accumulation of many studies pointing in the same direction.

The problem arises due to the separation of inference and decision-making. If the algorithm is…
(1) Evaluate whether intervention/policy/etc has a big enough effect that we should care (this usually means statistical significance and an effect size that we might care about). If step 1 says yes, then..
(2) Conduct some kind of cost-benefit analysis that weighs the consequences of the intervention/policy/etc against what we might do otherwise.
(3) Act on the results of step 2.

Because of the way we do step 1, we are biasing the results of step 2.

To me it seems the requirements we place on the quality of evidence derive pretty directly from the friction costs of policy change (which might be infinite if a policy is irreversible). If we can switch between interventions at no cost, then there is no role for statistical significance, even if reformulated in a more appropriate way (http://tinyurl.com/Claxton-Irrelevance). The potential value of further research is another reason to consider current uncertainty, but as both of these issues (friction costs of policy change, value of new research) are so context dependent, no single rule will suffice.

In reality we can’t expect every study to undertake a decision analysis, especially since a lot of study results will only relate indirectly to actual decisions, so we need some heuristics to work with. If this is the case, it seems to me that a p<0.1 would actually be better than a p<0.005 criterion. For p<0.1, we have more information reported, so less of a bias in effect sizes. As we would have so many 'conflicting' results reported, it would be clearer that we need to base our beliefs on some kind of evidence synthesis rather than just the results of the most recent trial. In other words, a p<0.1 criterion would propel us more rapidly towards the appropriate system for understanding new research findings.
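
The bias argument in this comment can be illustrated with a simplified calculation. The sketch assumes an estimate y ~ N(θ, 1) in standard-error units that gets reported only when it exceeds the two-sided critical value in the positive direction (wrong-sign significant results are ignored), and a true effect of one standard error, which is a hypothetical low-power setting. It uses the standard formula for the mean of a truncated normal.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal

def critical_z(p_two_sided):
    """Critical value for a two-sided test at level p_two_sided."""
    return STD.inv_cdf(1.0 - p_two_sided / 2.0)

def mean_reported_estimate(theta, p_two_sided):
    """E[y | y > c] for y ~ N(theta, 1), with c the critical value."""
    a = critical_z(p_two_sided) - theta
    return theta + STD.pdf(a) / (1.0 - STD.cdf(a))

def exaggeration(theta, p_two_sided):
    """Average reported estimate divided by the true effect."""
    return mean_reported_estimate(theta, p_two_sided) / theta

theta = 1.0  # hypothetical true effect, in standard-error units
for p in (0.1, 0.05, 0.005):
    print(f"p < {p}: reported estimates average "
          f"{exaggeration(theta, p):.2f} times the true effect")
```

Under these assumptions the exaggeration grows as the threshold gets stricter, which is the direction of the bias described above; how large it is in practice depends on the power of the studies in the field.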


10. Mayo says:

Andrew: The past several weeks, a number of people had been sending me Johnson’s “Uniformly Most Powerful Bayesian Tests” paper, and now that our term is complete, I’ve had a look at it. How lucky, therefore, that your note was just posted on it. It helped me to get (more or less) to the bottom of things, and I agree with what you say (especially about the .5 priors). I’m wondering how he interprets a rejection (assuming we have set up one of the tests he recommends). Take his example of a one sided (positive) Normal test (Ex. 3.2 p. 15) with sigma known. Does one take a rejection as evidence for the specific alternative against which the Bayes Factor reaches his chosen gamma? Or does one just infer evidence for the composite non-null?

I am a bit puzzled to see Johnson say, on p. 3, that his approach “provides a remedy to the two primary deficiencies of classical significance tests—their inability to quantify evidence in favor of the null hypothesis when the null hypothesis is not rejected, and their tendency to exaggerate evidence against the null when it is.” I would deny these, but I’ll put off reacting until I get clear on his interpretation of reject the null. Any thoughts would be appreciated.

11. Mayo says:

OK, I’ve pretty much figured out Johnson for now, and have some comments, if interested, here: http://errorstatistics.com/2013/12/19/a-spanos-lecture-on-frequentist-hypothesis-testing/#comment-18737
