Andrew: This accordance with 1-sided tests, however, differs from Johnson, who appears to put .5 on the null and .5 on the maximally likely point alternative.

I am a bit puzzled to see Johnson say, on p. 3, that his approach “provides a remedy to the two primary deficiencies of classical significance tests—their inability to quantify evidence in favor of the null hypothesis when the null hypothesis is not rejected, and their tendency to exaggerate evidence against the null when it is.” I would deny these, but I’ll put off reacting until I get clear on his interpretation of “reject the null.” Any thoughts would be appreciated.

Permutation tests and similar random-number-generator strategies answer the question “Are these data consistent with a model in which they come from some particular purely random number generator?” I guess there are situations where this makes sense. For example, you’re trying to use your “knowledge” to pick important genes out of the genome and you want to determine if the ones you picked have more of some specific feature than a set selected by a random number generator, or you want to determine whether a particular economic outcome is consistent with pure noise. Typically, though, if you have an actual model of some process, the Bayesian machinery is going to address the question of interest more directly: “What are the plausible values of the parameters in the model of the process?” You can then determine whether sign or magnitude errors in those parameters are important.

So, I guess what I’m saying is it all depends on what question you’re asking and whether “random number generation” is a plausible explanation for your data.
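The gene-picking version above can be sketched as a Monte Carlo comparison against random sets of the same size. This is a minimal illustration (function and variable names are mine, not from the comment):

```python
import random

def enrichment_p(picked_scores, all_scores, n_draws=10_000, seed=0):
    """Fraction of random same-size gene sets whose mean feature score
    is at least as high as the picked set's -- a Monte Carlo p-value."""
    rng = random.Random(seed)
    k = len(picked_scores)
    observed = sum(picked_scores) / k
    hits = sum(
        1 for _ in range(n_draws)
        if sum(rng.sample(all_scores, k)) / k >= observed
    )
    return hits / n_draws
```

If the picked genes score no better than a random draw, this p-value hovers near the middle of its range; a tiny value says “a random number generator rarely does this well.”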

EJ:

Again, I think this comes from Jeffreys (and others) using priors they don’t really believe. 3:1 odds based on a flat prior you don’t believe, sure, that’s not much to go on. But an actual 75% probability (for example, the consensus 75% probability, as of October 2012, that Obama would win the election), that’s worth a mention for sure.

As we’ve discussed, if you start with a flat prior on theta and then observe data y ~ N(theta, 1) with y = 1, then your posterior probability is 84% that theta > 0. That’s about 5:1 odds. But it’s not real 5:1 odds. It’s 5:1 odds based on a model we don’t believe.
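The arithmetic behind the 84% is a one-liner; a quick check using nothing beyond the standard normal CDF:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Flat prior on theta, y ~ N(theta, 1), observed y = 1:
# the posterior is theta ~ N(1, 1), so
p = norm_cdf(1.0)     # P(theta > 0 | y = 1) ≈ 0.84
odds = p / (1 - p)    # ≈ 5.3, i.e. roughly 5:1
print(p, odds)
```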

The problem arises due to the separation of inference and decision-making. If the algorithm is…

(1) Evaluate whether an intervention/policy/etc. has a big enough effect that we should care (this usually means statistical significance and an effect size that we might care about). If step 1 says yes, then…

(2) Conduct some kind of cost-benefit analysis that weighs the consequences of the intervention/policy/etc. against what we might do otherwise.

(3) Act on the results of step 2.

Because of the way we do step 1, we are biasing the results of step 2.

To me it seems the requirements we place on the quality of evidence derive pretty directly from the friction costs of policy change (which might be infinite if a policy is irreversible). If we can switch between interventions at no cost, then there is no role for statistical significance, even if reformulated in a more appropriate way (http://tinyurl.com/Claxton-Irrelevance). The potential value of further research is another reason to consider current uncertainty, but as both of these issues (friction costs of policy change, value of new research) are so context dependent, no single rule will suffice.

In reality we can’t expect every study to undertake a decision analysis, especially since a lot of study results will only relate indirectly to actual decisions, so we need some heuristics to work with. If this is the case, it seems to me that a p<0.1 criterion would actually be better than a p<0.005 criterion. For p<0.1, we have more information reported, so less of a bias in effect sizes. As we would have so many 'conflicting' results reported, it would be clearer that we need to base our beliefs on some kind of evidence synthesis rather than just the results of the most recent trial. In other words, a p<0.1 criterion would propel us more rapidly toward the appropriate system for understanding new research findings.
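The effect-size bias claim can be checked by simulation: condition reporting on clearing a z threshold and compare the average reported estimate to the truth. A sketch under made-up numbers (true effect 0.2, standard error 0.2; one-sided z cutoffs standing in for the p<0.1 and p<0.005 criteria):

```python
import random
from statistics import mean

def mean_reported_effect(true_effect, se, z_crit, n_studies=100_000, seed=1):
    """Average estimate among simulated studies that clear a z threshold,
    i.e., the ones that get reported."""
    rng = random.Random(seed)
    estimates = (rng.gauss(true_effect, se) for _ in range(n_studies))
    return mean(est for est in estimates if est / se > z_crit)

true_effect, se = 0.2, 0.2
loose = mean_reported_effect(true_effect, se, 1.28)   # roughly one-sided p < 0.10
strict = mean_reported_effect(true_effect, se, 2.58)  # roughly one-sided p < 0.005
print(loose, strict)  # both exceed 0.2; the stricter filter exaggerates more
```

Both filters overstate the truth, but the stricter one keeps only the luckiest draws and so overstates it more, which is the bias the comment points to.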

Also note, the biology work will have had a PI, a postdoc, a grad student, and 2 technicians involved in each paper.

The engineering paper will have had a PI and a grad student or a postdoc, usually not both. So, if a PI can supervise 1-4 grad students or postdocs, in biology the PI can get maybe a factor of 2 over this base rate, in engineering maybe up to a factor of 4. Typically, though, that’s going to be in steady state (i.e., after the tenure period, at maybe 10 years into the career).

But then, depending on the style of interaction, putting the PI on some of these papers can be questionable. Particularly, say, when postdocs get some initial suggestion from a PI, go off and do all the work, and can never get a meeting with the PI… sometimes it’s just rubber-stamping the PI’s name on the author list, perhaps largely because they got the grant to do the work. Situations can vary a lot both in different fields and from person to person. There are notorious examples of this sort of thing in Physics that I’ve heard about, where the person who did the actual work wasn’t even put on the paper.

Yes.

I think it’s not just different people, but different fields. I have experience in engineering/physics and biology/medicine. Particularly in biology/medicine it can take a big part of a decade to do a laboratory study or a series of clinical trials. Suppose it takes 5 years to carry out some studies in mice, and you get two good papers out of that, and then another 5 years to follow up with a new line of research using different techniques… with 2 good papers. In 10 years you have 4 good papers.

In social sciences on the other hand, there’s all this data collected by others just waiting to be looked at: voting, census, bureau of labor stats, world bank… it would be crazy to publish 3 or 4 papers a decade when there’s no-cost data to be had.

Perhaps engineering/physics is in-between. There’s plenty of room for theoretical studies, computational followups, and some experiments to confirm. There are no animal use restrictions or human clinical trials to be organized. Maybe a reasonable rate for civil engineering is 1 good paper a year, two if you are pure theory with no experimental/computational component.

Steve:

I wouldn’t say that Bayesianism is logically superior to Fisherianism—it all depends on how good the assumptions are, and what information is available—but, in any case, the 0.05 system can indeed be prejudiced, because there is a lot of choice in what tests to make, and what test to focus on.

Rahul:

It depends on the example. But in lots of cases reasonable researchers can agree at least on weak priors. For example if you’re studying changes of vote preferences during presidential election campaigns, it’s highly unlikely there will be true effects of 5 percentage points or more.

Problem is at the reader end. We’ve given a reader no easy way to distinguish between papers. An “extremely valuable & correct paper” looks the same superficially as crap churned out in a month.

What we sorely need is some flavor of reputational or quality validation metric.

But if “expert” opinion seems so broad and no expert essentially agrees with the others, what’s the point in pushing people away from flat priors?

Dump flat priors and then go to _what_? No one seems to even remotely agree on a good prior choice for any one problem…

The multiple testing corrections in genomics are a roundabout way of introducing a domain-specific prior. As sample sizes get larger and larger, issues with correcting for base rates using multiple comparison adjustments will probably lead to some confusion.
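To make the "roundabout prior" point concrete, here is a bare-bones Benjamini–Hochberg step-up procedure (my own illustrative implementation, not anything from the comment); the way its threshold scales with the number of tests amounts to an implicit assumption about the base rate of true effects:

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of hypotheses rejected by the BH step-up rule:
    find the largest k with p_(k) <= fdr * k / m, reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / m:
            k = rank
    return sorted(order[:k])
```

For example, with p-values [0.01, 0.02, 0.03, 0.9], BH at FDR 0.05 rejects the first three, where a plain Bonferroni cut at 0.05/4 would reject only the first.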

Man – you are all on fire today. I’ll throw in an option I’ve been toying with, just so I get to play too:

How about an empirical idea? I’ll stick with the social sciences, but I think it could work for lab rats (literally) too.

We think some new teaching method will improve test scores. We get scores from everyone, get a point estimate of the effect of treatment using a regression/comparison, and then do a permutation test. This gives us a kind of empirical p-value. But we want to know something about precision – so let’s say our point estimate is 1sd (student scores under new teaching method are 1sd higher than under old method, and that result is unlikely given the variation in the data). So now we subtract .1sd from the scores of all the students in treated schools, and do another permutation test. We reject that at a rate below some threshold, so now we subtract .2sd and run it again. Eventually, we’ll fail to reject based on our threshold, and we’ve found a lower bound (the last amount we could subtract and still reject).
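The stepped-subtraction procedure above can be sketched directly (toy data and function names are mine; one-sided test, coarse 0.1-sd grid, assuming scores are already standardized):

```python
import random

def perm_p(treated, control, n_perm=2000, seed=0):
    """One-sided permutation p-value for mean(treated) > mean(control)."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / (len(pooled) - n_t)
        if diff >= observed:
            hits += 1
    return hits / n_perm

def permutation_lower_bound(treated, control, step=0.1, alpha=0.05):
    """Largest amount we can subtract from every treated score and still
    reject 'no effect' at level alpha -- the empirical lower bound."""
    delta = 0.0
    while perm_p([t - (delta + step) for t in treated], control) < alpha:
        delta += step
    return delta
```

With a true gap of about 1 sd and little noise, the loop typically stops around 0.9: even after shaving 0.9 sd off the treated scores, a random shuffling of labels rarely produces a gap that large.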

I guess my point is that from my empiricist standpoint, a permutation test gives us a really reliable p-value given the sample. That doesn’t necessarily translate into a p-value about the population (or super-population, or whatever), but it is clean and clear and appropriately relative (relative to the other permutations of the data). I guess for me the question of “what is the probability that this result reveals something real in the world” is just asking for too much, and sort of asking an awkward question, because, as Andrew hammers over and over, there probably isn’t one particular real world parameter out there in the first place.

I come back to the discussion of what statistics should do – data analysis should help us learn about the world. The first step: “Is the result I get likely given the data?” (that’s where my permutation test/lower bound come in). The second step is “How much does this change my view of the world or impact how I should act in the future?” Here is where Andrew’s cost/benefit/probability decision making comes in. The only part I remain unconvinced about is where in the process we should model the cost/benefit/probability tradeoffs – the Gelman Bayesian response seems to be “in the estimation process,” whereas my skeptical view and desire for super-clean-and-clear statistical analysis leans (at the moment) more to “after the estimation process.”

I don’t like the whole capitalizing-the-word-“Normal” thing, but otherwise the paper looks interesting.

Daniel:

Different people have different styles. It would kill me to only publish 2 or 3 papers a decade!

That’s where the decision analysis comes in. If we’re worried about a major earthquake hitting NYC tomorrow, then 5:1 odds or even 5000:1 odds are not enough for anything close to certainty! I think that with scientific claims, it’s rare that one paper will give certainty, especially in the part of science that needs statistics to demonstrate that effects could not have just come by chance. So, there, I think we need to generally move away from the hope/expectation/norm that a single study gives near-certainty. Maybe that’s why I’m ok with 5:1 odds, because I’m typically not prepared to take a single study as definitive evidence in any case.

Setting aside the issue of the priors for now, I’d say we need vocabulary to distinguish between (weak) evidence that’s strong enough to give us a clear preference for one action over another without removing our sense of uncertainty, and (strong) evidence that all but removes uncertainty.

Something like this http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ba/1346158771 ?

(Combining Expert Opinions in Prior Elicitation)

Konrad:

It depends on the context. But if I think there’s a 5:1 chance that Treatment A will help me more than Treatment B (and assuming that the effect size is not tiny, that the distribution of effects is symmetrical, and that there are no huge cost differences between the treatments), then, yes, I’d think that’s a strong reason to go with Treatment A. Again, I think that one reason people don’t think of 5:1 as strong odds is that the odds that get calculated are typically calculated with respect to priors that we don’t believe.

I prefer to think of 5:1 odds primarily as an expression of serious uncertainty.

In particular it might be interesting to do Bayesian inference on “broadening factors” for expert opinion in various contexts. Meta-Bayes!

Hence the tendency to want to broaden expert priors before using them, as a kind of “conservatism”. But broadening expert priors might be a much better way to go than starting with infinitely broad “defaults”.

There are at least a couple of problems with this:

1) You pollute the world with things like “lower salt diets reduce the risk of heart disease” followed by “lower salt diets do not have clear benefit for heart disease” followed by “lower salt diets increase risk of overall mortality” or…

2) You waste society’s time and money looking at things that have very little risk of causing harm when you’re wrong because they have very little importance for the world at large… such as “new dynamic model of iron transport in the ocean shows that previous dynamic model of iron transport in the ocean may have been off by a factor of 2 in small regions near the coast of Alaska”. Sure it’s of interest to a few people, and sure it might someday be relevant to some actual decision someone has to make, but in point of fact, both models are probably very wrong anyway especially since it takes enormous amounts of money to collect enough data to calibrate them, and they’re poorly specified to begin with.

The third option is to do good careful science that means something important for the world, like studying the actual physiological mechanisms and feedbacks associated with blood pressure regulation and the effect of diet on that process and things like that… publish 2 or 3 extremely valuable and correct papers a decade… if you make it past your 3 to 5 year mid-tenure review with no publications and no grants. This seems to appeal to people who like playing high-stakes lotteries or dealing drugs:

http://alexandreafonso.wordpress.com/2013/11/21/how-academia-resembles-a-drug-gang/

http://www.insidehighered.com/quicktakes/2011/09/02/prof-charged-leading-motorcycle-gang-drug-ring

It’s been done, more or less. (Key word: “elicitation”.) Subject-matter experts tend to be overconfident, leading to a lack of intersubjective coherence; e.g., in a collection of 90% subjective intervals for some unknown quantity, far fewer than 90% of the intervals would contain the unknown value — for *any* possible value.

Doesn’t the “emphasize two things” bring us back to here http://statmodeling.stat.columbia.edu/2013/11/21/hidden-dangers-noninformative-priors/#comment-151654 ?

Also your first paragraph does seem like a description of relative belief intervals that Mike Evans is working on e.g. http://ba.stat.cmu.edu/journal/forthcoming/evans.pdf

+2

… or you can go with it and publish things as a record of your activity, even if you’re not as sure as you would like about being right?

Sander Greenland and likely others have written on this – getting a credible reference set of past studies and then flattening the (correctly calculated) posterior to get a _safer_ prior seemed to be more critical considerations.

But it is always compared to the alternative (e.g., as W. C. Fields once said, the alternative to aging is not that attractive).

One thing I’d love to do is go to ten random researchers and ask them to draw the prior that they’d use. And then compare how similar they are.

Flat priors may not make sense, but is there a consensus prior at all?

+1

(Note: I am fully aware of and agree with your point about the best method for evaluating claims being a fully decision-theoretic one, including costs, benefits, and probabilities. I’m just pointing out here that the standard p-value construction using tail probability is not necessarily very Bayesian.)

To get a sense of how this works relative to standard methods, consider two cases: the standard normal and the Cauchy. Under this recommendation we have normal ~ [-3, 3] and Cauchy ~ [-10, 10].

The associated core probabilities are 99.7% and 93.7%.

I think the point of all this is to emphasize two things. First, the density measures how credible a given region is, and a relative density of 0.01 means 100 times less credible than the max; this is a local measure, unlike the tail probability, which emphasizes the probability that something might exceed a certain value (or, more generally, be ANYWHERE outside a certain region). Second, long-tailed distributions can require us to go pretty far from actually credible values in order to integrate enough total probability to make up, say, 95% of the total. The 95% confidence interval for the Cauchy is [-12.7, 12.7], but the density at 12.7 is 0.62% of the max density: 162 times less credible than the value at 0.
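The numbers quoted above can be reproduced directly from the two densities (standard normal and standard Cauchy; the region is where the density is at least 1% of its max):

```python
from math import erf, sqrt, log, atan, pi

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Endpoint where the density falls to 1% of its max:
# normal: exp(-x^2/2) = 0.01  ->  x = sqrt(2 ln 100) ≈ 3.0
# Cauchy: 1/(1+x^2)   = 0.01  ->  x = sqrt(99)       ≈ 9.9
print(sqrt(2 * log(100)), sqrt(99))

# "Core probabilities" inside [-3, 3] and [-10, 10]:
core_normal = 2 * norm_cdf(3) - 1   # ≈ 0.997
core_cauchy = 2 * atan(10) / pi     # ≈ 0.937
print(core_normal, core_cauchy)

# Density at the Cauchy 95%-interval endpoint, relative to the max:
rel = 1 / (1 + 12.7 ** 2)
print(rel, 1 / rel)                 # ≈ 0.0062, about 162x less credible
```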

*Sigh* and this is exactly the kind of thing that makes me want to stay out of academia. I have such a love-hate relationship with academia, but I know that while I’m not necessarily likely to publish things as important as Peter Higgs, I’m sure as hell not going to be able to sleep at night if I am pumping out paper after paper about blather whose only real purpose is to up my publication score. My wife has had the same problem: she’s a very careful bio-scientist who takes the time and effort to get things really understood in a fundamental and correct way. This leads to the kind of career stress you are talking about.

The hedge funds that are large universities are also not really about “academic” performance as organizations. They’re chasing donations and bloating their administration as fast as possible to game a financial system in much the same way that banks were 5 to 10 years ago. (just think of “student loans” as the new “predatory mortgage”)

Ian:

Yes, it’s tough. One answer (appropriate in that we’ve been thinking a lot about Lindley lately) is to say that it is good discipline to be forced to state and work with a subject-matter-specific prior distribution for effect sizes. Even if the prior is wrong (as it certainly will be) in the sense of not actually capturing the population distribution of true effects, it can be a start, and it points the discussion of the article toward a discussion of base rates and existing evidence for effect sizes.

Maybe you were envisioning something different?

Publications serve two purposes — an announcement of scientific results, and a record of professional activity — and these are in tension. Academics are not rewarded for being right, they’re rewarded for publishing. (Nobel prizes may be an exception. Peter Higgs got his prize because it appears he was correct. But he also almost lost his job for not publishing enough.)

]]>