Andrew: This accordance with 1-sided tests, however, differs from Johnson, who appears to put .5 on the null and .5 on the maximally likely point alternative.

I am a bit puzzled to see Johnson say, on p. 3, that his approach “provides a remedy to the two primary deficiencies of classical significance tests—their inability to quantify evidence in favor of the null hypothesis when the null hypothesis is not rejected, and their tendency to exaggerate evidence against the null when it is.” I would deny these, but I’ll put off reacting until I get clear on his interpretation of “reject the null.” Any thoughts would be appreciated.

Permutation tests and similar random-number-generator strategies answer the question “Are these data consistent with a model in which they come from some particular purely random number generator?” I guess there are situations where this makes sense. For example, you’re trying to use your “knowledge” to pick important genes out of the genome and you want to determine if the ones you picked have more of some specific feature than a set selected by a random number generator, or you want to determine whether a particular economic outcome is consistent with pure noise. Typically, though, if you have an actual model of some process, the Bayesian machinery is going to address the question of interest more directly: “What are the plausible values of the parameters in the model of the process?” You can then determine whether sign or magnitude errors in those parameters are important.

So, I guess what I’m saying is it all depends on what question you’re asking and whether “random number generation” is a plausible explanation for your data.
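The gene-picking version above can be sketched as a Monte Carlo comparison against random sets of the same size. This is a minimal illustration (function and variable names are mine, not from the comment):

```python
import random

def enrichment_p(picked_scores, all_scores, n_draws=10_000, seed=0):
    """Fraction of random same-size gene sets whose mean feature score
    is at least as high as the picked set's -- a Monte Carlo p-value."""
    rng = random.Random(seed)
    k = len(picked_scores)
    observed = sum(picked_scores) / k
    hits = sum(
        1 for _ in range(n_draws)
        if sum(rng.sample(all_scores, k)) / k >= observed
    )
    return hits / n_draws
```

If the picked genes score no better than a random draw, this p-value hovers near the middle of its range; a tiny value says “a random number generator rarely does this well.”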

EJ:

Again, I think this comes from Jeffreys (and others) using priors they don’t really believe. 3:1 odds based on a flat prior you don’t believe, sure, that’s not much to go on. But an actual 75% probability (for example, the consensus 75% probability, as of October 2012, that Obama would win the election), that’s worth a mention for sure.

As we’ve discussed, if you start with a flat prior on theta and then observe data y ~ N(theta, 1) with y = 1, then your posterior probability is 84% that theta > 0. That’s about 5:1 odds. But it’s not real 5:1 odds. It’s 5:1 odds based on a model we don’t believe.
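The arithmetic behind the 84% is a one-liner; a quick check using nothing beyond the standard normal CDF:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Flat prior on theta, y ~ N(theta, 1), observed y = 1:
# the posterior is theta ~ N(1, 1), so
p = norm_cdf(1.0)     # P(theta > 0 | y = 1) ≈ 0.84
odds = p / (1 - p)    # ≈ 5.3, i.e. roughly 5:1
print(p, odds)
```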

The problem arises due to the separation of inference and decision-making. If the algorithm is…

(1) Evaluate whether an intervention/policy/etc. has a big enough effect that we should care (this usually means statistical significance and an effect size that we might care about). If step 1 says yes, then…

(2) Conduct some kind of cost-benefit analysis that weighs the consequences of the intervention/policy/etc. against what we might do otherwise.

(3) Act on the results of step 2.

Because of the way we do step 1, we are biasing the results of step 2.

To me it seems the requirements we place on the quality of evidence derive pretty directly from the friction costs of policy change (which might be infinite if a policy is irreversible). If we can switch between interventions at no cost, then there is no role for statistical significance, even if reformulated in a more appropriate way (http://tinyurl.com/Claxton-Irrelevance). The potential value of further research is another reason to consider current uncertainty, but as both of these issues (friction costs of policy change, value of new research) are so context dependent, no single rule will suffice.

In reality we can’t expect every study to undertake a decision analysis, especially since a lot of study results will only relate indirectly to actual decisions, so we need some heuristics to work with. If this is the case, it seems to me that a p<0.1 criterion would actually be better than a p<0.005 criterion. For p<0.1, we have more information reported, so less of a bias in effect sizes. As we would have so many 'conflicting' results reported, it would be clearer that we need to base our beliefs on some kind of evidence synthesis rather than just the results of the most recent trial. In other words, a p<0.1 criterion would propel us more rapidly toward the appropriate system for understanding new research findings.
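The effect-size bias claim can be checked by simulation: condition reporting on clearing a z threshold and compare the average reported estimate to the truth. A sketch under made-up numbers (true effect 0.2, standard error 0.2; one-sided z cutoffs standing in for the p<0.1 and p<0.005 criteria):

```python
import random
from statistics import mean

def mean_reported_effect(true_effect, se, z_crit, n_studies=100_000, seed=1):
    """Average estimate among simulated studies that clear a z threshold,
    i.e., the ones that get reported."""
    rng = random.Random(seed)
    estimates = (rng.gauss(true_effect, se) for _ in range(n_studies))
    return mean(est for est in estimates if est / se > z_crit)

true_effect, se = 0.2, 0.2
loose = mean_reported_effect(true_effect, se, 1.28)   # roughly one-sided p < 0.10
strict = mean_reported_effect(true_effect, se, 2.58)  # roughly one-sided p < 0.005
print(loose, strict)  # both exceed 0.2; the stricter filter exaggerates more
```

Both filters overstate the truth, but the stricter one keeps only the luckiest draws and so overstates it more, which is the bias the comment points to.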

Also note, the biology work will have had a PI, a postdoc, a grad student, and 2 technicians involved in each paper.

The engineering paper will have had a PI and a grad student or a postdoc, usually not both. So, if a PI can supervise 1-4 grad students or postdocs, in biology the PI can get maybe a factor of 2 over this base rate, in engineering maybe up to a factor of 4. Typically, though, that’s going to be in steady state (i.e., after the tenure period, at maybe 10 years into the career).

But then, depending on the style of interaction, putting the PI on some of these papers can be questionable. Particularly, say, when postdocs get some initial suggestion from a PI, go off and do all the work, and can never get a meeting with the PI… sometimes it’s just rubber-stamping the PI’s name on the author list, perhaps largely because they got the grant to do the work. Situations can vary a lot both in different fields and from person to person. There are notorious examples of this sort of thing in Physics that I’ve heard about, where the person who did the actual work wasn’t even put on the paper.

Yes.

I think it’s not just different people, but different fields. I have experience in engineering/physics and biology/medicine. Particularly in biology/medicine it can take a big part of a decade to do a laboratory study or a series of clinical trials. Suppose it takes 5 years to carry out some studies in mice, and you get two good papers out of that, and then another 5 years to follow up with a new line of research using different techniques… with 2 good papers. In 10 years you have 4 good papers.

In social sciences on the other hand, there’s all this data collected by others just waiting to be looked at: voting, census, bureau of labor stats, world bank… it would be crazy to publish 3 or 4 papers a decade when there’s no-cost data to be had.

Perhaps engineering/physics is in-between. There’s plenty of room for theoretical studies, computational followups, and some experiments to confirm. There are no animal use restrictions or human clinical trials to be organized. Maybe a reasonable rate for civil engineering is 1 good paper a year, two if you are pure theory with no experimental/computational component.

Steve:

I wouldn’t say that Bayesianism is logically superior to Fisherianism—it all depends on how good the assumptions are, and what information is available—but, in any case, the 0.05 system can indeed be prejudiced, because there is a lot of choice in what tests to make, and what test to focus on.

Rahul:

It depends on the example. But in lots of cases reasonable researchers can agree at least on weak priors. For example if you’re studying changes of vote preferences during presidential election campaigns, it’s highly unlikely there will be true effects of 5 percentage points or more.

Problem is at the reader end. We’ve given a reader no easy way to distinguish between papers. An “extremely valuable & correct paper” looks the same superficially as crap churned out in a month.

What we sorely need is some flavor of reputational or quality validation metric.

But if “expert” opinion seems so broad and no expert essentially agrees with the others, what’s the point in pushing people away from flat priors?

Dump flat priors and then go to _what_? No one seems to even remotely agree on a good prior choice for any one problem…

The multiple testing corrections in genomics are a roundabout way of introducing a domain-specific prior. As sample sizes get larger and larger, issues with correcting for base rates using multiple comparison adjustments will probably lead to some confusion.
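To make the "roundabout prior" point concrete, here is a bare-bones Benjamini–Hochberg step-up procedure (my own illustrative implementation, not anything from the comment); the way its threshold scales with the number of tests amounts to an implicit assumption about the base rate of true effects:

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of hypotheses rejected by the BH step-up rule:
    find the largest k with p_(k) <= fdr * k / m, reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / m:
            k = rank
    return sorted(order[:k])
```

For example, with p-values [0.01, 0.02, 0.03, 0.9], BH at FDR 0.05 rejects the first three, where a plain Bonferroni cut at 0.05/4 would reject only the first.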

Man – you are all on fire today. I’ll throw in an option I’ve been toying with, just so I get to play too:

How about an empirical idea? I’ll stick with the social sciences, but I think it could work for lab rats (literally) too.

We think some new teaching method will improve test scores. We get scores from everyone, get a point estimate of the effect of treatment using a regression/comparison, and then do a permutation test. This gives us a kind of empirical p-value. But we want to know something about precision – so let’s say our point estimate is 1sd (student scores under new teaching method are 1sd higher than under old method, and that result is unlikely given the variation in the data). So now we subtract .1sd from the scores of all the students in treated schools, and do another permutation test. We reject that at a rate below some threshold, so now we subtract .2sd and run it again. Eventually, we’ll fail to reject based on our threshold, and we’ve found a lower bound (the last amount we could subtract and still reject).
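The stepped-subtraction procedure above can be sketched directly (toy data and function names are mine; one-sided test, coarse 0.1-sd grid, assuming scores are already standardized):

```python
import random

def perm_p(treated, control, n_perm=2000, seed=0):
    """One-sided permutation p-value for mean(treated) > mean(control)."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / (len(pooled) - n_t)
        if diff >= observed:
            hits += 1
    return hits / n_perm

def permutation_lower_bound(treated, control, step=0.1, alpha=0.05):
    """Largest amount we can subtract from every treated score and still
    reject 'no effect' at level alpha -- the empirical lower bound."""
    delta = 0.0
    while perm_p([t - (delta + step) for t in treated], control) < alpha:
        delta += step
    return delta
```

With a true gap of about 1 sd and little noise, the loop typically stops around 0.9: even after shaving 0.9 sd off the treated scores, a random shuffling of labels rarely produces a gap that large.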

I guess my point is that from my empiricist standpoint, a permutation test gives us a really reliable p-value given the sample. That doesn’t necessarily translate into a p-value about the population (or super-population, or whatever), but it is clean and clear and appropriately relative (relative to the other permutations of the data). I guess for me the question of “what is the probability that this result reveals something real in the world” is just asking for too much, and sort of asking an awkward question, because, as Andrew hammers over and over, there probably isn’t one particular real world parameter out there in the first place.

I come back to the discussion of what statistics should do – data analysis should help us learn about the world. The first step: “Is the result I get likely given the data?” (that’s where my permutation test/lower bound come in). The second step is “How much does this change my view of the world or impact how I should act in the future?” Here is where Andrew’s cost/benefit/probability decision making comes in. The only part I remain unconvinced about is where in the process we should model the cost/benefit/probability tradeoffs – the Gelman Bayesian response seems to be “in the estimation process,” whereas my skeptical view and desire for super-clean-and-clear statistical analysis leans (at the moment) more to “after the estimation process.”

I don’t like the whole capitalizing-the-word-“Normal” thing, but otherwise the paper looks interesting.

Daniel:

Different people have different styles. It would kill me to only publish 2 or 3 papers a decade!

That’s where the decision analysis comes in. If we’re worried about a major earthquake hitting NYC tomorrow, then 5:1 odds or even 5000:1 odds are not enough for anything close to certainty! I think that with scientific claims, it’s rare that one paper will give certainty, especially in the part of science that needs statistics to demonstrate that effects could not have just come by chance. So, there, I think we need to generally move away from the hope/expectation/norm that a single study gives near-certainty. Maybe that’s why I’m ok with 5:1 odds, because I’m typically not prepared to take a single study as definitive evidence in any case.

Setting aside the issue of the priors for now, I’d say we need vocabulary to distinguish between (weak) evidence that’s strong enough to give us a clear preference for one action over another without removing our sense of uncertainty, and (strong) evidence that all but removes uncertainty.

Something like this http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ba/1346158771 ?

(Combining Expert Opinions in Prior Elicitation)

Konrad:

It depends on the context. But if I think there’s a 5:1 chance that Treatment A will help me more than Treatment B (and assuming that the effect size is not tiny, that the distribution of effects is symmetrical, and that there are no huge cost differences between the treatments), then, yes, I’d think that’s a strong reason to go with Treatment A. Again, I think that one reason people don’t think of 5:1 as strong odds is that the odds that get calculated are typically calculated with respect to priors that we don’t believe.

I prefer to think of 5:1 odds primarily as an expression of serious uncertainty.

In particular it might be interesting to do Bayesian inference on “broadening factors” for expert opinion in various contexts. Meta-Bayes!

Hence the tendency to want to broaden expert priors before using them, as a kind of “conservatism”. But broadening expert priors might be a much better way to go than starting with infinitely broad “defaults”.

There are at least a couple of problems with this:

1) You pollute the world with things like “lower salt diets reduce the risk of heart disease” followed by “lower salt diets do not have clear benefit for heart disease” followed by “lower salt diets increase risk of overall mortality” or…

2) You waste society’s time and money looking at things that have very little risk of causing harm when you’re wrong because they have very little importance for the world at large… such as “new dynamic model of iron transport in the ocean shows that previous dynamic model of iron transport in the ocean may have been off by a factor of 2 in small regions near the coast of Alaska”. Sure it’s of interest to a few people, and sure it might someday be relevant to some actual decision someone has to make, but in point of fact, both models are probably very wrong anyway especially since it takes enormous amounts of money to collect enough data to calibrate them, and they’re poorly specified to begin with.

The third option is to do good careful science that means something important for the world, like studying the actual physiological mechanisms and feedbacks associated with blood pressure regulation and the effect of diet on that process and things like that… publish 2 or 3 extremely valuable and correct papers a decade… if you make it past your 3 to 5 year mid-tenure review with no publications and no grants. This seems to appeal to people who like playing high-stakes lotteries or dealing drugs:

http://alexandreafonso.wordpress.com/2013/11/21/how-academia-resembles-a-drug-gang/

http://www.insidehighered.com/quicktakes/2011/09/02/prof-charged-leading-motorcycle-gang-drug-ring

It’s been done, more or less. (Key word: “elicitation”.) Subject-matter experts tend to be overconfident, leading to a lack of intersubjective coherence; e.g., in a collection of 90% subjective intervals for some unknown quantity, far fewer than 90% of the intervals would contain the unknown value — for *any* possible value.

Doesn’t the “emphasize two things” bring us back to here http://statmodeling.stat.columbia.edu/2013/11/21/hidden-dangers-noninformative-priors/#comment-151654 ?

Also your first paragraph does seem like a description of relative belief intervals that Mike Evans is working on e.g. http://ba.stat.cmu.edu/journal/forthcoming/evans.pdf

+2

… or you can go with it and publish things as a record of your activity, even if you’re not as sure as you would like about being right?

Sander Greenland and likely others have written on this – getting a credible reference set of past studies and then flattening the (correctly calculated) posterior to get a _safer_ prior seemed to be more critical considerations.

But it is always compared to the alternative (e.g., as W. C. Fields once said, the alternative to aging is not that attractive).

One thing I’d love to do is go to ten random researchers and ask them to draw the prior that they’d use. And then compare how similar they are.

Flat priors may not make sense, but is there a consensus prior at all?

+1

(Note: I am fully aware of and agree with your point about the best method for evaluating claims being a fully decision-theoretic one, including costs, benefits, and probabilities. I’m just pointing out here that the standard p-value construction using tail probability is not necessarily very Bayesian.)

To get a sense of how this works relative to standard methods, consider two cases: the standard normal and the Cauchy. Under this recommendation we have normal ~ [-3, 3] and Cauchy ~ [-10, 10].

The associated core probabilities are 99.7% and 93.7%.

I think the point of all this is to emphasize two things. First, the density measures how credible a given region is, and a relative density of 0.01 means 100 times less credible than the max; this is a local measure, unlike the tail probability, which emphasizes the probability that something might exceed a certain value (or, more generally, be ANYWHERE outside a certain region). Second, long-tailed distributions can require us to go pretty far from actually credible values in order to integrate enough total probability to make up, say, 95% of the total. The 95% confidence interval for the Cauchy is [-12.7, 12.7], but the density at 12.7 is 0.62% of the max density: 162 times less credible than the value at 0.
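The numbers quoted above can be reproduced directly from the two densities (standard normal and standard Cauchy; the region is where the density is at least 1% of its max):

```python
from math import erf, sqrt, log, atan, pi

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Endpoint where the density falls to 1% of its max:
# normal: exp(-x^2/2) = 0.01  ->  x = sqrt(2 ln 100) ≈ 3.0
# Cauchy: 1/(1+x^2)   = 0.01  ->  x = sqrt(99)       ≈ 9.9
print(sqrt(2 * log(100)), sqrt(99))

# "Core probabilities" inside [-3, 3] and [-10, 10]:
core_normal = 2 * norm_cdf(3) - 1   # ≈ 0.997
core_cauchy = 2 * atan(10) / pi     # ≈ 0.937
print(core_normal, core_cauchy)

# Density at the Cauchy 95%-interval endpoint, relative to the max:
rel = 1 / (1 + 12.7 ** 2)
print(rel, 1 / rel)                 # ≈ 0.0062, about 162x less credible
```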

*Sigh* and this is exactly the kind of thing that makes me want to stay out of academia. I have such a love-hate relationship with academia, but I know that while I’m not necessarily likely to publish things as important as Peter Higgs, I’m sure as hell not going to be able to sleep at night if I am pumping out paper after paper about blather whose only real purpose is to up my publication score. My wife has had the same problem: she’s a very careful bio-scientist who takes the time and effort to get things really understood in a fundamental and correct way. This leads to the kind of career stress you are talking about.

The hedge funds that are large universities are also not really about “academic” performance as organizations. They’re chasing donations and bloating their administration as fast as possible to game a financial system in much the same way that banks were 5 to 10 years ago. (just think of “student loans” as the new “predatory mortgage”)

Ian:

Yes, it’s tough. One answer (appropriate in that we’ve been thinking a lot about Lindley lately) is to say that it is good discipline to be forced to state and work with a subject-matter-specific prior distribution for effect sizes. Even if the prior is wrong (as it certainly will be) in the sense of not actually capturing the population distribution of true effects, it can be a start, and it points the discussion of the article toward a discussion of base rates and existing evidence for effect sizes.

Maybe you were envisioning something different?

Publications serve two purposes — an announcement of scientific results, and a record of professional activity — and these are in tension. Academics are not rewarded for being right, they’re rewarded for publishing. (Nobel prizes may be an exception. Peter Higgs got his prize because it appears he was correct. But he also almost lost his job for not publishing enough.)

]]>