Priors

Posted on July 16, 2013 9:13 AM by Andrew

Nick Firoozye writes:

While I am absolutely sympathetic to the Bayesian agenda I am often troubled by the requirement of having priors. We must have priors on the parameters of an infinite number of models we have never seen before and I find this troubling. There is a similarly troubling problem in economics of utility theory. Utility is on consumables. To be complete a consumer must assign utility to all sorts of things they never would have encountered. More recent versions of utility theory instead make consumption goods a portfolio of attributes. Cadillacs are x many units of luxury, y of transport, etc. And we can automatically have personal utilities for all these attributes.

I don’t ever see parameters. Some models have few and some have hundreds. Instead, I see data. So I don’t know how to have an opinion on parameters themselves. Rather I think it far more natural to have opinions on the behavior of models. The prior predictive density is a good and sensible notion. Also, if we have conditional densities for VARs, then the prior conditional density. You have opinions about how variables interact and the forecast of some subset conditioning on the remainder. That this may or may not give enough info to ascribe a proper prior in parameter space all the better. To the extent it does not we must arbitrarily pick one (e.g., reference prior or maxent prior subject to the data/model prior constraints). Without reference to actual data I do not see much point in trying to have any opinion at all.

My reply: I do have some thoughts on the topic, especially after seeing Larry’s remark (which I agree with) that “noninformative priors are a lost cause.”

As I wrote in response to Larry, in some specific cases, noninformative priors can improve our estimates (see here, for example), but in general I’ve found that it’s a good idea to include prior information. Even weak prior information can make a big difference (see here, for example).

And, yes, we can formulate informative priors in high dimensions, for example by assigning priors to lower-dimensional projections that we understand. A reasonable goal, I think, is for us to set up a prior distribution that is informative without hoping that it will include all our prior information. Which is the way we typically think about statistical models in general. We still have a ways to go, though, in developing intuition and experience with high-dimensional models such as splines and Gaussian processes.

I will illustrate some of the simpler (but hardly trivial) issues with prior distributions with two small examples.

Example 1: Consider an experiment comparing two medical treatments with an estimated effect of 1 (on some scale) with standard error 1. Such a result is, of course, completely consistent with a zero effect. The usual Bayes inference (with noninformative uniform prior) is N(1,1), thus implying an 84% probability that the effect is positive.

This seems wrong, the idea that something recognizable as pure noise can lead to 5:1 posterior odds.
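The 84% and 5:1 figures are just the normal tail area at one standard error. Here is a minimal sketch of that arithmetic in Python (my choice for a quick check, not part of the original analysis); the variable names are made up for illustration.

```python
from statistics import NormalDist

# Flat prior + one estimate y with standard error se  =>  posterior is N(y, se^2).
y, se = 1.0, 1.0
posterior = NormalDist(mu=y, sigma=se)

p_positive = 1.0 - posterior.cdf(0.0)      # P(theta > 0 | y)
odds = p_positive / (1.0 - p_positive)     # posterior odds of a positive effect

print(f"P(theta > 0) = {p_positive:.3f}")
print(f"posterior odds = {odds:.1f}:1")
```

The probability comes out at about 0.84 and the odds a bit above 5:1, matching the numbers in the text.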

The problem is coming from the prior distribution. We can see this in two ways. First, just directly, effects near zero are more common than large effects. In our 2008 paper, Aleks and I argued that logistic regression coefficients are usually less than 1. So let’s try combining N(1,1) data with a Cauchy(0,1) prior. It’s easy enough to do in Stan.

First the model (which I’ll save in a file “normal.stan”):

data {
  real y;
}
parameters {
  real theta;
}
model {
  theta ~ cauchy(0, 1);
  y ~ normal(theta, 1);
}

Then the R script:

library("rstan")
y <- 1
fit1 <- stan(file="normal.stan", data="y", iter=1000, chains=4)
print(fit1)
sim1 <- extract(fit1, permuted=TRUE)
print(mean(sim1$theta > 0))

The result is 0.77, that is, roughly 3:1 posterior odds that the effect is positive.

Just to check that I’m not missing anything, let me re-run using the flat prior. New Stan model:

data {
  real y;
}
parameters {
  real theta;
}
model {
  y ~ normal(theta, 1);
}

and then I rerun with the same R code. This time, indeed, 84% of my posterior simulations of theta are greater than 0.
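Because the model has a single parameter, both posterior probabilities can also be cross-checked without MCMC by integrating the unnormalized posterior on a grid. The following is a rough sketch in Python rather than Stan; the function name, grid limits, and step count are assumptions chosen for illustration, not anything from the Stan run above.

```python
import math

def prob_positive(y=1.0, se=1.0, cauchy_scale=None, lo=-40.0, hi=40.0, n=80_000):
    """P(theta > 0 | y) by midpoint-rule quadrature on a fixed grid.

    cauchy_scale=None means a flat prior; otherwise theta ~ Cauchy(0, cauchy_scale).
    """
    step = (hi - lo) / n
    total = positive = 0.0
    for i in range(n):
        theta = lo + (i + 0.5) * step
        # Normal likelihood for y given theta, up to a constant.
        density = math.exp(-0.5 * ((y - theta) / se) ** 2)
        if cauchy_scale is not None:
            # Cauchy prior density, up to a constant.
            density /= 1.0 + (theta / cauchy_scale) ** 2
        total += density
        if theta > 0:
            positive += density
    return positive / total

p_flat = prob_positive(cauchy_scale=None)
p_cauchy = prob_positive(cauchy_scale=1.0)
print(f"flat prior:   P(theta > 0) = {p_flat:.3f}")
print(f"Cauchy prior: P(theta > 0) = {p_cauchy:.3f}")
```

The flat-prior value should reproduce the 84%, and the Cauchy value should land in the neighborhood of the simulated 0.77 (with a couple of thousand draws, the simulation itself is only good to about a percentage point). Passing a smaller `cauchy_scale` is one way to try out a stronger prior.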

So far so good. Although one might argue that the posterior probability of 0.77 (from the inference given the unit Cauchy prior) is still too high. Perhaps we want a stronger prior? This sort of discussion is just fine. If you look at your posterior inference and it doesn’t make sense to you, this “doesn’t make sense” corresponds to additional prior information you haven’t included in your analysis.

OK, so that’s one way to consider the unreasonableness of a noninformative prior in this setting. It’s not so reasonable to believe that effects are equally likely to be any size. They’re generally more likely to be near zero.

The other way to see what’s going on with this example is to take that flat prior seriously. Suppose theta really could be just about anything—or, to keep things finite, suppose you wanted to assign theta a uniform prior distribution on [-1000,1000], and then you gather enough data to estimate theta with a standard deviation of 1. Then, a priori, you’re nearly certain to gather very very strong information about the sign of theta. To start with, there’s a 0.998 chance that your estimate will be more than 2 standard errors away from zero so that your posterior certainty about the sign of theta will be at least 20:1. And there’s a 0.995 chance that your estimate will be more than 5 standard errors away from zero.

So, in your prior distribution, this particular event—that y is so close to zero that there is uncertainty about theta’s sign—is extremely unlikely. And it would be irrelevant that y is not statistically significantly different from 0.
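The 0.998 and 0.995 figures can be checked by averaging the sampling distribution of y over the uniform prior. This is a sketch in Python under the assumptions stated above (theta uniform on [-1000,1000], y normal with standard error 1); the function name and the integration step are hypothetical choices of mine.

```python
from statistics import NormalDist

def prob_estimate_beyond(c, half_width=1000.0, se=1.0, step=0.05):
    """Prior predictive P(|y| > c*se) for theta ~ Uniform(-half_width, half_width)
    and y ~ N(theta, se^2), via midpoint-rule integration over theta."""
    noise = NormalDist(0.0, se)
    n = int(round(2 * half_width / step))
    inside = 0.0
    for i in range(n):
        theta = -half_width + (i + 0.5) * step
        # P(-c*se <= y <= c*se) for this value of theta.
        inside += noise.cdf(c * se - theta) - noise.cdf(-c * se - theta)
    inside *= step / (2 * half_width)   # average over the uniform prior
    return 1.0 - inside

p_beyond_2 = prob_estimate_beyond(2)
p_beyond_5 = prob_estimate_beyond(5)
print(f"P(|y| > 2 s.e.) = {p_beyond_2:.4f}")
print(f"P(|y| > 5 s.e.) = {p_beyond_5:.4f}")
```

The answer is essentially 1 minus the fraction of the prior range lying within c standard errors of zero, which is why a wider "noninformative" range makes an ambiguous-sign result even less plausible a priori.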

Example 2: The basic mathematics above is, in fact, relevant in many many real-life situations. Consider one of my favorite examples, the study that found that more attractive parents were more likely to have girls. The result from the data, after running the most natural (to me) regression analysis, was an estimate of 4.7% (that is, in the data at hand, more beautiful parents in the dataset were 4.7 percentage points, on average, more likely to have girls, compared to less beautiful parents) with a standard error of 4.3%. The published analysis (which isolated the largest observed difference in a multiple comparisons setting) was a difference of 8% with a standard error of about 3.5%. In either case, the flat-prior analysis gives you a high posterior probability that the difference is positive in the general population, and a high posterior probability that this difference is large (more than 1 percentage point, say).

Why do I say that a difference of more than 1 percentage point would be large? Because, in the published literature on sex ratios, most differences (as estimated from large populations) are much less than 1%. For example, African-American babies are something like 0.5% more likely to be girls, compared to European-American babies. The only really large effects in the literature come from big things like famines.

Based on the literature and on the difficulty of measuring attractiveness, I’d say that a reasonable weak prior distribution for the difference in probability of girl birth, comparing beautiful and ugly parents in the general population, is N(0,0.003^2), that is, normal centered at 0 with standard deviation 0.3 percentage points. This is equivalent to data from approximately 166,000 people. (Consider a survey with n parents. Compare the sex ratio of the prettiest n/3 to the ugliest n/3; the s.e. is sqrt(0.5^2/(n/3) + 0.5^2/(n/3)) = 0.5 sqrt(6/n). Equivalent info: 0.003 = 0.5 sqrt(6/n). Solve for n and you get 166,000.)
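The equivalent-sample-size arithmetic is short enough to spell out. A minimal sketch in Python, with function and variable names that are mine and purely illustrative:

```python
import math

def se_top_vs_bottom_third(n, p=0.5):
    """Standard error of the difference in proportions between two groups of
    size n/3 each, with the underlying proportion near p."""
    group = n / 3
    return math.sqrt(p * (1 - p) / group + p * (1 - p) / group)

prior_sd = 0.003                          # 0.3 percentage points
n_equiv = 6 * (0.5 / prior_sd) ** 2       # solve 0.003 = 0.5 * sqrt(6 / n) for n

print(f"equivalent n = {n_equiv:,.0f}")
print(f"s.e. at that n = {se_top_vs_bottom_third(n_equiv):.4f}")
```

Solving exactly gives a little under 167,000, which is the "approximately 166,000" quoted above.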

The data analysis that started all this was based on a survey of about 3000 people. So it’s hopeless. The prior is much much stronger than the data.
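To see how lopsided this is, one can run the standard normal-normal update, treating the 4.7 +/- 4.3 estimate as a normal likelihood and using the N(0, 0.3^2) prior, all in percentage points. This is a sketch under those assumptions (the normal approximation to the likelihood is mine, and the function name is hypothetical), not a reanalysis of the study.

```python
import math

def normal_posterior(prior_mean, prior_sd, estimate, se):
    """Conjugate normal update: precision-weighted average of prior and data."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, math.sqrt(post_var)

# All quantities in percentage points.
post_mean, post_sd = normal_posterior(prior_mean=0.0, prior_sd=0.3,
                                      estimate=4.7, se=4.3)
weight_on_data = (1 / 4.3 ** 2) / (1 / 0.3 ** 2 + 1 / 4.3 ** 2)

print(f"posterior mean = {post_mean:.3f}, posterior sd = {post_sd:.3f}")
print(f"weight on the data = {weight_on_data:.4f}")
```

The data get well under 1% of the weight, so the posterior barely moves from the prior.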

The traditional way of presenting such examples in a Bayesian statistics book would be to use a flat prior or weak prior, perhaps trying to demonstrate a lack of sensitivity to the prior. But in this case such a strategy would be a mistake.

And I think lots of studies have this pattern: we’re studying small effects with small samples and using inefficient between-subject designs (not that there are any alternatives in the sex-ratio example).

Summary

To get back to the general question about priors: yes, modeling can be difficult. In some settings the data are strong and prior information is weak, and it’s not really worth the effort to think seriously about what external knowledge we have about the system being studied. More often than not, though, I think we do know a lot, and we’re interested in various questions where data are sparse, and I think we should be putting more effort into quantifying our prior distribution.

Upsetting situations—for example, the data of 1 +/- 1 which lead to a seemingly too-strong claim of 5:1 odds in favor of a positive effect—are helpful in that they can reveal that we have prior information that we have not yet included in our models.

48 thoughts on “Priors”

Big_Wazaa on July 16, 2013 at 10:04 am said:

This all makes good sense and I find it really interesting.

But I would guess that referees presented with a paper that proposes strong assumptions about priors and proceeds to analyze data based upon them will often freak out. However, I’ve never seen that scenario unfold, so I am just guessing.

Andy, you have probably watched cases like this, right? Any comments on that? I would imagine that someone trying this should alert the editor to what they are up to, perhaps before even submitting the paper. Oy.
JSE on July 16, 2013 at 10:50 am said:

Is this prescriptive or descriptive? In other words, in your examples, would you describe what you’re doing as ‘figuring out what priors you should use in the analysis’ or ‘figuring out what priors you have?’
- Andrew on July 16, 2013 at 11:06 am said:
  
  Jse:
  
  The difficulty here is that “prior” has two meanings. It refers to prior information (i.e., previous data) and to the prior distribution (i.e., part of the model). Thus, I’d say:
  
  “figuring out what prior information we have,”
  
  and
  
  “figuring out what prior distribution we should use.”
  
  It’s generally unrealistic to expect our models to include all the information we have, but we typically want to incorporate available information into our model as much as is feasible.
Jonathan (another one) on July 16, 2013 at 10:59 am said:

This post nicely summarizes your central thesis, Andrew: that models are just as much selected as priors. Maybe I’m wrong about this, but it seems to me the emphasis should be on robustness, not on models or priors. As Big_Wazaa above notes, referees will distrust a result which critically depends on highly informative priors, just as they would (one hopes) distrust a result which depends critically on a specific functional form, without some convincing handwaving (or real evidence) about the prior or the model, as befits the case.

If you get the same basic result (say at least on sign, if not absolute parameter estimate) from a wide variety of plausible priors, you’re in the same boat as someone who gets the same result from a wide variety of functional forms. The more you need a specific model (or narrow prior) to make your case, the worse off you are, unless the model or prior has near-universal appeal.
- Andrew on July 16, 2013 at 11:03 am said:
  
  Jonathan:
  
  I think there are two points here, both of which are important. First, it’s worth looking into robustness to model assumptions. Second, real prior information can be useful in an analysis, and we shouldn’t be embarrassed about that. In Example 2 above, the prior information is much stronger than the data, and it is foolish for people to ignore it. On the contrary, a flat prior distribution in that example makes no scientific sense.
  - Rahul on July 17, 2013 at 3:10 am said:
    
    >>>real prior information can be useful in an analysis, and we shouldn’t be embarrassed about that<<<
    
    So long as there is wide consensus about the prior.
    - Andrew on July 17, 2013 at 8:12 am said:
      
      Rahul:
      
      I think it’s ok to make assumptions even when there is not wide consensus over them. State your assumptions clearly and go from there, then others can dispute them, that’s fine. But if you are worried about your assumptions, please don’t focus just on the prior distribution. Assumptions (often unexamined) in the data model can have much larger consequences.
      
      In any case, if researchers would start by using strong priors in examples such as the sex-ratio study, where there is indisputably a huge amount of prior information, that would be a good start!
    - K? O'Rourke on July 17, 2013 at 11:42 am said:
      
      That’s one of the things I tried to get across – learning how to peer review priors to resolve disputes [reduce reluctance/embarrassment] – in Two Cheers for Bayes [Letters to the Editors]. Controlled Clinical Trials, 17, 350-351.
      
      The supposed availability of a non-informative _pill_ to avoid this _pain_ likely was/is part of the problem.
    - Entsophy on July 17, 2013 at 1:30 pm said:
      
      “So long as there is wide consensus about the prior”
      
      Suppose two statisticians use very different priors. They both find an interval estimate for m whose true value is 100. They report:
      
      Statistician 1: m is in the interval (99,101)
      
      Statistician 2: m is in the interval (-10^9, +10^9)
      
      Which interval estimate do you consider wrong (they both correctly identify an interval that contains the true m)? Why can’t we say they both got it right, but the first statistician’s answer is just more useful?
      
      I believe you’re retaining too much frequentist intuition. Frequentists equate prob=freq, which leads them to believe there is uniquely one right distribution. But it’s simply not true that if two people have very different priors then things automatically go haywire. It’s easily possible for their answers to be consistent both with each other and with the right value.
    - Jonathan (another one) on July 17, 2013 at 4:55 pm said:
      
      … or inconsistent. I think that’s the case we’re worried about.
    - Rahul on July 18, 2013 at 12:29 am said:
      
      Exactly what Jonathan says.
      
      If their differing priors gave (99,101) (101,103) (78,82) now what?
      
      Is there any reason to think prior choice will not cause this?
    - Entsophy on July 18, 2013 at 8:06 am said:
      
      Of course this can happen. It just means one of the statisticians starts out with prior which says the true value is in (0,80), then that implication is clearly wrong and the prior should be judged as wrong because of it.
      
      So you don’t need consensus at all. What you need is that they use priors which don’t make wrong implications about the world.
    - Entsophy on July 18, 2013 at 8:07 am said:
      
      I meant to say: “If one of the statisticians starts out with prior which says the true value is in (0,80), then that implication is clearly wrong and the prior should be judged as wrong because of it.”
    - K? O'Rourke on July 18, 2013 at 9:09 am said:
      
      Entsophy:
      
      > don’t make wrong implications
      make _less_ wrong implications (implications come from models)
      
      > the prior should be judged as wrong
      I would agree but many might side with Jay Kadane’s view (expressed in his insightfully written book Principles of Uncertainty), which seems to be that priors can’t be wrong (or questioned), just updated.
    - Jonathan (another one) on July 18, 2013 at 9:21 am said:
      
      But Entsophy… take Rahul’s example. Which one has the “wrong” prior? And how would we know?
    - Entsophy on July 18, 2013 at 10:21 am said:
      
      K? O’Rourke,
      
      Every distribution P(x), whether it’s a prior, sampling, or posterior distribution, is making the exact same claim about the world: namely, that the true x is in the high probability manifold of P(x). If the claim is good then so is P().
      
      For example, when I assume errors e_1,…,e_n are IID normal, I’m claiming that the errors e_1,…,e_n that actually exist in my data are roughly in the sphere defined by the high probability manifold of the IID normal. I’m not claiming, and it’s generally not true, that present or future errors will have a histogram shaped like a bell curve.
      
      Jonathan (another one),
      
      “take Rahul’s example. Which one has the “wrong” prior? And how would we know?”
      
      It depends on the circumstances. It’s up to the modeler to ensure they take true prior facts and encode them accurately into the prior. In most instances we can bound the range of x accurately. In other instances we have knowledge of a deeper hypothesis space, the multiplicities of which are so heavily weighted towards certain values that we should use highly informative priors.
      
      In the worst case scenario, if you really don’t know anything, then simply use a prior that makes the entire possible region for x coincide with the high probability region of the prior. That guarantees the true value will be in the region suggested by the prior.
    - george on July 18, 2013 at 11:17 am said:
      
      Entsophy;
      
      > Every distribution P(x), whether it’s a prior, sampling, or posterior distribution, is making the exact same
      > claim about the world: namely, that the true x is in the high probability manifold of P(x)
      
      Sampling distributions don’t claim this; there is no single “true” x.
      
      > I’m not claiming, and it’s generally not true, that present or future errors will have a histogram shaped
      > like a bell curve.
      
      The standard assumption, on writing that e_1 … e_n are IID Normal is that, if we repeated the experiment many times, that the e_i would have a histogram bell curve.
      
      If you mean something different by your assumptions, please spell it out for the rest of us.
    - Entsophy on July 18, 2013 at 12:46 pm said:
      
      “Sampling distributions don’t claim this; there is no single “true” x.”
      
      Everything in statistics is really about singular events. I mean this in two different ways. First, as a physical fact, every repeated trial is unique to a given time, place, and configuration of the universe and is never repeated. A repeated trial, taken a whole, is a singular event no different than something like “Obama wins in 2012”.
      
      Second, repeated trials x_1,…,x_n can always be modeled by a distribution P(x_1,…,x_n) whose fitness is judged exactly the way I described: it’s a good distribution if the true x_1,…,x_n lies in the high probability manifold of P(). The “random variables” mythology isn’t needed, and is in fact harmful to the extent that it’s less general. The original sin of Frequentist statistics is to take a very special case of this type of modeling and try to force-fit every example in statistics into an analogous mold.
      
      “The standard assumption, on writing that e_1 … e_n are IID Normal is that, if we repeated the experiment many times, that the e_i would have a histogram bell curve.”
      
      The standard assumption is non-sense on many levels. First, it’s almost never checked that a measuring device is actually N(0,sigma) of the kind you describe. Statisticians simply assume this and go on. Frequentists assume it without checking and then bask in the glow of their superior objectivity and actually believe they have some kind of guarantee that they’re right.
      
      Second, on the few instances when it is checked it’s been found to be wrong. Third, it’s not needed because the only relevant thing is what we know about the errors that exist, not what we imagine future errors to look like. In the limiting case, if we knew the errors in the data precisely, then any knowledge of the histograms of future errors would be completely useless and irrelevant.
      
      To sum up, the standard assumption is almost never checked, almost never true, and completely unneeded. That’s why statisticians can get away with assuming IID so often without it causing more problems than it does. That’s why people like Gelman can recommend that you don’t need to check the normality of residuals in regression models (see one of his textbooks), which is becoming more and more the standard advice.
    - Rahul on July 18, 2013 at 1:01 pm said:
      
      @Entsophy
      
      “Second, on the few instances when it is checked it’s been found to be wrong.”
      
      Do you have a reference for this?
    - george on July 19, 2013 at 1:54 am said:
      
      > Everything in statistics is really about singular events. I mean this in two different ways. First, as a
      > physical fact, every repeated trial is unique to a given time, place, and configuration of the universe and
      > is never repeated.
      
      No. Try regulatory settings. There really are repeated events, and regulators really do care about keeping the rate of false positives in them at or below some acceptable rate.
      
      > Second, repeated trials x_1,…,x_n can always be modeled by a distribution P(x_1,…,x_n) whose fitness is
      > judged exactly the way I described: it’s a good distribution if the true x_1,…,x_n lies in the high
      > probability manifold of P().
      
      No. Try randomization-based inference. All sets of data under the “model” are equally likely, there isn’t a high probability manifold of the sort you describe.
      
      > The “random variables” mythology isn’t needed, and is in fact harmful to the extent that it’s less general.
      > The original sin of Frequentist statistics is to take a very special case of this type of modeling and try to
      > force-fit every example in statistics into an analogous mold.
      
      No. As noted many times on this blog, there’s a whole lot of good one can do with frequentist statistics. Several useful procedures with straightforward frequentist justifications are really ugly with alternative approaches. And accusing methods of “sin” goes nowhere.
      
      > “The standard assumption, on writing that e_1 … e_n are IID Normal is that, if we repeated the experiment
      > many times, that the e_i would have a histogram bell curve.”
      >
      > The standard assumption is non-sense on many levels.
      
      You might think it nonsense, but it’s a standard assumption, as I wrote. You, on the other hand, wrote that “when I assume errors e_1,…,e_n are IID normal, […] I’m not claiming, that present or future errors will have a histogram shaped like a bell curve.” You evidently mean something very non-standard by your statement of IID Normal – and you often make bold claims whilst defying the need for standard assumptions. So please, spell out what it is you do mean.
      
      > Statisticians simply assume this and go on. Frequentists assume it without checking and then bask in the glow
      > of their superior objectivity and actually believe they have some kind of guarantee that they’re right.
      
      Two reasons to be more careful here;
      
      1. Smug complacent auras are occasionally seen around non-frequentists, notably around those not keen on checking assumptions.
      
      2. Assumptions that lead to pretty derivations of a method are not always those that are actually necessary to justify the method. Checking these assumptions which don’t matter really is a waste of time, and can even backfire. Assumptions of Normality in “cookbook” methods are prime examples.
    - Entsophy on July 19, 2013 at 9:50 am said:
      
      Rahul,
      
      Yes I can. Take any one of the considerable number of Frequentists who’ve railed against the fact that NIID assumptions aren’t good in most cases, yet they seem to work unreasonably well in practice.
      
      But it would be more useful, and far more illuminating, if you did what almost no statistician ever does: get out your favorite measuring device and actually carefully measure what errors it produces. First see if the pattern of errors is even stable, let alone IID N(0,sigma). Seriously, the way statisticians throw around supposedly objective assumptions and talk about verifying them, without ever going into a laboratory to see what Mother Nature has to say on the subject is just bizarre.
      
      George,
      
      I don’t think you understood me at all. If you conducted 20 repetitions of a trial on 1 July 2013 you can’t go back and repeat them. You can do another 20 repetitions on 1 Aug 2013, but the universe has changed considerably in the meantime. You’re taking 20 new measurements about the universe in reality. The two sets of trials have exactly the same status as say “Obama wins in 2008” has to “Obama wins in 2012”. Each set is a singular event and Mother Nature can arrange to have any connection between them that she wants.
      
      If each point in the sample space is equally likely then the entire space is in the high probability manifold (i.e. the set {x: P(x)>a} for some a.)
      
      “notably around those not keen on checking assumptions.” I’m very keen on checking assumptions. It’s just that the assumptions that need checking are very different than those officially given. The official story makes claims about the long range frequency of errors. No one really checks these assumptions. The real assumption in play when using IID N(0,sigma) is that the true errors in the data lie in a certain hypersphere. This is very often known (and hence checked) simply because each individual error is on the order of magnitude of sigma or less.
      
      This typically is the only true thing we do know about the errors and represents real facts concerning the measuring device, as opposed to fantasies about the shape of future error histograms, which are pretty much a complete fiction.
      
      But there is no need to take my word for it. Check it yourself. Assume a model y_i = m+e_i and that the e_i are IID N(0,1). Now assume the actual errors in the data are:
      
      e_i =-1,-.5,0,.5,1
      
      and that all future errors are equal to e_i =12+/-.2.
      
      These errors aren’t IID Normal at all, and future errors definitely aren’t N(0,1), but the errors in the data are in the hypersphere given by the high probability manifold of IID N(0,1). Now look at the interval estimates you get for m. Are those interval estimates good ones or not? Pick almost any non-independent, non-normal, non-random errors you like that lie in the hypersphere and you’ll get similar results.
- K? O'Rourke on July 16, 2013 at 12:03 pm said:
  
  Distrusting results from highly informative priors while trusting results from non-informative priors seems like a Whorfian empty gas can story (people smoking by empty gas cans rather than full ones, mistaking empty as less dangerous).
  
  It’s their role, impact and justification on what you are trying to learn about that matters, not their description (over the whole parameter space).
  
  Nice post and examples.
  - R McElreath on July 16, 2013 at 2:41 pm said:
    
    This comment is very timely, as I was just now revising a paper for press in which I try to argue that “noninformative” prior is a trick of language. i.e. there’s no such thing.
    
    Is there a citation for the Whorfian analogy? I am an anthropologist by training, so I know that gas can example well, and would enjoy using it in my publication. But I’d like to give proper credit, either to this blog comment or a publication, if one exists.
    - R McElreath on July 16, 2013 at 2:43 pm said:
      
      To be clear, I mean by “Whorfian analogy” the comparison of the famous gas can example to the use of non-informative priors. This is the first time I’ve seen that connection made, and it strikes me as rhetorically very powerful.
    - ? on July 16, 2013 at 4:46 pm said:
      
      My essay in my first year philosophy course was on linguistic relativity and although I discounted much of Whorf’s thesis, that example really struck a chord. Think Tom McFeet also was struck by it.
      
      Don’t see any need to give credit, and I certainly have not published anything on it, but feel free to use these blog comments if you wish.
    - K? O'Rourke on July 16, 2013 at 4:48 pm said:
      
      Sorry, hit the return key; if the ? mark makes it awkward just replace with Keith.
      
      Would be nice to hear about what you write on it.
  - Jonathan (another one) on July 16, 2013 at 2:43 pm said:
    
    One distrusts results from highly informative priors when there is far less than consensus as to the information contained. No one objects to results from highly informative priors that everyone believes. If there is genuine dispersion of opinion about some parameter, then the result from a weak prior commands more respect than a strong prior. Andrew’s examples point to a strong prior backed up by data, while the weak prior is disconfirmed by data. That’s a different case.
    
    To take a topical (and possibly quite controversial) example, strong priors about state of mind might well convict George Zimmerman, while weak priors virtually guarantee his acquittal. In a world with a wide divergence of actual priors, don’t we want to use the weak prior? Indeed, isn’t at least part of the point of a jury to discover community divergence of prior beliefs and make a verdict contingent on the aggregated community prior, which might well be uninformative or highly informative in any particular circumstance?
Anonymous on July 16, 2013 at 12:05 pm said:

“The data analysis that started all this was based on a survey of about 3000 people. So it’s hopeless. ”

But the prior on sex ratios comes from a census. What if researchers had started out with the census, then measured beauty in a sample of 3000? Now the prior is part of the data, and a reviewer might have no problem including this info in the analysis.

My point is that sometimes you can avoid framing extra information as a prior by changing the nature of the study. Of course, the data are the data, but the framing for editors is different.
Brendon J. Brewer on July 16, 2013 at 4:57 pm said:

It’s interesting that people are troubled about the prior probabilities on parameter space but not the prior probabilities on the data space, which determine the likelihood function.

Both are important and both don’t come from the data set in question.
- Corey on July 16, 2013 at 10:16 pm said:
  
  People are more sanguine about models because there often exist statistics ancillary to the inference of primary interest that have uniform distributions under the model but non-uniform ones if model assumptions are violated. Let me give a concrete example to clarify (hat tip to Aris Spanos).
  
  Suppose I’m modelling a sequence of data y_1,…,y_n as IID normal. My interest lies in the mean; before reporting the usual standard error, I wish to check my independence assumption. One plausible way in which this assumption can be violated is if there are serial correlations. (As a Bayesian, I’d probably just use an AR(1) or AR(2) model, thereby expending some information in data to fit the serial correlation.)
  
  A frequentist statistician can avail herself of a non-parametric test: form the sequence of the signs of the residuals, y_i – y_bar, calculate the number of runs, and then apply the Wald-Wolfowitz runs test. Indeed, even as a Bayesian, I’d feel compelled to take note of a small test statistic: I know such an observation implies that the correlations I would fit in my AR model would have posteriors with little mass near zero.
  
  From a frequentist perspective, model assumptions can often be checked vis-à-vis the data and abandoned if shown inadequate. The situation with priors is not so clear-cut.
  - K? O'Rourke on July 17, 2013 8:08 AM at 8:08 am said:
    
    Corey:
    
    > The situation with priors is not so clear-cut.
    If you are not already aware there is some work on checking for prior-data conflict here https://www.utstat.utoronto.ca/mikevans/research.html
    
    If I understand Andrew, he prefers to check the whole joint model together rather than splitting it into components.
    
    It is very very convenient when information can be split out separately (independent) and that was one of David Cox’s favourite tricks (what led him to proportional hazards regression) but it seems too hopeful (only safe in asymptopia land?)
    
    Also very related to wisely choosing those lower-dimensional projections and the prior over the rest of the parameter space.
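
The runs check Corey describes above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the thread: the function name `runs_test` and the example sequence are made up here, residuals exactly equal to the mean are dropped, and the p-value uses the usual large-sample normal approximation to the Wald-Wolfowitz statistic.

```python
import math


def runs_test(y):
    """Wald-Wolfowitz runs test on the signs of the residuals y_i - y_bar.

    Returns (number of runs, z statistic, two-sided p-value).
    The p-value relies on a normal approximation, so treat it as rough
    for short sequences.
    """
    y_bar = sum(y) / len(y)
    # Sign of each residual; values exactly equal to the mean are dropped
    # (an assumption made for this sketch).
    signs = [v > y_bar for v in y if v != y_bar]
    n_pos = sum(signs)
    n_neg = len(signs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need residuals of both signs")

    # A new run starts whenever the sign changes.
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))

    n = n_pos + n_neg
    expected = 2 * n_pos * n_neg / n + 1
    variance = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n ** 2 * (n - 1))
    if variance <= 0:
        raise ValueError("sequence too short for the normal approximation")

    z = (runs - expected) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return runs, z, p_value


# Hypothetical data with obvious clustering: low values first, then high.
clustered = [1, 1, 1, 1, 1, 5, 5, 5, 5, 5]
runs, z, p_value = runs_test(clustered)
print(runs, z, p_value)
```

A strongly negative z (too few runs) is the pattern that positive serial correlation would produce, which is the case Corey says he would take note of even as a Bayesian.
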
Mayo on July 16, 2013 7:55 PM at 7:55 pm said:

There are plenty of ways to use background information in science without resorting to prior probabilities whose meaning is often so unclear. When I hear someone say, “While I am absolutely sympathetic to the Bayesian agenda I am often troubled by” their requirements, my question is “why, then are you sympathetic”?

As for needing “priors” to determine the likelihood function (as in the last comment), there’s no such thing. The fact that models are partial and approximate—quite deliberately–does not mean we assign “priors” to them. But we do wish to test how good a job they do, and for this we use the error probabilities of methods.
- Entsophy on July 16, 2013 8:47 PM at 8:47 pm said:
  
  “There are plenty of ways to use background information in science without resorting to prior probabilities”
  
  Yes but you’ll have to work incredibly hard to do so even in some very simple examples like this one: https://www.entsophy.net/blog/?p=55. The problem is that if the prior information is highly relevant, but has no effect on the sampling distributions, then frequentists are going to have to twist themselves in a knot to include it (since according to frequentist principles it shouldn’t affect the results).
  
  “whose meaning is often so unclear”
  
  It’s only unclear to you and a few others whose intuition was warped by their first encounters with statistics, which are almost always frequentist. The rest of us understand it fine. If they’re not understood by you, have you considered taking a philosophical approach and questioning some long-held dogmas?
  
  “why, then are you sympathetic”?
  
  Uh… because of the absolute mass of amazing Bayesian applications that go way beyond the simple significance testing you like to champion. Since you and a few other die-hard frequentists really seem to believe Bayesian statistics is total bunk, this led to the bizarre idea that all these Bayesian successes are secretly Frequentist successes. Didn’t you accuse Nate Silver, who used Bayesian statistics to assign a probability to a completely non-random variable like “Obama wins in 2012”, of secretly doing frequentist statistics? Absurd.
  
  “As for needing “priors” to determine the likelihood function (as in the last comment), there’s no such thing”
  
  I think you misunderstood. The point was that picking a likelihood is pretty much indistinguishable in practice from picking a reference prior. Statisticians just pick something reasonable and standard and pretty much never check it in the way you claim is needed for “objectivity”. The way they’re checked is by checking major implications of the model, which incidentally works just as well for priors as it does for sampling distributions. In other words the commenter was echoing a claim made by Gelman that people minutely scrutinize priors but will let any old thing pass for the likelihood.
  
  “for this we use the error probabilities of methods.”
  
  Uh no. There may be some cocoon of applications where people check them with methods approved by the high priests and priestesses of frequentist statistics, but there is a much bigger world of statistics, far away from simple significance testing, in which people do all kinds of things which make no sense in your philosophy. And, rather inconveniently, they’re very successful at it.
  - Walt on July 17, 2013 1:42 AM at 1:42 am said:
    
    You mean if you look at the sample of people who do all kinds of things, they’re very successful? I’m impressed by your endorsement of frequentism, Entsophy.
    - Andreas Baumann on July 18, 2013 7:14 AM at 7:14 am said:
      
      Touché.
    - Entsophy on July 18, 2013 8:01 AM at 8:01 am said:
      
      Mayo made an implicit “for all” statement. One counterexample is enough; that there are many is just gravy. It has nothing to do with samples or any kind of statistics. I believe the kids call it “logic” or something.
    - Walt on July 19, 2013 9:31 AM at 9:31 am said:
      
      Entsophy, all of your comments consist of inventing a straw frequentist position, and then vigorously arguing against it. What does this achieve, other than annoying us with your endless displays of spleen? Mayo is arguing from her specific philosophical take on falsification in a statistical setting, in the setting of science. Since not everything is science, there is no implicit “for all” that applies to all applications, so this is yet another straw position you’ve invented.
      
      There’s no reason why Bayesian methods can’t have good frequentist properties, a fact that has been elaborated at length for like 50 years now, so there’s nothing absurd about claiming that if Nate Silver is good at his job, then his methods must have good frequentist properties.
    - Entsophy on July 19, 2013 10:30 AM at 10:30 am said:
      
      Walt, Mayo’s position is that people should never do Bayesian Statistics, no one understands what priors mean, and that when Bayesians are successful, it’s because they’re secretly doing Frequentist stuff. Every bit of this is rank nonsense, which is why most Frequentists don’t hold these positions. But she’s giving this as sincere advice, from a position of authority on the subject, to newbies in statistics.
      
      Mayo’s claim that “prior probabilities whose meaning is often so unclear” is on the level of those who state “science can’t explain how bumble bees fly” based on a primitive erroneous calculation made by Newton 350 years ago, and then refuse to take seriously any modern Aerodynamics text that would set them straight.
      
      If you think Nate Silver’s “Obama has a 90% chance of winning in 2012” has good Frequentist properties then I eagerly look forward to the paper where the 2012 election is repeated a few hundred times to demonstrate this.
- Andrew on July 16, 2013 9:26 PM at 9:26 pm said:
  
  Mayo:
  
  I don’t mind if people want to use non-Bayesian methods. If they are using Bayesian methods, I recommend they take advantage of the convenient way in which prior information can be incorporated into the analysis. In BDA and in much of my statistical practice, I think I have overemphasized noninformative priors. In recent years I’ve been moving toward more routine use of prior information, and I think I, and the profession, should be moving even more in that direction. The above post gives some reasons why I feel this way.
  - Anonymous on July 16, 2013 10:40 PM at 10:40 pm said:
    
    The question is whether the (implicit) data informing the prior is exchangeable with the data at hand.
  - Rahul on July 17, 2013 3:18 AM at 3:18 am said:
    
    The problem with “more routine use of prior information” is that in many fields there is very little consensus about the prior to use.
    - Daniel Lakeland on July 17, 2013 11:29 AM at 11:29 am said:
      
      I dunno, I think many “uninformative” priors are so ridiculously uninformative that we can frequently make them much more informative without making them so informative that we cause fights over the prior. I’m thinking largely of the common practice of order of magnitude analysis or back of the envelope analysis that has been successful in many areas.
      
      Suppose for example that we have a necessarily positive parameter that represents something difficult to get your head around, such as maybe the relative risk of adverse events between a treatment and a control medication. If we don’t have prior information about which one will have higher risk, the logical thing to consider is a risk ratio of 1. How close is the risk ratio likely to be to 1? Usually before doing a trial there is a certain amount of checking of the performance of the drug in animal models and maybe a very small safety pilot in humans. If we’re even considering doing a trial of the drug it probably isn’t more than 100 times more likely to produce adverse events, unless the absolute level of risk is so low that even 100x the reference level is still small compared to typical drugs. So a prior that’s lognormal with the underlying normal having mean log(1) = 0 and standard deviation equal to log(100) = 4.6 (natural logs) is probably not a bad place to start. Compared to some kind of highly uninformative prior, like say a half-Cauchy with scale parameter equal to 10^6, this is a lot more informative.
      
      This kind of reasoning can be done in many many situations where people currently use some kind of default “uninformative” prior.
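
To make the comparison in Daniel Lakeland's comment concrete, here is a small sketch that computes how much prior mass each choice places on risk ratios between 1/100 and 100. It is only an illustration: the helper names `lognormal_cdf` and `half_cauchy_cdf` and the choice of interval are assumptions made here, while the lognormal parameters and the half-Cauchy scale of 10^6 are the ones quoted in the comment.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal whose underlying normal has mean mu and sd sigma."""
    return 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))


def half_cauchy_cdf(x, scale):
    """CDF of a half-Cauchy distribution on x >= 0 with the given scale."""
    return (2 / math.pi) * math.atan(x / scale)


sigma = math.log(100)  # about 4.6, as in the comment
low, high = 0.01, 100  # risk ratios within a factor of 100 of 1

mass_lognormal = lognormal_cdf(high, 0, sigma) - lognormal_cdf(low, 0, sigma)
mass_half_cauchy = half_cauchy_cdf(high, 1e6) - half_cauchy_cdf(low, 1e6)

print(f"lognormal prior mass in [0.01, 100]:   {mass_lognormal:.4f}")
print(f"half-Cauchy prior mass in [0.01, 100]: {mass_half_cauchy:.6f}")
```

Under these assumptions the lognormal prior keeps roughly two thirds of its mass within a factor of 100 of a risk ratio of 1 (one standard deviation on the log scale), whereas the half-Cauchy with scale 10^6 puts almost all of its mass on risk ratios above 100.
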
Brendon J. Brewer on July 16, 2013 8:02 PM at 8:02 pm said:

“As for needing “priors” to determine the likelihood function (as in the last comment), there’s no such thing. The fact that models are partial and approximate—quite deliberately–does not mean we assign “priors” to them. But we do wish to test how good a job they do, and for this we use the error probabilities of methods.”

p(x|theta) (or p(x;theta) if you want to point out that you never intend to put in p(theta)) describes prior beliefs about what data you might see. To use such a model and not recognise that it is a probabilistic description of prior information can lead to a lot of confusion IMO.

“When I hear someone say, “While I am absolutely sympathetic to the Bayesian agenda I am often troubled by” their requirements, my question is “why, then are you sympathetic”?”

Because they want methods that give posterior probabilities as the output. For me, that’s the entire point of any statistical analysis. I’m not interested in methods that don’t give you that.
Gustav on July 18, 2013 9:55 AM at 9:55 am said:

If we use rather strong priors in our statistical models, do we then risk mainly confirming earlier results? And what if the first studies were completely wrong? Research is about independent replication. I have no problems with people using priors, but it requires that the effect of the prior is presented.
- K? O'Rourke on July 18, 2013 10:24 AM at 10:24 am said:
  
  If the expected replication was for all the parameters (exactly the same studies in exactly the same context) then this can arguably be assessed by post.1/prior being _consistent_ with post.k/prior for all k studies.
  (where post.k ~ prior * likelihood.k)
  
  Now the prior here falls out as long as they all agree on where the prior probability equals 0.
  
  So the prior does not matter at all.
  
  Less conveniently, expected replication is usually not for all parameters but just some (i.e. the control rate may differ in different studies with a treatment effect that should be similar) and then the prior matters again (e.g. https://normaldeviate.wordpress.com/2013/07/13/lost-causes-in-statistics-ii-noninformative-priors/#comment-9480 ).
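
The cancellation in the fully replicated case above (post.k/prior depends only on likelihood.k wherever the prior is nonzero) can be checked numerically on a grid. This is a sketch under assumptions chosen here for illustration: a single binomial proportion, made-up data (7 successes in 20 trials), and two arbitrary priors that are positive everywhere on the grid. None of these specifics come from the thread.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


# Grid over a binomial proportion theta, excluding the endpoints 0 and 1.
grid = [i / 100 for i in range(1, 100)]


def binomial_likelihood(theta, successes, trials):
    """Binomial likelihood up to a constant that does not depend on theta."""
    return theta ** successes * (1 - theta) ** (trials - successes)


def posterior_over_prior(prior, successes, trials):
    """Normalized post/prior on the grid, with post ~ prior * likelihood."""
    prior = normalize(prior)
    post = normalize(
        [p * binomial_likelihood(t, successes, trials) for p, t in zip(prior, grid)]
    )
    return normalize([q / p for q, p in zip(post, prior)])


# Two hypothetical priors, both positive everywhere on the grid.
flat_prior = [1.0 for _ in grid]
skewed_prior = [t ** 4 for t in grid]

ratio_flat = posterior_over_prior(flat_prior, 7, 20)
ratio_skewed = posterior_over_prior(skewed_prior, 7, 20)

# The two ratios agree up to floating-point error: the prior has dropped out.
max_gap = max(abs(a - b) for a, b in zip(ratio_flat, ratio_skewed))
print(max_gap)
```

The check works only because both priors are nonzero on the same set, matching the condition in the comment that everyone agrees on where the prior probability equals 0.
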
Rinke Klein Entink on July 19, 2013 6:13 AM at 6:13 am said:

I’ve been working together with researchers in immunology and allergenicity, where data collection is hard. So I asked my colleagues if they had information from literature etc., that might help inform the statistical model by formulating priors on some model coefficients. From a very applied perspective, I noticed that it did help us a bit in reducing the posterior uncertainty here and there, but more importantly, that my colleagues became much more involved in the whole modeling exercise. It required them to think in a more structured way about what was already known about the problem. We also ran into a prior-data conflict, which was nice too because it challenged their knowledge and the way they looked at the literature in their field. So I think it brought the statistics more to life for them, and made the statistical modeling less of a “trick that we don’t really understand anyway”.