## What is a prior distribution?

Some recent blog discussion revealed some confusion that I’ll try to resolve here.

I wrote that I’m not a big fan of subjective priors. Various commenters had difficulty with this point, and I think the issue was most clearly stated by Bill Jefferys, who wrote:

It seems to me that your prior has to reflect your subjective information before you look at the data. How can it not?

But this does not mean that the (subjective) prior that you choose is irrefutable; Surely a prior that reflects prior information just does not have to be inconsistent with that information. But that still leaves a range of priors that are consistent with it, the sort of priors that one would use in a sensitivity analysis, for example.

I think I see what Bill is getting at. A prior represents your subjective belief, or some approximation to your subjective belief, even if it’s not perfect. That sounds reasonable but I don’t think it works. Or, at least, it often doesn’t work.

Let’s start with a simple example. You hop on a scale that gives unbiased measurements with errors that have a standard deviation of 0.1 kg. To do Bayesian analysis, you assign a N(0,10000^2) prior on your true weight. That doesn’t represent your subjective belief! It’s not even an approximation. No problem—it works fine for most purposes—but it’s not subjective.

More generally, think of all the linear and logistic regressions we use. Instead of thinking of these as subjective beliefs, I prefer to think of the joint probability distribution as a model, reflecting a set of assumptions. In some settings these assumptions represent subjective beliefs, in other settings they don’t.

This article from 2002 might help. If I could go back and alter it, I’d add something on weakly informative priors, but I still agree with the general approach discussed there.

P.S. Just to give an example of what I mean by prior information: The analyses in Red State Blue State all use noninformative prior distributions. But a lot of prior information comes in, in the selection of what questions to study, what models to consider, and what variables to include in the model. For example, as state-level predictors we include region of the country, Republican vote in the previous presidential election, and average state income. Prior information goes into the choice and construction of all these predictors. But the prior distribution is a particular probability distribution that in this case is flat and does not reflect prior knowledge.

One way to think about informative prior distributions is as a form of smoothing: when setting the parameters of a probability distribution based on prior knowledge, we are imposing some time smoothness on the parameters. I think that’s probably a good idea and that the Red State Blue State analyses (among others) would be better for it. I didn’t set up this prior structure because I wasn’t easily equipped to do so and it seemed like too much effort, but perhaps at some future time this sort of structuring will be as commonplace as hierarchical modeling is today.

1. joseph says:

Priors have to be based on objectively true statements of fact. But there is more than one objectively true statement I can make about your weight prior to weighing. I could say:

(A) You weigh less than 100,000,000 kilograms

Or

(B) You weigh less than 2,000 kilograms

Both A and B are objectively true and I believe both of them simultaneously.

Therefore I can construct priors based on either of them. Which I choose is based on purely practical considerations such as “difficulty of implementing” vs “increase in accuracy” and has nothing to do with matters of principle.

2. Paul says:

When employing Expectation-Maximization, how do you characterize the prior distribution in terms of subjective vs. objective? Or is it not even Bayesian, since iterating to convergence maximizes the likelihood?

I don’t follow your argument. A pragmatic subjectivist Bayesian (like me) would say that if you make a good attempt at formalizing your prior beliefs as a prior distribution, and then draw conclusions based on the posterior, you will quite often get good results (though not always – sometimes the situation is just too complex to be able to do this, as is the case for inference from MCMC output, for example). If you try this in your example of weighing yourself on a scale, I think you’ll find that it works very well. I don’t think any subjectivist Bayesian ever claimed that using other methods will ALWAYS produce a bad result.

Regarding assumptions, if some of yours don’t come from subjective beliefs (which of course includes things you believe for reasons usually characterized as objective), then where DO they come from? Perhaps they come from a desire for computational tractability, but surely this has to be accompanied by a subjective belief that they’re close enough to the truth to be OK.

• Andrew says:

The prior, like the likelihood, is part of a model. The model can come from subjective belief, it can come from convention, it can be chosen based on statistical properties such as robustness, it can represent the summary of a literature review, etc.

I agree that your prior can represent (some approximation of) your subjective belief; I just don’t think it has to.

I do think it’s important to include prior information in a statistical model (in the “likelihood” as well as in the “prior distribution”); my problem is with the idea that the prior distribution necessarily represents a subjective belief. That is an argument that has been used to disparage model checking.

4. I like to explain a prior as the hypermodel learned from analyzing a bunch of past data sets – before encountering a new dataset.

“Subjective belief” makes it sound as if the life-long experience of a trained statistician was nothing better than some sloppy random subjective belief of an ignoramus.

5. Manuel Moe G says:

I am baffled.

Actually, in my day job, automating decisions about finite resource allocation for manufacturing, inspired by good Professor Gelman, I would:

(1) consider past similar situations, from my career, *quickly*, because I have read enough Gigerenzer to know my first stab at it would have the expectation of high quality, if not sufficient quality.

(2) make at least one model of “receiving data” that will receive the raw data under consideration this time – the model has the ability to be updated “on-the-fly” with each “chunk” of data

(3) make at least one model of “utility”, that will be used to make decisions about actions toward goals

(4) make at least one model of “receiving data of confirmation of quality of decision” that will receive data about the quality/success/effectiveness of the decisions made – make at least one model of “utility” of “confirmation of quality of decision” to make decisions about decisions

(5) set the relations between all these models – non-linear – a small change in one might mean a massive change in other

(6) adding “fuzz” and dropping data-points, repeatedly, to input measured data, as a cheap-and-cheerful sensitivity/sample-error analysis, so I am not fooled into thinking my answers are more definitive than they truly are.

(7) if need a causal model – follow our friend Professor Pearl – same if need to answer counter-factuals. Otherwise, statistical is preferred, iterative simulations if need be.

(8) The first few weeks will be a pilot run, then improve, possibly making the new version unrecognizable to the previous.

I am supposed to feel guilty that good Professor Gelman is furiously shaking his head in disapproval, but I really don’t see any alternative consistent with all the issues considered over the many months of following Gelman’s blog.

• Andrew says:

Manuel:

Ummm . . . why do you think I’m shaking my head in disapproval?

• Manuel Moe G says:

Because, where is the prior in that setup?, where is the posterior? Where is the subjective segregated from the objective? I am baffled because these things exist in there, I can tease them out one by one, with work and with some license, but the model-of-models I described mixes them thoroughly.

And I am baffled because my lack of imagination prevents me from seeing how to do it, as a good Bayesian should, with objective posterior segregated from: (1) that which is the prior ignorant of the data set at hand, and (2) that which is subjective.

• Manuel Moe G says:

Also cutting and pasting from earlier comment (please excuse my filling your comment section with my overbearing graffiti):

I will be a monkey’s uncle if I had to state what my “informative subjective prior” was in such a setup, and defend its suitability. I would just fall back on “Compared to what?” – plainly state my personal lack of imagination to come up with a workable “conventional” alternative to get the automated decisions needed to fulfill business goals. And an argument from personal lack of imagination to come up with a workable “conventional” alternative is slightly embarrassing.

6. Manuel Moe G says:

Forgot to add to my comment awaiting moderation:

(A) add “fuzz” to model parameters, turn symmetry/smoothness/etc. expectations on/off, to see if decisions radically change, as cheap-and-cheerful sensitivity analysis

(B) collecting results of a number of runs with adding “fuzz” and dropping data-points is not just “cheap-and-cheerful” sensitivity/sample-error analysis, it is the only conceivable way to do sensitivity/sample-error analysis if the total model has graphical/iterative-simulation/non-linear aspects.

(C) I will be a monkey’s uncle if I had to state what my “informative subjective prior” was in such a setup, and defend its suitability. I would just fall back on “Compared to what?” – plainly state my personal lack of imagination to come up with a workable “conventional” alternative to get the automated decisions needed to fulfill business goals.

7. wv says:

I totally agree that the prior, like the likelihood, is part of a model. I don’t even see why so few people agree (at least in psychology). Is this exclusion of the prior from what is considered a model just a historical coincidence, or is it something deeper?

The resistance against this view is widespread. I just got a grant proposal rejected, which, among other things, promised to develop informative priors for several psychological models. The biggest problem for the referees seemed to be that I could not guarantee that I was able to construct “priors without any bias”. I wonder whether, if the grant proposal would have promised to develop a psychological model (without making explicit reference to a prior), the referees would have complained that the model could not be constructed “without any bias”. You would hope that people building models would rely on all relevant knowledge and theorizing they have at their disposal.

8. Nick Cox says:

For “Jeffreys” read “Jefferys”. Andrew knows the difference, really and certainly, but the typo is still there.

http://en.wikipedia.org/wiki/Harold_Jeffreys

http://en.wikipedia.org/wiki/William_H._Jefferys

I imagine that this typo has arisen for Bill Jefferys thousands and thousands of times, especially in Bayesian circles. I’ve seen the reverse too, that “Jeffreys” has often been mangled. Sir Harold preferred Jeffreys’s as a possessive as being less likely to mutate to Jeffrey’s.

• Indeed. It is fun to go to a Bayesian meeting and have newish Bayesians come up and ask me whether I invented Jeffreys priors :<)

Actually, even amongst those who know nothing about statistics (bank clerks and the like), my name will generally get misspelled as 'Jeffreys' unless I take care to get it put down right.

I met Sir Harold once, on the only trip he took to this side of the pond (he didn't like to travel, apparently). I was a grad student in the Yale astronomy department in the early '60s. Sir Harold was, amongst other things, an astronomer. He gave a lecture. He was not a good lecturer, unfortunately.

9. Andrew,

I don’t think your example really works. Sure, if you have a situation where the likelihood dominates so completely, as in your example, so that it hardly matters what your prior is, the posteriors are all going to be very similar. Although my subjective prior in such a case might be something like N(170,40^2), I’d probably say “screw it, the scale is so accurate that I don’t even need a prior, just report the weight it gives ±0.1 kg”. Nevertheless, I would not be wrong to use my subjective prior.
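To put rough numbers on "the posteriors are all going to be very similar", here is a minimal sketch of the standard normal-normal update for a single scale reading. The reading of 75.0 is a made-up value, and both priors are taken in the same units as the scale; only the two priors and the 0.1 measurement error come from the discussion.

```python
import math

def normal_posterior(prior_mean, prior_sd, y, meas_sd):
    """Posterior mean and sd for a normal prior and one normal measurement y."""
    prior_prec = 1.0 / prior_sd**2   # precision = 1 / variance
    data_prec = 1.0 / meas_sd**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * y)
    return post_mean, math.sqrt(post_var)

y = 75.0  # hypothetical scale reading (assumption, not from the post)

vague = normal_posterior(0.0, 10000.0, y, 0.1)    # the N(0, 10000^2) prior
informed = normal_posterior(170.0, 40.0, y, 0.1)  # the N(170, 40^2) prior

print("vague prior    ->", vague)
print("informed prior ->", informed)
```

Under these assumptions the two posterior means differ by well under a thousandth of a unit, which is the sense in which the likelihood dominates here.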

But consider another example. Suppose I’m a manager of a baseball team (major league); I’ve got a batter who has been at bat in the early part of the season 45 times, and has 15 hits, for a batting average of 0.333. I want to predict his average at the end of the season. How do I do it? Well, I read the article that Brad Efron and Carl Morris wrote many years ago in Scientific American. I recall enough from that article to know that a reasonable prior on batting averages of baseball players is something like a beta with mean around 0.24-0.25 and standard deviation around 0.035-0.040. If I use something like that as a subjective prior, (subjective because it depends on and accurately reflects information that I have as background information), then I can make a prediction about that batter’s average at the end of the season. I could even use that to make a point prediction, and if I bet that number as the final average against someone who bets that it will be 0.333 (closest number wins), there is a better than even chance that I will win the bet. I will have a good chance of winning the bet because I have not ignored relevant information that I have.
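A rough sketch of that calculation, assuming a beta prior matched by moments to a mean of 0.245 and a standard deviation of 0.0375 (midpoints of the ranges quoted above, chosen here only for illustration), updated with 15 hits in 45 at-bats:

```python
def beta_from_mean_sd(mean, sd):
    """Method-of-moments Beta(a, b) with the given mean and standard deviation."""
    total = mean * (1.0 - mean) / sd**2 - 1.0  # this is a + b
    return mean * total, (1.0 - mean) * total

# Prior roughly matching "mean around 0.24-0.25, sd around 0.035-0.040".
a, b = beta_from_mean_sd(0.245, 0.0375)

hits, at_bats = 15, 45
post_a = a + hits                 # conjugate beta-binomial update
post_b = b + (at_bats - hits)
post_mean = post_a / (post_a + post_b)

print(f"prior Beta({a:.1f}, {b:.1f}), posterior mean {post_mean:.3f}")
```

The posterior mean lands much closer to the prior mean than to 0.333. Note that this is an estimate of the batter's underlying hitting probability, which is not quite the same thing as his end-of-season average, since the first 45 at-bats count toward that average.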

By the way, you didn’t address my question: Maybe I do not understand the “doctrine of subjective priors.” What is that? I don’t think of things in terms of some abstract doctrine. Like you, I consider the prior(s) and likelihood(s) to be part of the model, and I’m looking for the model(s) that best describe the data and make the best posterior predictions. And when considering priors, I’ll want to use priors that reflect my background information, if that’s going to materially affect the results.

• Andrew says:

Bill:

A subjective prior is fine. I just don’t think you have to use a subjective prior. You wrote, “It seems to me that your prior has to reflect your subjective information before you look at the data. How can it not?” I gave some examples of prior distributions that are not my subjective beliefs, that I would not bet on, etc.

• Andrew:

Do you think that it is OK to ignore prior information that You (subjectively) actually have, when it will make a significant difference in Your inference? That is, do you think that it is OK in that circumstance to use a “weakly informative prior” when you have better prior information?

I have no problem with priors that make no material difference, as in your example where the likelihood is so sharp as to make almost any prior give essentially the same posterior. That is the example that you gave where you would be willing to use priors that are not your “subjective beliefs, that you would not bet on, etc.”

I agree that You do not *have* to use a subjective prior under any circumstance, in particular when the prior has almost no influence on the posterior; but I do think that if You have prior information that You believe to be reliable, that would materially affect the inference (because the likelihood in Your model tells You that it is so), that You should use that information when constructing Your prior.

I think that you are overinterpreting my comment “It seems to me that your prior has to reflect your subjective information before you look at the data. How can it not?” I’m thinking here in the context of prior+likelihood models where it makes a material difference what the prior is, not in an artificial example where the posterior is virtually independent of almost any prior.

I am still waiting for you to tell me what the “doctrine of subjective priors” means. I have no idea about what you mean by this phrase. Stating what that is would help very much in my understanding of your point of view.

10. Joseph says:

If I know two things, call them I_1, and I_2, and need a prior for X then I can use either:

P(X| I_1, I_2)

or

P(X| I_1)

Neither is wrong in any way, although the latter is likely more spread out than the former (has a higher entropy). Both can be refuted by showing that I_1 is false. The former can be refuted by showing I_2 false.

Which is used is determined by convenience. If using the latter requires less effort and already answers whatever question I need it for then use it. If P(X| I_1) doesn’t have the fidelity to answer the question I’m interested in, then go through the extra effort to use P(X| I_1, I_2) and hope it does.

For the life of me I can’t fathom how any of this was ever unclear or controversial to anybody. I really can’t.

• johnbyrd says:

How do you know that it already answers whatever question you need it for? Sounds circular to me.

• Joseph says:

Say you need to know if X>100. If you use P(X|I_1) and find X is in [90,120] then that doesn’t answer the question.

If you use P(X|I_1,I_2) and find that X is in [105,110] then that answers the question.

11. revo11 says:

I like the idea of thinking of a prior as basically a regularization tool for incorporating assumptions – smoothness / domain constraints etc.

So how does one conceptually understand the prior as an assumption? Conceptually, people tend to think of assumptions as inviolable – anything that violates an assumption is outside the solution space. Whereas a prior is _combined_ with the data, and (except for sharp priors) there’s generally no analogue to “violating” a prior. In that sense, a prior might be a “smoothed” assumption, but then, I also think people have a hard time with the idea of combining an assumption with data. Perhaps a distinguishing characteristic of bayesians and frequentists is that frequentists think of assumptions as discretely holding or not within the model, while bayesians do not?

• Andrew says:

Revo11:

In statistics we are already comfortable with assumptions such as normality and logistic responses that are merely approximations, and also assumptions in data collection such as simple random sampling, so I don’t see the problem with further assumptions about the distribution of parameters.

• Joseph says:

“So how does one conceptually understand the prior as an assumption?”

Here’s a method for doing this:

(I) Imagine all you had was the prior (there was no data and no likelihood available, you probably do have this, but just imagine)
(II) Draw conclusions from the prior alone
(III) If those conclusions are compatible with known truths (not including the data) then the prior is fine. If they do conflict with known truths then you need a new prior.

Three examples will make this concrete.

Example 1: suppose I had a prior of N(mean=10,000kg, sd=10kg) for Gelman’s weight. If all I had was this and nothing else then I would draw the conclusion “Gelman weighs more than 5,000kg”. I can see from his picture this conclusion is false so I need a new prior.

Example 2: suppose I had a prior of N(mean=0, sd=10,000kg) for Gelman’s weight. If all I had was this and nothing else then I would draw the conclusion “Gelman weighs less than 20,000kg”. I can see from his picture that he does in fact weigh less than 20,000kg so the prior’s fine.

Example 3: suppose I had a prior of N(mean=0, sd=2,000kg) for Gelman’s weight. If all I had was this and nothing else then I would draw the conclusion “Gelman weighs less than 4,000kg”. I can see from his picture that he does in fact weigh less than 4,000kg so the prior’s fine.

So both 2 and 3 are fine in the sense that the prior encodes known truths, but 1 is not because it conflicts with them. Whether 2 or 3 is used is a purely practical question. The fact that the priors encode objective truths seems to throw the Subjectivists for a loop. And the fact that more than one prior can be formed from objective truths seems to throw Objectivists for a loop.

So this method has the advantage of both working and making everybody upset. Enjoy.
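One way to mechanize the three examples is sketched below. It reads "the conclusion drawn from the prior alone" as "the quantity lies within two standard deviations of the prior mean", and it stands in for "known truths" with a plausible human weight range of 40 to 200 kg. Both of those choices are assumptions made for this illustration, not part of the method as stated.

```python
def prior_interval(mean, sd, k=2.0):
    """Interval mean +/- k*sd, used as the prior's 'high probability' region."""
    return mean - k * sd, mean + k * sd

def compatible(mean, sd, known_low, known_high):
    """True if everything we already know to be possible sits inside the region."""
    low, high = prior_interval(mean, sd)
    return low <= known_low and known_high <= high

# Assumed background knowledge: an adult weighs somewhere in 40-200 kg.
KNOWN_LOW, KNOWN_HIGH = 40.0, 200.0

priors = {
    "Example 1": (10000.0, 10.0),
    "Example 2": (0.0, 10000.0),
    "Example 3": (0.0, 2000.0),
}

results = {name: compatible(m, s, KNOWN_LOW, KNOWN_HIGH)
           for name, (m, s) in priors.items()}
for name, ok in results.items():
    print(name, "passes" if ok else "fails")
```

This check only asks whether the known range falls inside the prior's region. It says nothing about how much prior mass sits on values that are known to be impossible.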

• fred says:

In Example 2 and Example 3 you’re putting half the prior on negative weights. This is in considerable conflict with the known truths, so by your argument they should fail just like Example 1 fails.

12. joseph says:

Very true. Get rid of them and work on one that implies his weight is positive. Perhaps use the fact that Gelman is human and doesn’t have an eating disorder, or his height if known. You could probably get a pretty tight prior before ever taking a measurement.

The point is that a prior, like any probability distribution, means whatever concrete conclusions you draw from it.

If I just handed you a distribution, and never told you it was a prior, and said “what conclusions would you draw from this?” then the distribution will make a good prior if those conclusions are true.

• James Annan says:

No, the conclusions will never be “true” unless the prior is a delta-function at the correct location (in the case of epistemic uncertainty concerning a physical fact). *Some* conclusions may be *reasonable* – indeed, all of them could be, if the prior really is a good one, but the examples you have provided are pretty horrible, for the reasons mentioned.

• joseph says:

Really? The conclusion “Gelman weighs less than 20,000kg” isn’t true? The man’s a whale! I wonder how he types?

Incidentally, the incredibly horrible, ridiculous, despicable, and downright wrong prior N(mean=0, sd=20,000kg) is, for all practical purposes, equivalent to the uniform (improper) prior in this case and so the actual Bayesian 95% posterior interval you get using it will be identical to the 95% Confidence Interval (assuming the usual iid normal for the likelihood).

So that horrible prior gives the same final answer as the standard textbook solution for the simplest most common type of application of statistics to science (estimating the mean of a series of measurements).

Feel free to draw your own conclusions from that.
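A small numerical check of that equivalence, as a sketch: the five readings are invented, and the measurement standard deviation is assumed known at 0.1 so that the classical interval is the simple z-interval.

```python
import math

measurements = [74.9, 75.1, 75.0, 75.2, 74.8]  # hypothetical readings in kg
sigma = 0.1                                    # assumed known measurement sd
n = len(measurements)
ybar = sum(measurements) / n

# Classical 95% confidence interval for the mean with known sigma.
half_width = 1.96 * sigma / math.sqrt(n)
ci = (ybar - half_width, ybar + half_width)

# Posterior under the N(0, 20000^2) prior (conjugate normal update).
prior_mean, prior_sd = 0.0, 20000.0
post_prec = 1.0 / prior_sd**2 + n / sigma**2
post_mean = (prior_mean / prior_sd**2 + n * ybar / sigma**2) / post_prec
post_sd = math.sqrt(1.0 / post_prec)
bayes = (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)

print("confidence interval:", ci)
print("posterior interval: ", bayes)
```

With these inputs the two intervals agree to many decimal places. They are not exactly identical, since the proper prior still contributes a tiny amount of precision.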

• fred says:

I think you understate the extent of the problem. Sure, for very particular models (e.g. estimating the mean in a Normal location problem) we can get away with putting “horrible” flattish or just improper flat priors on parameters. But under minor variants (e.g. a log-linear mean with i.i.d. Normal errors) this approach doesn’t work; using the flat prior the posterior is improper, and can’t be used for inference. Exactly how flat you make the “horrible” prior really matters, the inference drawn can be very sensitive to this choice.

Figuring out which weakly-informative priors do give well-behaved inference in complex (or even moderately complex) models is a major challenge; see for example Andrew’s work on priors for variances of random-effects distributions.

NB If the difficulties in doing good Bayesian analysis were as minor as your comments suggest, all good statisticians would choose Bayes for everything. They don’t, and this is an informed decision.

• Joseph says:

I didn’t mean to smooth over the kinds of real problems you mention. It’s just I had a very different, less practical, goal. I’ve seen quite a few statements here puzzling over priors.

For example revo11 above (who I read to be a Bayesian) said:

“there’s generally no analogue to “violating” a prior”

While Deborah Mayo (a Frequentist) has said in a previous post:

“One of the big problems in even thinking about testing priors is the unclarity as to what they are intending to measure”

I assumed that these puzzlements arise because both Frequentists and Objective Bayesians want to think of distributions as testable, but priors have no hope of a frequency interpretation.

But you can actually test priors. All I wanted to do was convey the basic idea of how this was done to get people going in the right direction, not go into all the real world practical details.

So how do you test a prior like X~N(0,10000) which is the example Gelman used above? The trick is to forget about the distribution – which is something of a phantom anyway – and concentrate on what the prior is saying about the real world. Unlike the ameba of the prior, statements about the real world can be tested.

If you were given this distribution for X and told nothing about it and were asked “what range is X likely to be in?” Both Bayesians and Frequentists would say something like,

-10000*1.96 < X < 10000*1.96

Now thinking that X represents Gelman’s weight, we can now ask:

“Is Gelman’s weight in the interval [-19600kg, 19600kg] ?”

Well the answer is obviously yes. It’s not approximately true or “true” or reasonably true. It’s no kidding actually true. You might even say that the prior N(0,10000) encodes this truth.

And that’s why the inference doesn’t go haywire when you use it.

• James Annan says:

Joseph, you are being very selective about your interpretation of the prior. It also encodes the claim that Andrew Gelman is (with 99% probability) either larger than 250kg or has a negative mass. That is, it claims that he is almost certainly not in the usual range of human weights.

“Is Gelman’s weight in the interval (-∞,0) U (250,∞)”

Well the answer is obviously no, but the prior assigns 99% probability.

While it’s true that sufficiently precise data can sometimes overwhelm a crazy prior, that does not mean the prior was not crazy.
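Both readings of the prior can be checked with a few lines of arithmetic. This sketch treats N(0,10000) as having standard deviation 10000 kg, which is how the 1.96 interval earlier in the thread uses it, and builds the normal CDF from the standard library's error function.

```python
import math

def normal_cdf(x, mean, sd):
    """CDF of a normal distribution, written in terms of math.erf."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

MEAN, SD = 0.0, 10000.0

# Mass inside [-19600, 19600]: the "true" statement the prior encodes.
p_wide = normal_cdf(19600.0, MEAN, SD) - normal_cdf(-19600.0, MEAN, SD)

# Mass on negative weights, and on weights in the range (0, 250) kg.
p_negative = normal_cdf(0.0, MEAN, SD)
p_0_to_250 = normal_cdf(250.0, MEAN, SD) - p_negative
p_outside = 1.0 - p_0_to_250

print(f"P(-19600 < X < 19600) = {p_wide:.4f}")
print(f"P(X < 0)              = {p_negative:.4f}")
print(f"P(X < 0 or X > 250)   = {p_outside:.4f}")
```

Under this reading the wide interval holds about 95% of the prior mass, half of the mass is on negative weights, and roughly 99% lies outside 0 to 250 kg.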

• joseph says:

James,

Both you and Fred hit on excellent points which weren’t covered by my loose comments, but would be needed in practice. To answer yours first,

A probability distribution for Gelman’s weight is really being truthful when it puts the true value of his weight in its high probability region. Why is that so important? Because Bayes’ theorem will then map the high probability region of the prior to a (hopefully smaller) high probability region of the posterior. So a better way to express what a distribution is saying about the real world is “the true value lies in the high probability manifold”.

The high probability manifold is going to be of the form {X : P(X)>constant} , which for a prior like X~N(0,10000) leads to intervals of the form [-delta,+delta].

That’s not the end of it though. Clearly if a prior like N(1000,1) puts its high probability manifold on [999,1001] for Gelman’s weight then it is going to be wrong and can be rejected. However there are lots of priors which put their high probability region over a reasonable area. Which to choose? Or does it even matter? Well it sometimes matters a lot.

For example, if the high probability region for Gelman’s weight’s prior is over the region [0,1000] it may still wiggle a lot at the top of the distribution. Once the likelihood focuses its gaze on those portions of the prior, then those wiggles can make big differences to the posterior. The problem is, like you suggest, that just because a prior agrees with what you know beforehand in one respect it may not do so in others.

So ideally, what you want in a prior is a distribution which encodes what you know to be true, but is somehow maximally non-committal about everything else. In general, I believe this is hard to do. The only general principle I know of for doing this is the Maximum Entropy Principle [MEP].

If what you know beforehand can be expressed as constraints on the distribution, then by maximizing the entropy subject to those constraints, you’re ensuring the prior agrees with what you know. On the other hand, by maximizing the entropy, you’re ensuring the distribution is as non-committal as possible with respect to everything else. It should be noted that many effective priors in practice like Normals, Gammas and so on, are Maximum Entropy Distributions.
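As a small illustration of the maximum entropy property for the Normal: among distributions on the real line with a given mean and variance, the Normal has the largest differential entropy. The sketch below compares it with two alternatives of the same variance using their closed-form entropies. The choice of Laplace and uniform as comparisons is mine, and two examples are of course not a proof.

```python
import math

sigma = 1.0  # common standard deviation for all three distributions

# Normal(0, sigma^2): entropy = 0.5 * ln(2 * pi * e * sigma^2)
h_normal = 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)

# Laplace with scale b has variance 2*b^2 and entropy 1 + ln(2*b)
b = sigma / math.sqrt(2.0)
h_laplace = 1.0 + math.log(2.0 * b)

# Uniform of width w has variance w^2 / 12 and entropy ln(w)
w = sigma * math.sqrt(12.0)
h_uniform = math.log(w)

print(f"normal  {h_normal:.4f}")
print(f"laplace {h_laplace:.4f}")
print(f"uniform {h_uniform:.4f}")
```

With the variance held fixed, the Normal comes out on top, which is the sense in which it is the least committal choice given only a mean and a variance.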

• joseph says:

James,

One more clarification about the high probability manifold. The ideal is for the true value of the weight to start out in the high probability manifold of the prior and then wind up in the smaller high probability manifold of the posterior.

If the true value starts out in the high probability manifold of the prior and the Likelihood is squared away then this will happen.

If the true value starts out way out in the tails, away from the high probability manifold, then it’s possible for the Likelihood to be good, but still not get the true value in the high probability manifold of the posterior.
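Here is a rough numerical version of that last case, using the N(1000,1) prior mentioned earlier and the scale with measurement standard deviation 0.1. The true weight of 80 kg and a reading exactly equal to it are invented for the illustration, and "high probability manifold" is approximated as within two standard deviations of the mean.

```python
import math

def update(prior_mean, prior_sd, reading, meas_sd):
    """Normal-normal posterior (mean, sd) after one reading."""
    w_prior = 1.0 / prior_sd**2
    w_data = 1.0 / meas_sd**2
    mean = (w_prior * prior_mean + w_data * reading) / (w_prior + w_data)
    return mean, math.sqrt(1.0 / (w_prior + w_data))

def in_region(value, mean, sd, k=2.0):
    """Is value within k standard deviations of the mean?"""
    return abs(value - mean) <= k * sd

TRUE_WEIGHT = 80.0   # hypothetical true value in kg
READING = 80.0       # hypothetical reading, taken equal to the truth
MEAS_SD = 0.1

bad_mean, bad_sd = update(1000.0, 1.0, READING, MEAS_SD)    # truth far in the tail
good_mean, good_sd = update(0.0, 10000.0, READING, MEAS_SD)  # truth inside the region

bad_ok = in_region(TRUE_WEIGHT, bad_mean, bad_sd)
good_ok = in_region(TRUE_WEIGHT, good_mean, good_sd)

print(f"N(1000,1) prior:     posterior mean {bad_mean:.2f}, covers truth: {bad_ok}")
print(f"N(0,10000^2) prior:  posterior mean {good_mean:.2f}, covers truth: {good_ok}")
```

Even with a perfectly good likelihood, the N(1000,1) prior drags the posterior mean to about 89 kg with a posterior standard deviation near 0.1, so the true value ends up far outside the posterior's high probability region.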

13. fred says:

@Joseph: Frequentists don’t consider “what range is (the parameter of interest) likely to be in?” a sensible question. They view the parameter of interest as a fixed unknown, and they only use probability calculus to describe random variables. They *can* ask “how often would I expect this interval to cover the truth?” or “what are my chances of rejecting the null if the truth is that beta=42?”. But the question “how much belief should I have that beta is between -20000 and +20000?” can not be answered directly, using frequentist methods.

• Joseph says:

Sure. But I was imagining a scenario where you didn’t tell the frequentist that X was a parameter and that it was a prior. If they thought it was a sampling distribution for a R.V. X then they’d give the same numerical answer with a bunch of different verbiage.

It’s artificial but the real point was to imagine how you react to the distribution if you took it seriously as a distribution in its own right and not as merely an adjunct to the likelihood . I should have just left off any mention of Frequentists.