The traditional answer is that the prior distribution represents your state of knowledge, that there is no “true” prior. Or, conversely, that the true prior is an expression of your beliefs, so that different statisticians can have different true priors. Or even that any prior is true by definition, in representing a subjective state of mind.

I say No to all that.

I say there is a true prior, and this prior has a frequentist interpretation.

**1. The easy case: the prior for an exchangeable set of parameters in a hierarchical model**

Let’s start with the easy case: you have a parameter that is replicated many times, the 8 schools or the 3000 counties or whatever. Here, the true prior is the actual population distribution of the underlying parameter, under the “urn” model in which the parameters are drawn from a common distribution. Sure, it’s still a model, but it’s often a reasonable model, in the same sense that a classical (non-hierarchical) regression has a true error distribution.

**2. The hard case: the prior for a single parameter in a model (or for the hyperparameters in a hierarchical model)**

OK, now for the more difficult problem in which there is a unitary parameter. Or parameter vector, it doesn’t matter, the point is that there’s only one of it, it’s not part of a hierarchical model and there’s no “urn” that it was drawn from.

In this case, we can understand the true prior by thinking of the set of all problems to which your model might be fit. This is a frequentist interpretation and is based on the idea that statistics is the science of defaults. The true prior is the distribution of underlying parameter values, considering all possible problems for which your particular model (including this prior) will be fit.

Here we are thinking of the statistician as a sort of Turing machine that has assumptions built in, takes data, and performs inference. The only decision this statistician makes is which model to fit to which data (or, for any particular model, which data to fit it to).

We’ll never know what the true prior is in this world, but the point is that it exists, and we can think of any prior that we do use as an approximation to this true distribution of parameter values for the class of problems to which this model will be fit.

**3. The hardest case: the prior for a single parameter in a model that is only being used once**

And now we come to the most challenging setting: a model that is only used once. For example, we’re doing an experiment to measure the speed of light in a vacuum. The prior for the speed of light is the prior for the speed of light; there is no larger set of problems for which this is a single example.

My short answer is: for a model that is only used once, there is no true prior.

But I also have a long answer, which is that in many cases we can use a judicious transformation to embed this problem into a larger class of exchangeable inference problems. For example, we consider all the settings where we’re trying to estimate some physical constant from experiment and prior information from the literature. We summarize the literature by a N(mu_0, sigma_0) prior. In this case we can think of the inputs to the inference as being mu_0, sigma_0, and the experimental data, in which case the repeated parameter is the prediction error. And, indeed, that is typically how we think of such measurement problems.
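The measurement version of this embedding is easy to sketch in the conjugate normal case. A minimal illustration (the function name and all numbers are hypothetical, not from the post), where mu_0 and sigma_0 summarize the literature and the data are n noisy measurements:

```python
import math

def normal_update(mu0, sigma0, ybar, sigma_y, n):
    """Combine a N(mu0, sigma0^2) literature prior for a physical constant
    with n measurements of known per-measurement sd sigma_y and mean ybar.
    Returns the posterior mean and sd."""
    prec = 1.0 / sigma0**2 + n / sigma_y**2        # posterior precision
    mu = (mu0 / sigma0**2 + n * ybar / sigma_y**2) / prec
    return mu, math.sqrt(1.0 / prec)

# Hypothetical inputs: the literature says 300000 +/- 2000 km/s, and four
# measurements average 299790 km/s with sd 1000 km/s each.
mu, sd = normal_update(300000.0, 2000.0, 299790.0, 1000.0, 4)
```

Across many such problems the inputs (mu_0, sigma_0, data) vary, and the quantity that repeats from problem to problem is the standardized prediction error, which is what makes the embedding exchangeable.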

For another example, what’s our prior probability that Hillary Clinton will be elected president in November? We can put together what information we have, fit a model, and get a predictive probability. Or even just use the published betting odds, but in either case we are thinking of this election as one of a set of examples for which we would be making such predictions.

**What does this do for us?**

OK, fine, you might say. But so what? What is gained by thinking of a “true prior” instead of considering each user’s prior as a subjective choice?

I see two benefits. First, the link to frequentist statistics. I see value in the principle of understanding statistical methods through their average properties, and I think the approach described above is the way to bring Bayesian methods into the fold. It’s unreasonable in general to expect a procedure to give the right answer conditional on the true unknown *value* of the parameter, but it does seem reasonable to try to get the right answer when averaging over the problems to which the model will be fit.

Second, I like the connection to hierarchical models, because in many settings we can think about a parameter of interest as being part of a batch, as in the examples we’ve been talking about recently, of modeling all the forking paths at once. In which case the true prior is the distribution of all these underlying effects.

Since you bring up the speed of light, consider the early crude measurements of it. Based on knowledge of astronomy K and measurements D1, people were able to get a posterior P(c|D1,K) for the speed of light very early on. The crudeness, though, meant the credible set for this was something like 200,000-400,000 km/sec, which is a big improvement over “I have no idea” but still has substantial uncertainty.

A century later, when more precise measurements D2 were made, they used P(c|D1,K) as their prior and got a new posterior P(c|D2,D1,K), which gives an interval of 280,000-320,000 km/sec. Substantially less uncertainty.

Now the down-to-Earth, common-sense, never-get-you-into-trouble interpretation of the prior P(c|D1,K) is that it represents what the state of information “D1,K” has to say about c, and the interval 200,000-400,000 km/sec is interpreted as an uncertainty interval defining the potential values for c reasonably consistent with “D1,K”.

But you’re saying this is wrong because this prior isn’t the “true prior, [which] has a frequentist interpretation,” and we should instead be searching for “the true prior, [which] is the distribution of underlying parameter values, considering all possible problems for which your particular model (including this prior) will be fit.” And you honestly believe this is going to bring clarity and progress to statistics?

A follow-up question: suppose the initial astronomical data had been some different D1′, which led to a later prior P(c|D1′,K) with interval 250,000-450,000 km/sec. But suppose it still led to essentially the same posterior after the better measurements D2.

How does your “one true prior” explain that? You can’t just say the two priors represent different states of information. Somehow you have to claim there’s only one true prior but radically different priors can be used in practice and you can still get good results.

At the very least that limits the usefulness of your “one true prior” theory. The real question is which radically different priors can be used and still work great. The answer to that question has absolutely nothing to do with your frequentist interpretations.

>>>which radically different priors can be used and still work great<<<

Is that akin to asking: Which wrong models can still give the right answer?

No. Any prior P(c|I) for the speed of light will work well when P(c*|I) is high at the true value of the speed, c*. The higher it is, the better it will work.

Gelman and these other knuckleheads can give their philosophical opinions on how there must be a unique true prior until they’re blue in the face, but the inescapable mathematical fact is that any prior which makes P(c*|I) reasonably (relatively) high will work. Moreover, it will work better the higher it is. In the limit, the best prior will thus be a delta function about c*.
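For the conjugate normal case, the quantitative part of this claim can be checked by simulation. A sketch under assumed settings (the measurement noise, prior widths, seed, and replication count are my choices, not from the comment): both priors below are centered at the true value c*, differing only in how concentrated they are.

```python
import math, random

def post_mean(mu0, sigma0, y, sigma_y):
    """Posterior mean for a N(mu0, sigma0^2) prior and one N(c, sigma_y^2) datum."""
    prec = 1.0 / sigma0**2 + 1.0 / sigma_y**2
    return (mu0 / sigma0**2 + y / sigma_y**2) / prec

c_star, sigma_y = 299792.458, 5000.0   # true value; assumed experiment noise (km/s)
rng = random.Random(0)
reps = 5000
sq_tight = sq_broad = 0.0
for _ in range(reps):
    y = rng.gauss(c_star, sigma_y)     # one noisy experiment
    # near-delta prior at the truth vs. a very diffuse prior at the truth
    sq_tight += (post_mean(c_star, 10.0, y, sigma_y) - c_star) ** 2
    sq_broad += (post_mean(c_star, 1e5, y, sigma_y) - c_star) ** 2
rmse_tight = math.sqrt(sq_tight / reps)
rmse_broad = math.sqrt(sq_broad / reps)
```

The tighter the prior concentrates near c*, the smaller the posterior error; in the limit of a delta function at c* the error is zero, which is the (trivial) sense in which the claim holds.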

Could you provide a mathematical proof of these inescapable mathematical facts, including careful statements of all assumptions?

Seriously? You can’t supply the proofs yourself?

I tell you what, how about this. Suppose you take one of Gelman’s

“distribution of underlying parameter values, considering all possible problems for which your particular model (including this prior) will be fit,” and presume this is exactly known to you, and I’ll take my prior to be a delta function about the true value c*, and let us both plug these priors into Bayes’ theorem and we’ll see who gets a better posterior estimate for c*. Or how about you take your Gelman prior (again assumed exactly known) and I take my prior to be N(c*, 10 km/s), and we’ll see who has the better posterior estimates in that case as well.

Well it’s not clear to me what you’re arguing and how that relates to Gelman’s point. I may be a knucklehead but I simply want a clear mathematical statement with associated clear assumptions and a clear mathematical proof of this statement so that I may avoid the dangers of philosophy.

When you say ‘In the limit, the best prior will be a Delta function about c*’ would you call this ‘best prior’ a ‘unique’ and/or ‘true’ prior? Or is it ‘best’ but not unique and/or not true?

@ojm I think it’s best in the sense that the loss function (presumably MSE or something reasonable) of the posterior predictive distribution is small. My interpretation is that if you care about the correct value, calibrating the prior can be harmful.

The argument for calibration is circular from that perspective. Statisticians want to do inference based on frequentist properties. To do that, you need a calibrated prior. But what if you don’t want to make inferences based on frequentist properties in the first place? What if my model is a means of making a measurement or a prediction?

Ojm,

I get it that you, Gelman, Hennig, frequentists and everyone else want to argue philosophy. Your philosophy tells you there’s a “true” prior which in principle you should be striving toward. The closer you get to it, the happier the statistics gods will be. You’re all so enamored with your own philosophy, you have no need to examine the facts. But it’s fun every now and again to take a peek at those facts just for the hell of it:

For most problems there’s an infinite number of priors, which look nothing like your “true” prior however defined, which will work and yield usable results in the sense that the prior by itself won’t actively mislead.

Moreover, there’s an infinite number of priors, which look nothing like your “true” prior, which will yield better, more accurate, estimates.

So please go ahead and regale us with your deep thoughts on the importance of “true” priors.

P.S. The request for proof is particularly obnoxious. A couple of weeks ago I provided a counterexample to the claim that Bayesians never change the prior based on the data. Then, like now, I cut through the philosophical nonsense with facts. But in that case I actually took the trouble to prove it explicitly in the comments. The result was that I was called variously a child, a troll, and a coward (I kid you not). So no, I’m not going to waste my time proving simple stuff for you.

Laplace, I take it you are pretty influenced by Jaynes. What is your opinion of the maximum entropy method for coming up with priors? Isn’t the prior that maximizes entropy relative to your background information Jaynes’s version of “the true” prior?

a) I think it’s fine to change priors based on data.

b) I’ve appreciated answers you’ve given when they’re detailed. I even read your blog – I don’t always agree, but I don’t always disagree.

c) My request for a proof was (reasonably) genuine. Below I posted a paper by Wolpert of ‘No free lunch’ fame. There and elsewhere he gives mathematical reasons why naive, overly strong Bayesian claims don’t hold up. I tend to agree with these, even though I still use Bayes, just like Wolpert.

I just don’t think the philosophical claims made for Bayes hold up. It ends up with hand-waving about ‘high probability regions’ and ‘opportunities to learn’ when the model is wrong etc.

I think you have a point if the objective is to be confident that your uncertainty interval covers the actual value and one propagates the interval through any decision process. If the interval is overly wide, that just gets propagated. Doing so, one is technically not wrong that the true value is contained in the overly wide interval.

The problem comes when you use the interval itself as a basis for decisions, as is effectively done for hypothesis testing. Then whether the range is wide or narrow matters a lot.

I think part of the problem is that statistics has a tradition of making decisions and assessing confidence in projections based on inference procedures. For example, any (Bayesian or frequentist) hypothesis testing method.

I think this view makes sense, and it falls in line with your paper with Cosma Shalizi and predictive checking. Just as the “true likelihood” matches the “true data-generating process,” the “true prior” enables the prior predictive distribution to match a true class of data-generating processes. As you put it, this makes sense when the problem is embedded as part of a class of problems. But when there is really only one problem, the true prior is just a point mass, reducing to the true likelihood. So I don’t see the problem with the speed of light example per se.

From the point of view of frequentist decision theory, the notion of a true prior is not unlike the notion of a true loss function. There is a well-defined sense in which one loss function may better represent someone’s ranking over decision rules (taking the hypothesis/parameter to be fixed), just as there is a well-defined sense in which one prior may better represent someone’s ranking over admissible decision rules (allowing the hypothesis/parameter to vary). To the extent that statisticians can debate the appropriateness of a ranking over decision rules (taking the hypothesis/parameter to be fixed), they can also debate over the appropriateness of a ranking over admissible decision rules (allowing the hypothesis/parameter to vary). To the extent that such disagreements are not rationally resolvable in the one case, they are not rationally resolvable in the other case either.

The purposes of specifying a loss function are different from the purposes of specifying a prior, but there is a symmetry in the role that they play such that the criteria we use to evaluate one are useful to evaluate the other. The mathematical formalism makes this symmetry plain, but I think that simply reflects the conceptual symmetry, which means that the notion of a true prior makes about as much sense as the notion of a true loss function, to whatever extent that makes sense in a given context.

“we can understand the true prior by thinking of the set of all problems to which your model might be fit”. I actually like this, it is essentially the view behind PAC-Bayes learning framework used in the machine learning community, which I initially didn’t like but came to terms with. However, this is still subjective, as defining a “problem space” is not something everyone will agree with (that’s not fundamentally an issue for me, as I don’t agree we can really get rid of subjectivity, be it on a prior or likelihood).

‘True prior’ doesn’t make sense to me, at least not in the way that I understand the word ‘true’ (I’m assuming that it is used in a similar way as in ‘true parameter’). While there are different ways to define belief in a Bayesian sense, I don’t see how the priors presented in the OP are not just the statistician’s beliefs based on a certain degree of knowledge, leading to particular assumptions. In the simple example, the urn model is still a belief, albeit a justified one (“often a reasonable model” sounds like a belief to me).

Perhaps I’m defining the word belief in too broad a sense as to make it practically meaningless, but I can’t see how a prior can be ‘true’ in and of itself.

OK, here’s an example. I am trying to deduce the average weight of male undergraduates at UC Berkeley, based on weighing a small number of them. The “true prior” in this case would be the distribution of weights of all of the male undergraduates at UC Berkeley.

I’d be very interested in hearing an explanation of why that is NOT the true prior distribution!

+1 This is exactly how I’d understand the term “true prior”.

Let m = weight of male undergraduates at UC Berkeley. Let K = “the known (frequency) distribution of weights of all such undergraduates”.

Then in this case the prior P(m|K) is a delta function about the true value. Why? Because given K you can calculate the answer you need directly, without any statistics at all. So even in the extreme case where you knew K, the shape of the prior P(m|K) is extremely different from the shape of the (frequency) distribution of weights of all such undergraduates.

In more realistic scenarios, we might have K1=”strong but partial knowledge about average weights of undergrads in California”. In such a case, P(m|K1) might be a distribution concentrated about a range such as (160lbs, 200lbs).

Or we could be generally pretty ignorant, so we might have K2 = “general knowledge about humans”. In this case P(m|K2) might be a distribution concentrated about a range (100lbs, 300lbs).

Any notion of “True” prior, whether yours, Gelman’s, or frequentist’s implicitly contains the notion that if you deviate dramatically from the “true” prior something really bad happens. But it’s a mathematical fact that this isn’t the case. All three priors I just mentioned differ considerably from your “obviously” true prior and yet they will all work just fine in their respective problem domains.

I meant “Let m = average weight of male …”

Well, if what you are interested in is the average weight, then why isn’t the true prior (or true posterior, for that matter) simply the distribution that assigns all its probability mass to the true average weight of the population? This prior seems to me to have at least as much of a claim on being the “true prior” as the one you mention.

The “true prior” for the average weight is, yes, a delta function at the average weight.

The “true prior” for the distribution of weights is the actual distribution of weights.

Ask me another!

If you were trying to deduce the median weight, would the “true prior” be the distribution of weights?

If you were trying to deduce the maximum weight, would the “true prior” be the distribution of weights?

If you were trying to deduce the standard deviation of the weight, would the “true prior” be the distribution of weights?

In every case, the true prior distribution of male undergraduate weights is the actual distribution of undergraduate weights. The true prior of the point value that you’re interested in would be a delta function at the correct answer for that point value.

Ok. I thought you meant the prior for the parameter (in this case the average). Which is what I think the “true prior” in the post is about (“the prior for a single parameter”, etc). Or what I thought the post was about, because I understand it less and less.

I’m going to go with “*a* true prior distribution” is any distribution in which the actual parameter value we will converge on in the limit of large data is in the typical set of the prior. Any other definition is going to be seriously flawed I think.

Though I think there’s a confusion in terminology.

If K is a state of knowledge then p(theta | K) is “a true prior for theta given K” when it expresses the knowledge we have in K reasonably well.

K could be false knowledge though. So in that case, is it a “true” prior? It truly expresses K, but K is false information.

How about this: if K contains our knowledge about theta and p(theta | K) expresses the knowledge K approximately correctly, then if the theta we converge on in the limit of large data is in the typical set of p(theta | K) then K is true knowledge.

All we’re trying to do is reason from hopefully true (but usually nonspecific) knowledge K to more specific true knowledge K2 by collecting data and analyzing it in the context of a model (a likelihood).

agreed but see my above anon response to Laplace.

What does “the set of all problems to which your model might be fit” mean?

Let’s say I want to estimate the natural mortality for bluefin tuna in the northern Pacific Ocean. Does the “set of all problems” include the estimation of natural mortality for bluefin tuna in other areas? The mortality of yellowfin tuna? Of dolphins? Of any fish or mammal in any body of water? Of any living being?

Carlos:

Suppose that for this problem you have made the relevant parameters scale-free and have assigned normal(0,1) priors to them. Then the set of all problems etc is the set of all problems for which you would assign normal(0,1) priors.

Do you mean all the problems for which you would assign normal(0, 1) priors if you were assigning priors using some rational method (if so, which one?), or do you mean all the problems for which you would assign normal(0, 1) priors if you were to just follow your particular psychological inclinations in assigning priors to problems?

Olav:

I mean all the problems for which I would assign normal(0,1) priors in my data analysis, or perhaps all the problems for which I would recommend assigning normal(0,1) priors. It’s not about psychological inclinations, it’s about what I or other researchers actually do.

Thanks, I think I understand what you mean now. I’d say that what you are looking at is whether the prior is well calibrated, more than “true”. If I get your point, you’re thinking of how well the distribution covers the true value of the parameter in the long run (over different experiments all using the same prior).

The problem is that there are infinitely many “true” priors in that sense. Let’s say I know that the precise value of the parameter of interest is x0. I can define x1=rnorm(1,mean=x0) and use as prior normal(x1,1) and it will be “true” (in the long run, I’m assigning the right probability to the true value of the parameter). I can define x2=rnorm(1,mean=x0,sd=1000000) and the not-very-informative prior normal(x2,1000000) will also be “true”. The best prior I can choose is of course the one that better represents my knowledge. Which might not be well calibrated, but how would I know?
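That both of these randomly recentered priors are calibrated in this sense can be verified by simulation. A sketch (the speed-of-light value, seed, and replication count are arbitrary choices of mine): each replication draws a new center x1 = x0 + sd*e and checks whether the central 50% interval of normal(x1, sd) covers x0.

```python
import random

def coverage(x0, sd, reps=20000, seed=1):
    """Each replication builds a 'true' prior N(x1, sd^2) with x1 = x0 + sd*e,
    e ~ N(0,1), and checks whether its central 50% interval covers x0."""
    z50 = 0.6745                       # 50% central quantile of N(0,1)
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x1 = x0 + sd * rng.gauss(0.0, 1.0)
        if abs(x0 - x1) < z50 * sd:    # x0 inside x1 +/- z50*sd
            hits += 1
    return hits / reps

informative = coverage(299792.458, 1.0)    # prior normal(x1, 1)
diffuse = coverage(299792.458, 1e6)        # prior normal(x2, 1000000)
```

Both coverages come out close to 0.5: calibration alone cannot distinguish the sharp prior from the nearly useless one.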

I find this explanation hard to wrap my head around because you define the true prior conditional on the set of problems. But then the set of problems implicitly uses the model (i.e., including the prior, but perhaps not the “true prior”?).

Maybe it’s just my statistics training but I think it makes sense first to define the set of problems, and second to define the model. So the set of problems is however one decides to embed the original problem—whether that be the estimation of natural mortality for other tuna, for other fish, for anything in a body of water, etc. Each of these problems induces a different true prior, as there is a different “true distribution of parameter values for the class of problems to which this model will be fit”.

I tend to agree – you might also be interested in the similar views expressed in David Wolpert’s (1996) ‘Reconciling Bayesian and Non-Bayesian Analysis’ (link – https://ti.arc.nasa.gov/m/profile/dhw/papers/51.pdf ).

Interesting paper, but…

“The implicit view in this extended framework is that inference is a 2-person game pitting you, the statistician, against the data-generating mechanism, i.e. the universe. Your opponent draws truths t at random, according to P(t), and then randomly produces a data set from t, according to P(d|t)” (p. 3).

Is this supposed to be understood metaphorically? Clearly, the universe is not literally “drawing truths” at random according to some probability distribution. But if it’s supposed to be understood metaphorically, I still don’t understand how the metaphor is supposed to link up to reality.

I suppose it depends on whether you subscribe to determinism or not, but I don’t think the main point hinges on this. Rather (something like) – you always work with a guess which is not the truth and this leaves sufficient room to invalidate Bayes optimality claims. All Bayesian (and non-Bayesian) attempts to sweep this under the rug constitute additional assumptions not backed by any mathematics.

The point of imagining a ‘true prior’ is (to me) to make sure you keep this in mind. I don’t think we ever actually have access to the true prior in full detail – it’s there to loom over you and keep you from making claims that are too strong (and/or check your models and look for ‘opportunities to learn’). This may or may not be the same as Andrew’s point but I think there is a similar theme.

Ojm:

Yes.

Is it April 1 in another calendar???

:-)

I think the issue Andrew is getting at is that his concept of “statistics as a science of defaults” allows him to ask the question: “how well do these defaults work on average?”

But, I really don’t see “statistics is a science of defaults” as a good framing of statistics. I want a particular custom analysis for every scientific question, based on real scientific information.

In that sense, every one of my priors is informative (to the maximum extent that I feel is justified by my knowledge set, which isn’t necessarily all that informative), and my main concern is whether I’m making unjustified assumptions in the *likelihood* function, which I see as describing the physical process that I’m trying to model.

And yes, I consider voting, and other social sciences, as *physical processes*. Sure, I don’t try to write down a Lagrangian, but when analyzing a social-science problem I would more likely think along the lines of “if you’re something like X then you tend to have certain life experiences and certain personality traits Q, and those things tend to make you do Y more” rather than “what’s the effect of high income on voting preference.”

But, honestly, I do tend to stick to more directly physical issues, damage to structures, biological molecule expression patterns, mechanical properties of materials, etc though I have done a few things related to economic decision making.

Daniel:

Perhaps we should distinguish between the *field* of statistics and the *practice* of statistics. The field of statistics is all about defaults because it is about general practices, hence a frequentist orientation. The practice of statistics is about solving individual problems, hence a Bayesian orientation. From this perspective, an interplay between Bayesian and frequentist is central to statistics, as we move back and forth between general recommendations and specific practice.

So is this post entirely philosophical or is there stuff here that guides us in selecting a “good” prior for practical problems?

Rahul:

The post is more about evaluation of inferences than choosing priors. The point is that inferences can be evaluated with respect to a true prior that represents some underlying distribution. Evaluations of Bayesian methods are typically performed either with respect to the assumed prior, or conditional on a single true value theta. I think it can make sense to evaluate averaging over some true prior, p(theta).

To the extent this post guides us in selecting a good prior in a practical problem, I think you want to think about embedding your problem in a larger class of problems. That’s one reason I like scale-free parameterizations. Also consider the idea of setting a prior for a parameter in some problem in psychology or medicine: here we have a long history of small effects.

Andrew:

You write: “for a model that is only used once, there is no true prior.”

Naive question: For your example regarding the speed of light, why isn’t the “right” prior for the speed of light the actual, natural, universal speed of light, whatever that is?

Why isn’t the true prior simply whatever turns out to be the actual parameter in nature (even though we may not know it when we set up the model).

Is that not a constructive way to define things?

Rahul, if the “true prior” were well defined and unique, I can’t see how it could be anything different than the actual value of the parameter (assumed unknown but defined), as you suggest. It’s even worse, because it seems a prior distribution can be considered “true” whenever it properly describes the distribution of the parameter (i.e., in the frequentist long run the distribution of the parameters corresponds well to this “true prior distribution”). It is therefore not unique.

From any “true prior” you can get an infinite set of (less informative) “true priors” by doing a convolution with an arbitrary distribution and shifting it back by a random amount generated using that distribution. For example, from the “true prior” that puts all the probability mass on the actual value of the parameter {x0}, we can derive another, bimodal “true prior” which is with 50% probability {x0-1, x0} and with 50% probability {x0, x0+1}. This prior is “true” in the frequentist sense because half of the time the parameter will fall on our low guess and half of the time on our high guess, consistent with the distribution.
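The two-point construction can be spelled out as a quick simulation (a sketch; the value of x0 and the replication count are arbitrary choices of mine):

```python
import random

def shifted_prior_support(x0, rng):
    """Convolve a delta at x0 with a fair coin on {0, 1}, then shift back by an
    independent draw u from the same coin: the realized two-point prior is
    {x0-1, x0} when u = 0 and {x0, x0+1} when u = 1, so it always gives x0
    probability 1/2."""
    u = rng.randint(0, 1)
    return (x0 - 1 + u, x0 + u)

rng = random.Random(42)
reps = 10000
low_hits = sum(shifted_prior_support(3.0, rng)[0] == 3.0 for _ in range(reps))
frac_low = low_hits / reps   # x0 is the low support point about half the time
```

Each realized prior assigns the true value probability 1/2, and which guess (low or high) is correct is itself a 50/50 event, which is the frequentist sense in which the blurred prior remains “true”.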

Carlos:

No, you’re talking about the true parameter value. I accept the idea of a true parameter value, indeed I think this concept is fundamental to Bayesian statistics. The above post is about the true prior distribution, which is something different.

Andrew: it’s clear I don’t understand your point then. I can only reiterate Rahul’s question:

If there is a true value for the speed of light “c”.

If I happen to choose as prior the point mass distribution at “c”.

Is this prior “true”? If not, why not?

My impression of what Andrew means by “the true prior” is more or less along these lines: if 1000 people analyzed a given problem, they would each choose some prior, and maybe the average over these priors is some kind of construct that represents what “society” knows about the problem.

Or alternatively, if you have 1000 different stats problems and in each one you would choose a given prior p(foo) which is the same mathematical function in each case, then how well would this perform on average across the many different problems you apply it to?

or something like that

If you’re writing a computer program (aka a “default set of statistical rules”) to be applied without supervision, you could ask something like “how well does this rule work on average”?

It’s not a problem I’m particularly interested in, but I could see how someone whose profession was “recommender of generic statistics techniques” would maybe want to evaluate this sort of thing.

@Daniel:

Why would one ever call that the “true” prior?

Maybe a “consensus prior” or “most popular prior” or “expert-average prior” or “democratic prior”.

There’s a difference between truth and a popular vote. So I doubt Andrew meant what you write.

Also Carlos:

Suppose there is a true value for the speed of light, “c”… Do you KNOW this value? If you do, you need not do any statistics on it. You can write it down as a constant in your model. That’s the same as putting a point-mass prior on it.

But suppose you don’t know it exactly. Then, if you write down a point mass prior on it, you’re not being TRUE TO YOUR KNOWLEDGE. If you happen to get lucky and write down the exactly correct value, even if the value is correct, the prior does not truly represent what you knew. It’s a true fact, but not a true representation of your knowledge.

As for Andrew’s “true prior,” it isn’t clear what knowledge he’s conditioning on. If K is some set of knowledge and he thinks p(param | K) = something_or_other, then either K really does tell him that something_or_other is his prior, or it doesn’t.

I think if he could explicitly say something like “the true prior is the prior we would arrive at if we got lots of knowledgeable people together and elicited as much information about the process as possible and encoded it as accurately as possible” or something like that, then the “true prior” means “true” in the sense of averaging over all the knowledge from all the people in the world or something like that.

In any other case I simply can’t make heads or tails of this idea.

Daniel:

Yes, my true prior is conditioning on some intermediate state of knowledge. It’s like what they call an “oracle” in frequentist statistics. I agree that it’s not precise, but in many contexts I think it’s helpful to think of a true prior. The speed of light example is not the best: as I wrote above, “for a model that is only used once, there is no true prior.”

Daniel, imagine that I do know the value, which allows me to get my precise prior (which of course is trivial, but if we can’t even understand this “true prior” in an extremely simple case, how could we understand it in the general case?). Imagine that you don’t know the exact value, so you use a different prior. Is the “true prior” for my problem the same as for your problem? Is the “true prior” more true than my zero-entropy prior?

As you know already, I agree with your definition of the prior as the representation of the knowledge that we have. But, like most people around here, I don’t understand what this “true prior” (which does not represent your state of knowledge, is unique and common to everyone, and has a frequentist interpretation) can be, or what the point is in postulating its existence.

@Daniel

So is the “true prior” the same as what people refer to as a “consensus prior”?

@Rahul:

You’ll have to ask Andrew, I don’t understand what he really means by “true prior”. I think it has something to do with the application of defaults, and how well those defaults work for the class of problems where those defaults are used, but I don’t really know.

This needs a motto. How about:

“you don’t need a perfect prior, just one true to the evidence it’s conditioned on”

But then that holds for any distribution.

There is quite a bit to be said for this suggestion, and also a few things against it.

A major advantage is that this approach makes the prior an (in principle) testable and falsifiable part of the overall model.

Also, very often, in applied Bayesian work, I see the sampling model interpreted and discussed as if it has a frequentist interpretation, and if this is indeed intended and the prior is interpreted in a different manner, probability reasoning becomes problematic because two different interpretations are muddled together in the same overall model. Interpreting all in a frequentist manner avoids this issue, although Bayesians aware of the issue can avoid it as well in other ways by being consistently subjectivist or, e.g., Jaynesian.

I also agree that this may help practitioners in many cases to think about and argue their priors, which is often not satisfactorily done in Bayesian practice.

However, I think that it is too naive to claim that such a true prior “exists”. People have already mentioned issues with defining the set of relevant problems well. Right now the definition is imprecise and as such open to interpretation. Then there’s this smell of circularity from using priors in the definition of the true prior, which may (or may not) go away when things are made precise. I doubt that the definition actually can be made precise. Is there a true distribution of true parameter values of situations of any kind? Not if one doesn’t believe that there is any such thing as a “true model”. I am with Laurie Davies on this one: in a well-defined sense, models can only ever approximate data (data are observable, “underlying truths” are not), and there is more than one model (and usually more than one parameter value for a given parametric model) “fitting” any dataset.

I think that “true models” and “true parameters” are potentially useful thought constructs. What is defined here is a mode of thinking about and constructing a prior, and as such this is fine, but it “exists” only by means of being properly constructed by human thought, and this will be possible in several ways for any given problem.

Christian:

I like these comments. I’d say that a true prior exists to the same extent that a true data model exists: it is defined in terms of some hypothetical set of replications.

Hi Christian,

As you may or may not know, I’m also a fan of Laurie’s work. But one thing is ambiguous to me – when you say ‘Not if one doesn’t believe that there is any such thing as a “true model”’, I would add that this doesn’t mean that there isn’t any such thing as the ‘truth’.

In fact I would say that things such as models have the property ‘not true’ by virtue of not having the property of being true. The idea of a ‘true prior’ in fact (to me) follows the standard non-constructive arguments usually associated with existence statements – it is the thing that all models are only an approximation of. An irrational number is a thing which is not expressible as a repeating decimal expansion etc.

Some people don’t like such non-constructive existence proofs but to me they capture the idea that reality always escapes our representations. If it didn’t then there would be true models!

Laurie’s approach (in my reading) is to switch attention from approximating the ‘truth’ to approximating the given data. As he notes, however, the best approximation to a given dataset is simply to reproduce that exact dataset. He then introduces the idea of ‘looking like’ the data based on a reduced set of ‘features’ of the data.

I like this idea of ‘data features’. It seems, however, that the point of choosing these features is (implicitly) to define a reference population with respect to which the analysis is being carried out. This seems much the same as how Andrew defines reference populations and leads back to some slightly ineffable but still inescapable ‘truth’ to which we are comparing a given dataset.

There are various levels at which we could be interested in whether a “truth” exists. Davies states, regarding his favourite copper example, that he thinks that there is indeed a true quantity of copper in the water. But this doesn’t define a true distribution of measurements that could be approximated by a model. It is hard to define what this true distribution would be. For example, if we talk about one-dimensional measurements, a well-defined distribution (that could be approximated by a modelled distribution) would have to refer to *independent* measurements, otherwise the distributional shape information is muddled with the dependence information. One could then argue that the assumption of independence itself is only approximately true (at best), but again in order for this statement to make sense there needs to be a definition of what is approximated. Everything may depend on everything else. Of course one could again have a model for this but if there is a complex (i.e., irreducible to the same dependence mechanism such as ARMA holding at all times) dependence structure that involves all data, the model is a model modelling all the data there is as a single realisation. What’s the frequentist truth that is approximated by this? I don’t think I’ll accept existence statements about such a thing in the foreseeable future.

There are situations in which the idea of a probabilistic truth can be argued in a somewhat more convincing way, particularly where there is a finite population and a well defined sampling mechanism.

For the “population” of studies to which Andrew refers I don’t think there is anything clear and easy to be had though. The study researcher A is currently doing isn’t in any well defined way drawn randomly from any well defined population of all studies in the first place. Then the population may change over time etc. etc. It has got to be an idealisation, and a fairly bold one at that.

Davies can give a proper formal definition of what it means to approximate data with a probability model. For any non-data “truth” I don’t see this happening. The only other thing that can clearly be done is to approximate a more complex model with a simpler one, but then still how does the complex one relate to “the truth”?

Personally I’d probably prefer to think of the population of studies of “the same kind” in some to be defined sense (the details of which could make all the difference), not just all studies for which someone could use the same prior. This would allow existing prior information specific to the study at hand to be incorporated. It’s again boldly idealised and not well defined either, but really, flexibility is probably not a bug but rather a feature here.

By the way, both subjectivists and Jaynes-type Bayesians (D. Lakeland?) could think that what I wrote in this posting are very good reasons not to use the frequentist interpretation of probability at all, but in fact they face very similar problems. The subjectivist has to “approximate” a “true” state of individual uncertainty, and if I use what D. Lakeland wrote about how the prior is fine if it just catches the true parameter value, we’re in pretty much the same place.

If I understand what you’re saying, I don’t have any problem with it. I too fail to see any meaning to “the true model” except maybe in extremely limited circumstances (for example the coulomb interaction model between two electrons?)

To me “the true prior” sounds a lot like “the glarb snabnitch” I just don’t have the slightest idea what it could possibly mean. I am not satisfied in any way with Andrew’s idea.

Andrew makes the distinction between statistics as a field, making recommendations to naive scientists who will use default methods, vs. statistics as practiced by a knowledgeable statistician doing a particular analysis of some particular problem. I find this basically unsatisfying, as I don’t recommend that scientists go out and apply default methods to anything. It’s like saying we could apply the default logical schema “IF A is true then B is true; A is true; therefore B” by simply plugging in statements for A and B, ignoring whether any portion of the statement fails to make sense (such as “A is true”), and just moving on.

When I do Jaynesian style Bayesian reasoning, I’m ever mindful that my model could be substantially wrong. I’m especially mindful of it in the *likelihood* function, because that’s how I extract information from the data, and if I use one which has clearly wrong features, I will extract wrong information.

“The true prior is the distribution of underlying parameter values” What does this distribution represent samples from, replications of the study (or “similar” studies) with the same sample size? But then the true prior can be no more certain than your study will allow (i.e. low-power study will keep you uncertain), thus the true prior is not solely about the parameter but about your design, which seems wrong. You can of course have more prior knowledge than a single study might indicate.

I’m led to thinking that this must be implied by Andrew’s argument because, of course, the population quantity you are estimating, as long as the reference class is fixed, is often fixed too. Like the mortality of bluefin tuna: if you measure the entire population, you’ll get the exact quantity. And this point-mass distribution for the true value is pretty useless as a “true prior”, even as a thought construct. So I’m assuming that is not what is meant; thus it seems the “true prior” is sensitive to study design, which again, also seems wrong…

As Christian Hennig and ojm have stated (more or less), you can, if careful, use the word true when talking about the world, but not when talking about parameters. If I have data on the quantity of copper in a sample of drinking water, I am prepared to believe that there is indeed a true value, so that I can reasonably talk about the true amount of copper cu in the water. Suppose now that I model the data as i.i.d. N(mu,sigma^2). I have to relate the true amount of copper cu, which exists, to a construct of my mind, N(mu,sigma^2). I can do this by provisionally and speculatively identifying the true amount of copper cu with mu. I can make no sense of referring to mu as the true parameter. It would only make sense to me if, in the world, the amount of copper cu in the water came attached with a Gaussian distribution with mean say mu_1, which would then be the true parameter value of mu. This seems to me to be nonsense.

But it goes further. I can equally well use the log-normal distribution with parameters (mu,sigma) which are completely different from the parameters (mu,sigma) of the normal distribution. I must now provisionally and speculatively identify cu with some function h(mu,sigma) of the log-normal distribution. Which function do I choose? Most of the discussants talk about priors over parameters. However, for the copper example my prior is, or should be, a prior over cu, the true amount of copper in the water. For the Gaussian model and the given identification I can simply use the prior for cu. This does not work for the log-normal distribution. If I am to be consistent, my joint prior over the parameters of the log-normal distribution, coupled with the identification I am using, must result in my prior over cu. When talking about well-defined numerical properties of the world (the amount of copper, the speed of light), the priors should be about these values.

There are many different models with different parameters, functions of which are provisionally and speculatively identified with the true values of the real world. There is no sense in which the parameter values can be called true.

lauriedavies2014, I’m not sure I follow. If cu is the actual amount of copper in the water, and N(mu, sigma^2) is your model of the measurements, then the true value of mu is simply the value of mu — call it mu* — such that mu*=cu, no? I mean, your model of the measurements may well be false, because it incorporates false auxiliary assumptions, but I don’t see why that necessarily implies that the parameter of interest (i.e. mu) itself isn’t perfectly interpretable and can’t have a perfectly sensible “true value.”

Olav, If cu is the actual amount of copper in the water, and LN(mu, sigma^2) (log-normal) is your model of the measurements, then the true value of mu is simply the value of mu — call it mu* — such that mu*=cu, no? I mean, your model of the measurements may well be false, because it incorporates false auxiliary assumptions, but I don’t see why that necessarily implies that the parameter of interest (i.e. mu) itself isn’t perfectly interpretable and can’t have a perfectly sensible “true value.”

Olav, try it another way. I provisionally identify the mean of the LN distribution with cu. The mean of the LN(mu,sigma) distribution is exp(mu+sigma^2/2). This gives mu+sigma^2/2=log(cu), and all such pairs (mu,sigma) which satisfy this are true.
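Laurie's curve of equally "true" pairs can be checked directly; the value cu = 3.7 below is an arbitrary stand-in:

```python
import math

cu = 3.7  # hypothetical "true" amount of copper, arbitrary units

# Every pair (mu, sigma) on the curve mu + sigma^2/2 = log(cu) yields a
# log-normal distribution whose mean is exactly cu, so the identification
# singles out a whole curve of parameter values, not a single point.
pairs = [(math.log(cu) - s**2 / 2, s) for s in (0.1, 0.5, 1.0, 2.0)]
means = [math.exp(mu + s**2 / 2) for mu, s in pairs]
print(pairs)
print(means)  # every entry equals cu
```

Four genuinely different parameter pairs, one and the same implied amount of copper.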

OK, point taken. If you identify cu with the mean of an LN distribution, then only this mean — and not the parameter values of the LN model — can be said to be true. But you seemed to be making the stronger claim that values of model parameters can *never* be said to be true or false, even in the N model, and that claim seems too strong.

Olav, suppose you really could generate i.i.d. normal data, which you can if you forget the “pseudo” in pseudorandom and ignore the finite precision of your results. If you do this for N(0,1) then I would accept that the true parameter values are (mu,sigma)=(0,1). If the copper were indeed in truth generated by such a mechanism there would indeed be true parameter values, otherwise not. A bag contains 20 white and 20 black balls. You draw a sample of size 30 with replacement. You model the number of white balls as binom(p,30). Can you make sense of p=0.5 being the true value? I think not. The parameter p belongs to a model, and the model specifies i.i.d. drawings. How do you achieve this? There is nothing you can do to convince me that the drawings really are i.i.d., so there are no true parameter values even in this simple situation.
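The urn point can be illustrated in simulation (the "persistence" mechanism below is invented for illustration). A drawing process can favour neither colour marginally and still fail to be i.i.d.; such a process keeps p = 0.5 but is over-dispersed relative to the Binomial(30, 0.5) variance np(1-p) = 7.5, so p = 0.5 is "true" only under the i.i.d. assumption:

```python
import random
import statistics

random.seed(0)

def draw_counts(n_trials, n_draws, persist=0.0):
    """Count white balls in n_draws with-replacement draws from a 50/50 urn.
    With persist > 0, each draw copies the previous colour with that extra
    probability: marginally still 50/50, but no longer i.i.d."""
    counts = []
    for _ in range(n_trials):
        prev = random.random() < 0.5
        white = int(prev)
        for _ in range(n_draws - 1):
            cur = prev if random.random() < persist else random.random() < 0.5
            white += cur
            prev = cur
        counts.append(white)
    return counts

iid = draw_counts(5000, 30)               # matches Binomial(30, 0.5)
dep = draw_counts(5000, 30, persist=0.6)  # same marginal p, different law

print(statistics.mean(iid), statistics.variance(iid))  # mean ~15, variance ~7.5
print(statistics.mean(dep), statistics.variance(dep))  # mean ~15, variance much larger
```

Both mechanisms give the same long-run fraction of white balls; only the first is described by the binomial model.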

I’d like to distinguish three things:

1) The true value of the quantity of copper. This is actually some integer number of atoms, but of course we’re likely to measure it in moles and pretend it’s a continuous quantity, since it has at least 20-odd legitimate decimal places at that scale.

2) The value to which in the limit of very large data and the application of Bayesian reasoning, your model’s parameter value will converge numerically. In many cases we intend to set up our models for the purpose of having the parameter value converge to the same value as in (1). But in your lognormal example it isn’t the parameter value that converges but some other function of the distribution (the average?).

3) The value of the parameters which would produce a frequency distribution for the measurements exactly equal to (or extremely close to) the frequency distribution implied by assuming a *generative model* of IID draws from a random number generator. In basically every scientific case, this value DOES NOT EXIST, because the data are not realizations from a random number generator, and I think this is Davies’ point above.

As for my own interpretation of what’s going on, I simply reject that I am (in the general case) writing down likelihoods which are based on modeling data as the output of a random number generator. So I have no misconception that “true” as in (3) exists, and I fully agree with Davies that there is no way to set the mu and sigma values for a normal distribution to ensure that his copper measurements take on a frequency distribution equal to that normal distribution. They’re just not values taken from a normal RNG!

For me, the likelihood is a function that expresses a notion along the lines of:

“if x is a *vector* of data and params is a vector of the numerical values of actual quantities in the universe, then p(x | params) which is a function I choose to represent some scientific knowledge that I have, is a measure of which values of the vector x I think are reasonable to arise from this scientific process conditional on “params” being correct.”

Note that I don’t actually restrict myself in my choice of likelihood to the joint distribution being a product as is the case in the “independent draws from a random number generator” model. For example, I’ve used gaussian processes for time-series data in which I consider the whole vector of data as a single n dimensional point. There is no notion of “replication” there. I have one point in N dimensional space, there will never be another point, and everything I’m doing in such a model is using my hopefully approximately correct scientific knowledge to find out about the numerical values of real quantities in the world. Though, sometimes my parameters do NOT represent real quantities in the world, that is also a case we’ve discussed here on the blog. For example in my dropping crumpled paper balls example, there is no sense in which there exists a correct radius of the spherical ball… since the ball isn’t spherical!!

reference for the dropping balls discussion:

Here: http://statmodeling.stat.columbia.edu/2013/07/25/bayes-respecting-experimental-design-and-other-things/

and several posts at my blog:

http://models.street-artists.org/?s=dropping+ball

I definitely agree that there are multiple senses of the “true parameter value” at play here. There is a fourth distinction to be made, I think (unless this is what you had in mind with your 1):

4) If we assume that mu is the parameter in the model that is intended to represent the quantity of copper, then the true value of mu is simply the value of mu that picks out the actual amount of copper. This value of mu may or may not be the value of mu that the Bayesian posterior will converge on in the limit (though ideally it should), and it may or may not be the value of mu that (together with other parameters) exactly replicates the frequency distribution of the data (though it probably won’t).

1) Is the actual value out there in the world… this is the true value of the quantity of copper. It’s unobserved, but it doesn’t have an existence as a parameter. The parameter is a concept inside the model, inside our heads.

4) Comes into play when as model-builders we want to identify a particular symbolic quantity in our mathematical model with the quantity that exists in the universe. So in the case of the copper, we may wish to construct a model in which a parameter we give a name to, let’s call it “mu” is identified with a quantity that exists in the world, whose numerical value is unknown, call it “cu”.

I think most of us who do Bayesian modeling in which we mentally identify a particular symbolic parameter with a real quantity in the universe would call our model “bad” if after sufficient data collection the posterior for that parameter was far away from the correct value that exists in the universe.

I however, would not call my model “bad” if the histogram of a gazillion measurements of cu did not have a normal distribution shape. I explicitly DENY that my likelihood functions are describing IID draws from RNGs. They’re describing *what I know* about the relative values that are typical for the measurements, not what the frequencies are.

Olav, Daniel Lakeland: I agree with Daniel Lakeland that 1) The parameter is a concept inside the model, inside our heads. I wrote ‘… a construct of my mind N(mu,sigma^2)’. I also agree that a prior over the true value of copper does not require a model. My prior over the real world makes no mention of randomness or chaos, I have this without the model. Once I start modelling and identify some function of the parameters with the values in the real world, then the resulting prior for this function should be my prior over the real world. If the result is bad then I may change my model but I am unlikely to change my prior over the real world, although of course I may just.

The interesting and amazing thing is that when we write down p(data | params) and call it a “likelihood” even if the data is not generated by a random number generator with a frequency distribution equal to this function for some particular set of params, you can in many cases still get good inference for real quantities in the world.

Once someone who is trying to understand Bayesian reasoning understands this, it clarifies many things.
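A minimal sketch of that claim (toy numbers throughout): the data below are exponential, nothing like normal, yet the flat-prior normal-likelihood posterior for the location parameter is centred at the sample mean, which converges to the true mean anyway:

```python
import random
import statistics

random.seed(0)
true_mean = 2.0

# Data are emphatically NOT normal: exponential with mean 2 (skewed, positive).
data = [random.expovariate(1 / true_mean) for _ in range(20000)]

# With a flat prior and a normal likelihood N(mu, sigma^2), the posterior
# for mu is centred at the sample mean regardless of the data's true shape.
post_centre = statistics.mean(data)
print(post_centre)  # close to 2.0
```

The histogram of these data will never look normal, yet inference for the real-world quantity (the mean) comes out fine, which is exactly the separation being drawn here between frequency modelling and information extraction.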

Of course, it doesn’t happen ‘automatically’, as you’ve said before you need some regularization, some modeling choices that are reasonable, justified. I know you are fond of Fisher information, and I prefer maximum entropy. Did you ever read my blog post on maximum entropy in the context of nonstandard analysis?

http://models.street-artists.org/2016/01/14/differential-entropy-and-nonstandard-analysis/

Daniel Lakeland: Yes I did but then other things got in the way. Your latest blog post on the subject looks new to me. I’ll have another go.

Part of the issue here seems to be the relationship between uncertainty or partial information on the one hand, and actually existing reference class populations on the other.

I like this “definition” of probability from Turing, which appears to shamelessly mingle the two:

In this example — updating from Pr(Hitler lives to age 70 | Hitler is male) = 0.7 to Pr(Hitler lives to age 70 | Hitler is male; Hitler is 52) = 0.52 — the prior, likelihood, and posterior all refer to (the evidence provided by) actually existing reference populations.

I like the Jaynesian approach to probability as partial information, but I don’t like his drawing of sharp lines between probability and frequency, because it seems to me that “partial information” is really defined by analogy with (symmetrical knowledge about / random sampling from) concrete finite populations.

Maybe the notion of “true prior” could be reduced to this: conditioning on “knowledge of astronomy K” can’t easily be “cashed out” with respect to a reference population. But when we condition on “Hitler is male” in the Turing example, the associated reference class population is obvious.
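The Turing example can be cashed out with a toy finite population (the counts below are fabricated so the conditional frequencies reproduce his two numbers, 0.7 and 0.52; only those frequencies matter):

```python
# A hypothetical reference population of 1000 males: 700 live to age 70
# overall, and among the 500 who are aged 52, 260 live to 70.
population = (
    [{"is52": True, "to70": i < 260} for i in range(500)]
    + [{"is52": False, "to70": i < 440} for i in range(500)]
)

def pr(event, given=lambda p: True):
    """Probability as a relative frequency within the reference class."""
    ref = [p for p in population if given(p)]
    return sum(event(p) for p in ref) / len(ref)

prior = pr(lambda p: p["to70"])                           # Pr(lives to 70 | male)
posterior = pr(lambda p: p["to70"], lambda p: p["is52"])  # ...given also aged 52
print(prior, posterior)  # 0.7 0.52
```

Updating on “Hitler is 52” just shrinks the reference class from all males to the 52-year-old males and recounts.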

Determining ‘the’ reference class does seem to be the crux of the issue. A direct treatment and appropriately skeptical conclusion about that problem is in: Alan Hájek 2006 ‘The Reference Class Problem is Your Problem Too’ Synthese. http://philrsss.anu.edu.au/people-defaults/alanh/papers/rcp_your_problem_too.pdf (Laplace would probably approve, if he could overlook that it was written by a philosopher.)