The traditional answer is that the prior distribution represents your state of knowledge, that there is no “true” prior. Or, conversely, that the true prior is an expression of your beliefs, so that different statisticians can have different true priors. Or even that any prior is true by definition, in representing a subjective state of mind.
I say No to all that.
I say there is a true prior, and this prior has a frequentist interpretation.
1. The easy case: the prior for an exchangeable set of parameters in a hierarchical model
Let’s start with the easy case: you have a parameter that is replicated many times, the 8 schools or the 3000 counties or whatever. Here, the true prior is the actual population distribution of the underlying parameter, under the “urn” model in which the parameters are drawn from a common distribution. Sure, it’s still a model, but it’s often a reasonable model, in the same sense that a classical (non-hierarchical) regression has a true error distribution.
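The urn model can be sketched in a few lines of simulation. This is a minimal sketch with hypothetical numbers (population mean 0, sd 1, measurement sd 2): the parameters are drawn from a common population distribution, and that population distribution is what plays the role of the true prior.

```python
import numpy as np

rng = np.random.default_rng(42)

# Urn model: J exchangeable parameters drawn from a common population
# distribution. That population distribution is the "true prior" here.
J = 3000                       # e.g. the 3000 counties
mu, tau = 0.0, 1.0             # population mean and sd (hypothetical values)
theta = rng.normal(mu, tau, size=J)   # the underlying parameters

# Each unit's data are noisy measurements of its own theta_j.
sigma = 2.0
y = rng.normal(theta, sigma)

# The empirical distribution of the theta_j approximates the true prior:
print(round(theta.mean(), 2), round(theta.std(), 2))
```

The point of the sketch is just that the prior is an ordinary population distribution, estimable from the ensemble, in the same way a regression's error distribution is.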
2. The hard case: the prior for a single parameter in a model (or for the hyperparameters in a hierarchical model)
OK, now for the more difficult problem in which there is a unitary parameter. Or a parameter vector; it doesn’t matter. The point is that there’s only one of it: it’s not part of a hierarchical model, and there’s no “urn” that it was drawn from.
In this case, we can understand the true prior by thinking of the set of all problems to which your model might be fit. This is a frequentist interpretation and is based on the idea that statistics is the science of defaults. The true prior is the distribution of underlying parameter values, considering all possible problems for which your particular model (including this prior) will be fit.
Here we are thinking of the statistician as a sort of Turing machine that has assumptions built in, takes data, and performs inference. The only decision this statistician makes is which model to fit to which data (or, for any particular model, which data to fit it to).
We’ll never know what the true prior is in this world, but the point is that it exists, and we can think of any prior that we do use as an approximation to this true distribution of parameter values for the class of problems to which this model will be fit.
3. The hardest case: the prior for a single parameter in a model that is only being used once
And now we come to the most challenging setting: a model that is only used once. For example, we’re doing an experiment to measure the speed of light in a vacuum. The prior for the speed of light is the prior for the speed of light; there is no larger set of problems for which this is a single example.
My short answer is: for a model that is only used once, there is no true prior.
But I also have a long answer, which is that in many cases we can use a judicious transformation to embed this problem into a larger class of exchangeable inference problems. For example, we consider all the settings where we’re trying to estimate some physical constant from experimental data and prior information from the literature. We summarize the literature by a N(mu_0, sigma_0) prior. In this case we can think of the inputs to the inference as being mu_0, sigma_0, and the experimental data, in which case the repeated parameter is the prediction error. And, indeed, that is typically how we think of such measurement problems.
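This embedding can be sketched in a small simulation, with all numbers hypothetical: each measurement problem has its own literature-based prior N(mu_0, sigma_0), but if those priors are well calibrated, the standardized prediction error is the same N(0, 1) quantity in every problem, and that is the exchangeable repeated parameter.

```python
import numpy as np

rng = np.random.default_rng(7)

# Many one-off measurement problems, each with its own literature-based
# prior N(mu_0, sigma_0).  (All numbers here are hypothetical.)
n_problems = 5000
mu0 = rng.normal(0, 10, n_problems)         # prior means vary by problem
sigma0 = rng.uniform(0.5, 3.0, n_problems)  # prior sds vary by problem

# If the priors are well calibrated, the true constant theta in each
# problem is distributed N(mu_0, sigma_0) for that problem...
theta = rng.normal(mu0, sigma0)

# ...so the standardized prediction error is N(0, 1) in every problem:
# this is the exchangeable quantity shared across the larger class.
z = (theta - mu0) / sigma0
print(round(z.mean(), 2), round(z.std(), 2))
```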
For another example, what’s our prior probability that Hillary Clinton will be elected president in November? We can put together what information we have, fit a model, and get a predictive probability. Or even just use the published betting odds. But in either case we are thinking of this election as one of a set of examples for which we would be making such predictions.
4. What does this do for us?
OK, fine, you might say. But so what? What is gained by thinking of a “true prior” instead of considering each user’s prior as a subjective choice?
I see two benefits. First, the link to frequentist statistics. I see value in the principle of understanding statistical methods through their average properties, and I think the approach described above is the way to bring Bayesian methods into the fold. It’s unreasonable in general to expect a procedure to give the right answer conditional on the true unknown value of the parameter, but it does seem reasonable to try to get the right answer when averaging over the problems to which the model will be fit.
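That averaging principle can be checked directly in a simulation sketch (a toy conjugate normal-normal setup with hypothetical numbers): a posterior interval from a default prior will not have correct coverage conditional on any single theta, but when theta is drawn from the population of problems the model is fit to, the average coverage comes out right.

```python
import numpy as np

rng = np.random.default_rng(0)

# A default Bayesian procedure: prior N(0, 1), one observation
# y ~ N(theta, sigma).  We check its frequency properties *averaged
# over problems*, i.e. with theta drawn from the population of
# problems to which this model will be fit.
n_problems = 20000
sigma = 1.0
theta = rng.normal(0, 1, n_problems)   # "true prior" over problems
y = rng.normal(theta, sigma)

# Conjugate normal-normal posterior: N(post_mean, post_sd).
post_var = 1 / (1 / 1.0**2 + 1 / sigma**2)
post_mean = post_var * (y / sigma**2)  # prior mean is 0
post_sd = np.sqrt(post_var)

# 50% central posterior intervals (0.674 is the normal 75th percentile).
lo = post_mean - 0.674 * post_sd
hi = post_mean + 0.674 * post_sd
coverage = np.mean((lo < theta) & (theta < hi))
print(round(coverage, 2))   # close to 0.50 when averaging over problems
```

Conditional on, say, theta = 3, these intervals would badly undercover; averaged over the population of problems, they cover at their nominal rate.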
Second, I like the connection to hierarchical models, because in many settings we can think about a parameter of interest as being part of a batch, as in the examples we’ve been talking about recently, of modeling all the forking paths at once. In which case the true prior is the distribution of all these underlying effects.