Prior distributions on derived quantities rather than on parameters themselves

Following up on our discussion of the other day, Nick Firoozye writes:

One thing I meant by my initial query (but really didn’t manage to get across) was this: I have no idea what my prior would be on many many models, but just like Utility Theory expects ALL consumers to attach a utility to any and all consumption goods (even those I haven’t seen or heard of), Bayesian Stats (almost) expects the same for priors. (Of course it’s not a religious edict much in the way Utility Theory has, since there is no theory of a “modeler” in the Bayesian paradigm—nonetheless there is still an expectation that we should have priors over all sorts of parameters which mean almost nothing to us).

For most models with sufficient complexity, I also have no idea what my informative priors are actually doing and the only way to know anything is through something I can see and experience, through data, not parameters or state variables.

My question was more on the—let’s use the prior to come up with something that can be manipulated, and then use this to restrict or identify the prior.

For instance, we could use the Prior Predictive Density, or the Prior Conditional Forecast.

Just as an aside on Conditional Forecasts, we used to do this back when I was at Deutsche Bank using VARs or Cointegrations (just restricted VARs), simple enough because of Gaussian error terms: put yield curve and economic variables into a decent enough VAR or Cointegration framework, then condition on a long term scenario of inflation going up by 1%, or GDP going down by 1%, etc. We can use a simple conditioning rule with huge matrices to find the conditional densities of the others, e.g., yield curve variables. Then we would ask, are they reasonable? Do we think that, conditioning on CPI and GDP in 2y time, yield curves could be shaped like the mean +/- the standard deviation? In the case of looking at prior conditional forecasts, if they do not seem (subjectively) reasonable, the priors need to be changed. And if you can’t get anything that seems reasonable or plausible, probably the model needs to be changed!

Not knowing the literature on conditional forecasts, I used to call these Correlated Brownian Bridges, but it really is just a large dimensional normal random variable where you fix some data elements (the past), you condition on some (the future) or ascribe a density to them, and then find the conditional distributions of all the rest of them. Easy enough and works wonders in these macro-financial models where the LT unconditional econ forecasts are very bad but the LT yield curve forecasts conditioned on econ data are generally quite reasonable. This is also a very reasonable way to generate scenarios, preferably conditioning on as little information as possible.
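The conditioning rule described above is the standard formula for a multivariate normal: fix some components and read off the mean and covariance of the rest. Below is a minimal Python sketch of that rule, not the VAR machinery used at DB. The three variables (inflation, 2y yield, 10y yield) and all the numbers are made up for illustration, and `condition_gaussian` is a hypothetical helper name.

```python
import numpy as np

def condition_gaussian(mean, cov, known_idx, known_values):
    """Mean and covariance of the free components of a joint normal,
    given that the components in known_idx take the values known_values."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    known_idx = np.asarray(known_idx)
    free_idx = np.setdiff1d(np.arange(mean.size), known_idx)

    s_ff = cov[np.ix_(free_idx, free_idx)]
    s_fk = cov[np.ix_(free_idx, known_idx)]
    s_kk = cov[np.ix_(known_idx, known_idx)]

    # gain = s_fk @ inv(s_kk), computed with a linear solve (s_kk is symmetric)
    gain = np.linalg.solve(s_kk, s_fk.T).T
    shift = np.asarray(known_values, dtype=float) - mean[known_idx]
    cond_mean = mean[free_idx] + gain @ shift
    cond_cov = s_ff - gain @ s_fk.T
    return free_idx, cond_mean, cond_cov

# Hypothetical joint forecast: [inflation, 2y yield, 10y yield], in percent.
mean = [2.0, 3.0, 4.0]
cov = [[1.0, 0.6, 0.3],
       [0.6, 1.0, 0.5],
       [0.3, 0.5, 1.0]]

# Scenario: inflation ends up 1 point above its unconditional mean.
free_idx, cond_mean, cond_cov = condition_gaussian(mean, cov, [0], [3.0])
print(cond_mean)                     # conditional means of the two yields
print(np.sqrt(np.diag(cond_cov)))    # conditional standard deviations
```

The question in the text is then whether a yield curve shaped like `cond_mean` plus or minus the conditional standard deviations looks plausible under the scenario. If it does not, the prior (or the model) is what gets revised.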

Back at DB, we for instance could “show” that the most common yield curve movements (bull steepening and bear flattening, where say the 2y rate and 10y rate go down (bullish) and the 2s-10s slope steepens, or the 2y and 10y go up (bearish) and the 2s-10s slope flattens…yields and prices move inversely!) were largely related to demand shocks in the economy, where growth and inflation move together (typically the only shocks that the Fed really knows how to deal with), but that the atypical motions, bear steepening and bull flattening, seemed to coincide with supply shocks. Nowadays QE and financial instability would make for a more complex set of conditions of course.

BTW, we liked these far more than the standard Impulse-Response approach to VARs that econometricians usually use. No covariance data, unreasonable impulses, arbitrary order. There is usually no means of disentangling effects in using impulse response. At least conditional forecasts give something you might be able to see in reality.

I have no idea whether one can easily put enough constraints on the priors to make them fully determined. If for some reason we do not, we probably still have to make them unique. This is more like having some information for an informative prior but perhaps not enough to make it unique (e.g., I have a Gaussian prior and only know F(mean,variance)=c but no more to determine each uniquely).

My only real foray into ‘objective’ Bayesian methods was to suggest that some objective criteria could be used to decide between many competing means and variances, at least as a starting point. Say MaxEnt subject to the subjective constraints, or like in Reference priors, minimize the cross-entropy between the prior and the posterior subject to my subjective constraints, etc. I’m afraid I don’t know how to “Jeffreys-ize” these subjective priors! I think Jeffreys-izing is an all or nothing method, unlike minimizing cross entropy, maximizing entropy, etc. I suppose we can take our constraints and find the unique identification via some symmetry argument much like Jeffreys’ method, but this is not so obvious.

Irrespective, the goal is not to have priors on parameters exactly since I think this is damn near impossible. I think nobody knows what the correlation between the state variables in time t vs time t+1 should be to make the model all that reasonable (well hopefully they are uncorrelated, but who knows?), and why should state space models all have the same prior? There are so many questions that can easily come up.

The goal is to use the “black box” of the prior predictive density and the prior conditional density (the conditional in particular since you can look at model behaviour in a dynamic, scenario based setting) to inform us about how the informative priors should be constrained.

My actual contention here is—people do not have priors on parameters. They have priors on model behaviour. Parameters are hidden and we never ever observe them. But relationships in data, forecasts, conditional forecasts, all these are observable or involve observable quantities. And these we can have opinions about. If this identifies a prior, then great—job done. If it does not, we need further restrictions to help, which is where objective Bayes methods seem appropriate!

Please do let me know your thoughts. Again, I would tend to agree, there is no true objective. In reality there are many competing criteria, which all have their own merits (MaxEnt, Min Cross Ent, Jeffreys’, etc.). You still must subjectively choose one over another! But using these methods in this subjective prior identification problem seems not completely loony.

I don’t have much to add here. In some settings I think it can make sense to put a prior distribution on parameters; in other settings it can make more sense to encode prior information in terms of predictive quantities. In my paper many years ago with Frederic Bois, we constructed priors on our model parameters that made sense to us on a transformed scale. In Stan, by the way, you can put priors on anything that can be computed: parameters, functions of parameters, predictions, whatever. As we’ve been discussing a lot on this blog recently, strong priors can make sense, especially in settings with sparse data where we want to avoid being jerked around by patterns in the noise.


  1. I have an intuition that this issue — whether you have priors on parameters or data — is related somehow to realism — the issue of whether you think of your models as somehow being literally true or approximations to truth.

    • oops I mean “whether you think of your models as being literally true (or approximations to truth) or else you think of them as being only effective descriptions of the data”

      • Yes this jibes with my own intuitions. When I model physical dynamics, balls falling through air, fluid flowing through porous media, mechanical linkages vibrating against each other, whatever, I think of my model as pretty closely organized around real stuff that happens, real energy exchanges and so forth. In doing biological models often I can put some realism on the parameters as well: variability between experiments can come from pipetting variability, measurement variability in the photodiodes doing the fluorescence measurements, temperature changes, random variations in the plasmid DNA, whatever.

        I’ve done some financial modeling, and know people who’ve done a lot more than I have. When someone’s building yield curves or equity risk models, they’re really talking about the effect of aggregated actions of on the order of billions of people all making conscious decisions involving feedbacks between the different people. It’s as if you had say a billion mechanical linkages and a trillion interconnections… most of the things they’re likely to think of as parameters are themselves effectively some kind of aggregates over maybe millions of people and hundreds of millions of interconnections, or something like that. They’re not in the same boat, or even on the same ocean, as I am in my physical/bio sciences hat.

        • Anonymous says:

          This is a huge issue, and sometimes the lines get very very blurry and philosophical. Even what seem like physical quantities are aggregations if you consider the statistical mechanical basis for things like temperature, pressure, heat capacity, etc. Almost by definition, one scale’s “reality” is another scale’s “aggregate”.

          The question is, what to do with aggregation beyond scales that the human brain evolved to interpret – things at the scale of human populations, societies, cultures, etc. There, the distinction between a truthy fiction and a statistical hallucination can start to get really tricky. This comes up in some of Cosma Shalizi’s writings where he talks about “reification” in the context of things like genetics, IQ, and the meaning of correlational structure between them.

          • Anonymous says:

            (same anon)

            Another way of putting this is that what we consider “reality” is just another statistical aggregate. However, what distinguishes a “real” quantity from an “imaginary” one is whether it corresponds to a macroscopic observable on a scale that humans are familiar with and can recognize (e.g. temperature).

            When we aggregate data into parameters in applications in economics, sociology, or public health, we are not evolved to engage with the next scale up. Thus, we lack a clear selection criterion to say that “this aggregation corresponds to reality”, “this aggregation corresponds to fiction”.

          • I’m with you on this, much of my PhD research actually involves ways to construct continuum models as explicit statistical models of microscopic physical behavior. One nice thing about Newton’s laws is that they’re linear, so averages / totals are natural statistics in many mechanics problems. The decision-making behavior of groups is nothing like linear, so that averages are somewhat less applicable. Because of that, many of the aggregations we come up with for these social type phenomena are hopelessly “fictional”, possibly useful, but could never have the same kind of realism as for example “stress” being the aggregate of all momentum that fluxes through a small area in a unit time.

    • Anonymous says:

      Isn’t the idea for the two approaches to be equivalent? A prior on the parameters should imply a specific prior on the data, and vice versa.
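One direction of this equivalence is easy to sketch by simulation: draw parameters from the prior, draw data given those parameters, and look at the resulting distribution of the data. The toy below assumes a Beta(a, b) prior on a success probability and binomial data. The model and the helper name `prior_predictive_freq` are illustrative choices, not something from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_predictive_freq(a, b, n_trials, n_draws, rng):
    """Simulated prior predictive distribution of the number of successes
    in n_trials, when theta ~ Beta(a, b) and y | theta ~ Binomial(n_trials, theta)."""
    theta = rng.beta(a, b, size=n_draws)      # draw parameters from the prior
    y = rng.binomial(n_trials, theta)         # draw data given each parameter
    return np.bincount(y, minlength=n_trials + 1) / n_draws

# A uniform prior on theta implies a uniform prior predictive on 0..n_trials.
freq = prior_predictive_freq(1.0, 1.0, n_trials=10, n_draws=200_000, rng=rng)
print(np.round(freq, 3))
```

Going the other way, from a stated distribution on the data back to a prior on the parameters, is the harder problem the post is about, and a given set of views on observables need not pin the prior down uniquely.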

  2. Jeremy Fox says:

    Subhash Lele has done some work that I think is related to this issue, in settings where a paucity of hard data means that one has to rely on expert opinion. Lele suggests eliciting data rather than priors from experts. One of the benefits of his approach is that it leads to a natural way to quantify whether someone is an “expert”.

    And see here for links to various follow up papers:

    But I’m not a statistician, so I confess I’m unsure if Lele’s work actually gets at the issue raised in the post. Nor do I know the statistical literature well enough to say if anyone else has had similar ideas.

  3. ralmond says:

    This idea has been around for a long while, and is actually pretty sound. There is good reason to believe that subject matter experts (i.e., non-statisticians working with us to build models) no matter how comfortable they are with statistics are better off working in the domain of directly observable variables than only indirectly observable parameters.

    The reference I usually cite is:
    Chaloner, Kathryn M. and Duncan, George T. (1983). Assessment of a Beta Prior Distribution: PM Elicitation. The Statistician, 32, 174-180.


  4. Anonymous says:

    Isn’t this an overly-complex, roundabout way of reinventing a suboptimal form of hierarchical modeling?

    You have information on particular outcomes you want to fit to constrain your prior. You then want to use that prior to get a posterior predictive distribution based on some other data.

    To find the “constrained” prior space, you still need to start with some kind of diffuse hyperprior. So why not do the usual thing – model the latent structure which ties your prior information and your new data rather than rummaging around with these kinds of ad hoc, information-isolating procedures?

    • Fernando says:

      The reason is usability.

      Yes, ultimately you need to formulate a prior. But good luck asking a nurse what is the distribution of parameters a and b in a beta distribution related to the chance of success of a specific standard of care.

      Prior elicitation is the “soaking and poking” aspect of the scientific method, to quote R.D. Putnam. And it is a key aspect to modelling.

      • Fernando says:

        PS just to be clear: Presumably it is the nurse that has the good prior, not you the statistician.

      • Anonymous says:

        You don’t ask the nurse for the a and b in the beta distribution. Rather, one intends to 1) estimate them from whatever prior observations are being used to impose constraints on a and b, and then 2) use this a and b as the prior for posterior predictive inference.

        So your prior parameters are coupled between the prior data and the new data through some latent structure. You don’t elicit a prior from the nurse, but use a flexible hyperprior on a and b + some latent model structure that makes predictions for both the prior observations and the “new” observations (they don’t have to be the same kind of observations, for example, the prior observations available may be a different measurement or some summary statistics from a prior study). Then fit all the data within the model.

        I agree at some point you have to stop and make some ad hoc decisions about the model, but in this case, they’re trying to systematically estimate the prior, which seems to me to be re-inventing the hierarchical modeling wheel.

        • Fernando says:

          I am working under the assumption of no prior data, where all the prior information we have is in the nurse’s head/experience. (Hard to follow what the people at DB were doing.) Then you ask the nurse what her experience (data) looks like, and back out the prior.

          My reading of this post is that it is about the user interface to Bayes, not the method. Doing the same but in a different, more intuitive way. But not very clear.
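One simple way to "back out the prior" in the beta example is moment matching: ask for a typical success rate and how much it varies, then solve for a and b. This is only a sketch of one possible elicitation route, not the procedure from the post or from the papers cited in this thread. The function name and the numbers are hypothetical.

```python
def beta_from_mean_sd(mean, sd):
    """Beta(a, b) parameters whose mean and standard deviation match
    elicited values on the success-probability scale."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must be strictly between 0 and 1")
    var = sd ** 2
    max_var = mean * (1.0 - mean)   # variance must stay below this bound
    if not 0.0 < var < max_var:
        raise ValueError("sd is too large (or not positive) for this mean")
    nu = max_var / var - 1.0        # a + b, the 'equivalent prior sample size'
    return mean * nu, (1.0 - mean) * nu

# "It works for about 70% of patients, give or take 10 points."
a, b = beta_from_mean_sd(0.7, 0.1)
print(a, b)   # approximately 14 and 6
```

The result reads naturally in data terms: Beta(14, 6) carries about as much information as having seen 14 successes in 20 patients, which is a statement the nurse can check against experience.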

  5. Nick Menzies says:

    In my work I construct mechanistic models of infectious disease dynamics. The use of priors around outcomes has been adopted by some groups, under the title ‘Bayesian Melding’ (main development by Adrian Raftery). Note here there are priors around both parameters and outcomes. From my understanding it was a response to early UNAIDS HIV prevalence projections which were deemed implausible for various reasons. As the judgements about what was plausible or not were formed in the outcomes space, it seemed reasonable that the prior be operationalized there.

    I did not realize that Stan could accommodate priors on outcomes, and am really looking forward to trying it out with our kinds of problems.

  6. Very interesting topic. There are definitely certain applications where the parameter(s) quite literally exist and are in principle measurable, and other examples where they are not and the parameter just helps you write down a reasonable model for your prior beliefs.

    Sometimes in practice the parameter will become known with certainty eventually. For example, the mass of a planet (you have to go pretty far before that concept isn’t well defined) or the fraction of people in New Zealand that have a BMI over 30, or which door has the car behind it in a game show.

    In other cases the parameters definitely don’t exist and are only put there so you can describe your prior beliefs in an easy way. For example if you wanted to describe your prior beliefs about a binary sequence 01010100100…, it is hard to assign probabilities on the space of sequences directly, but might be much easier if you think of the sequence as e.g. a sequence of Bernoulli draws with probability theta where theta is unknown. It’s the same in hierarchical models. Sometimes “populations” exist and it makes sense to think of parameters as being drawn from a population that exists. In other situations the device of saying there are hyperparameters etc is just a convenient way to assign sensible prior probabilities over a high dimensional space.

  7. ” In Stan, by the way, you can put priors on anything that can be computed: parameters, functions of parameters, predictions, whatever”

    That’s amazing. Presumably the autodiff stuff uses the jacobian to work out what the prior is on the parameter that is acting as a coordinate?

  8. Have you seen the new paper by Peter Hoff and coworkers on specifying priors that are highly informative for specific functionals and fairly diffuse otherwise?

  9. In some respects this is a very reasonable thesis and flips the ‘non-informative’ prior question on its head. Rather than complaining that a non-informative prior will not be so in a different parameterization, one should think about the observable (data) and specify the non-informative prior in that space.
    Even simple problems can benefit from this approach: if one is using a digital measurement array to study two states, then one can invoke a standard uniform prior on the actual measurement of each state over the dynamic range of the experimental apparatus:

    x1~U(0,R) & x2~U(0,R)

    If one were then to set this up as a regression problem, the corresponding prior on d=x1-x2 is a rather informative symmetric triangular distribution on (-R,R).
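The triangular shape is quick to check by simulation. The sketch below assumes x1 and x2 are independent and takes R = 1 for convenience. It compares a histogram of d = x1 - x2 with the density (R - |d|) / R^2 on (-R, R).

```python
import numpy as np

rng = np.random.default_rng(1)
R = 1.0
n = 500_000

x1 = rng.uniform(0.0, R, size=n)
x2 = rng.uniform(0.0, R, size=n)
d = x1 - x2

def triangular_pdf(d, R):
    """Density of the difference of two independent U(0, R) variables."""
    return np.clip(R - np.abs(d), 0.0, None) / R**2

hist, edges = np.histogram(d, bins=40, range=(-R, R), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_err = np.max(np.abs(hist - triangular_pdf(centers, R)))

print(max_err)      # small: the histogram tracks the triangular density
print(np.var(d))    # close to R**2 / 6
```

So two flat priors on the measurements put most of the prior mass for the difference near zero, which is far from flat.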

  10. Chris says:

    If one is giving the parameters of a model uninformative priors, I think it is always a good check to see what this implies about the quantities one wants to have posterior distributions for. I.e., one just does all the analysis without the data and if everything is OK then quantities of interest will have uninformative (uniform usually) “posteriors”. If they don’t, one can go back and see if there is another plausible uninformative prior on the parameters which gives an uninformative posterior distribution on the quantities of interest when no data is used.
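A common case where this check pays off is a logistic model. The sketch below assumes an intercept-only model with a normal prior on the log-odds and asks what that implies for the probability itself, with no data involved. The two prior scales (10 and 1.5) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def implied_probability_draws(prior_sd, n_draws, rng):
    """Draw a log-odds intercept from N(0, prior_sd^2) and map it to a probability."""
    alpha = rng.normal(0.0, prior_sd, size=n_draws)
    return 1.0 / (1.0 + np.exp(-alpha))

extreme_share = {}
for prior_sd in (10.0, 1.5):
    p = implied_probability_draws(prior_sd, 100_000, rng)
    # Share of prior mass on probabilities below 1% or above 99%
    extreme_share[prior_sd] = np.mean((p < 0.01) | (p > 0.99))
    print(prior_sd, extreme_share[prior_sd])
```

The "vague" N(0, 10^2) prior on the log-odds puts well over half of its mass on probabilities below 1% or above 99%, which is strongly informative on the scale one actually cares about. The tighter prior is much closer to flat there.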

  11. AV says:

    “no theory of a “modeler” in the Bayesian paradigm”

    Does anyone wish there were?

  12. I know there were some comments on this being hierarchical. I personally see nothing hierarchical in his method. Rather this is pure prior elicitation through the route of asking the statistician or the expert if you wish his views on an observable forecast rather than on some unseen and almost unknowable set of parameters.

    Btw I see no point in having uninformative views in data space. That is quite unrealistic. The purpose was to find functionals or transforms of the parameters about which one would or should have opinions. I would think most economists are comfortable telling you mean and std
    This bears some similarities to the DP Nonparametric paper Andrew quoted since we may just be specifying a set of functionals of the parameters, constraining them or putting some informative prior on them while there is effectively a set of orthogonal functionals which we may have no opinion on. Rather than assume that the non-informative directions should be endowed with a Dirichlet prior, I would suggest perhaps these should be truly noninformative via a maxent type optimisation or a min cross entropy optimisation (a la reference priors) to allow for a unique (possibly improper) prior. The alternative would be to have multiple informative and/or uninformative priors each of which satisfies the constraints, but introducing uncertainty (i.e., no fixed unique probability) probably makes little sense at this juncture. Best to just stick with one prior.

    Unlike the Bayesian melding literature, which seems to look only at forecasted means, I find it is relatively easy in the case I am looking at to forecast the entire density (or to simulate from it).

    I am looking specifically at a VAR (a vector autoregression) and will give it normal iid innovations and conjugate priors throughout. So while the notion of computing the full conditional density seems challenging merely in terms of the size of the matrix computations, all are fully tractable, and the densities and joint densities of forecasts conditional on the terminal point of all or some of the variables are straightforward to compute. As I said, this is a bit like a Brownian Bridge. Very easy.

    I can then have subjective views on the conditional mean and variance of the remaining variables for all forecast horizons, and by imposing these views will effectively constrain the parameters and effectively elicit a prior (which may or may not be proper depending on the total number of views I impose relative to the total number of free parameters). I see it as quite simple computationally but perhaps more meaningful philosophically.

    It just seems more natural to have a view on something I can see, observe or experience such as data, rather than some derived quantity or hidden variable such as a model parameter, which may have a physical interpretation (e.g., the mass of a planet or proportion of Kiwis with high BMIs) or may not at all, and may or may not be measurable and may not even make much of a difference to what the model actually does claim to model (shouldn’t we be putting entirely vague priors on nuisance parameters? If they really don’t matter we shouldn’t be trying to have informed views on them after all).

    All I should ever really care about is what I can see and the model should behave more or less as I would expect (at least the prior predictive densities and prior conditional forecasts should….the posteriors are entirely different question). In some sense all parameters are really merely nuisance. It is data and observations we truly care about.