For each parameter (or other qoi), compare the posterior sd to the prior sd. If the posterior sd for any parameter (or qoi) is more than 0.1 times the prior sd, then print out a note: “The prior distribution for this parameter is informative.”

Statistical models are placeholders. We lay down a model, fit it to data, use the fitted model to make inferences about quantities of interest (qois), check to see if the model’s implications are consistent with data and substantive information, and then go back to the model and alter, fix, update, augment, etc.

Given that models are placeholders, we’re interested in the dependence of inferences on model assumptions. In particular, with Bayesian inference we’re often concerned about the prior.

With that in mind, a while ago I came up with this recommendation.

For each parameter (or other qoi), compare the posterior sd to the prior sd. If the posterior sd for any parameter (or qoi) is more than 0.1 times the prior sd, then print out a note: “The prior distribution for this parameter is informative.”

The idea here is that if the prior distribution is informative in this way, it can make sense to think harder about it, rather than just accepting the placeholder.
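As a concrete illustration, here is a minimal R sketch of the check, assuming you already have posterior draws (say, from rstan::extract()) and a prior sd for each parameter; the function name, the draws format, and the 0.1 default are just for illustration:

    # Flag parameters whose posterior sd exceeds 0.1 times the prior sd.
    # posterior_draws: named list of draws, one numeric vector per parameter.
    # prior_sd: named vector of prior sds for the same parameters.
    check_prior_sd <- function(posterior_draws, prior_sd, threshold = 0.1) {
      for (param in names(prior_sd)) {
        post_sd <- sd(posterior_draws[[param]])
        if (post_sd > threshold * prior_sd[[param]]) {
          cat(sprintf("The prior distribution for %s is informative (posterior sd / prior sd = %.2f).\n",
                      param, post_sd / prior_sd[[param]]))
        }
      }
    }

    # Hypothetical usage with a fitted Stan model:
    # draws <- rstan::extract(fit, pars = c("mu", "sigma"))
    # check_prior_sd(draws, prior_sd = c(mu = 10, sigma = 10))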

I’ve been interested in using this idea and formalizing it, and then the other day I got an email from Virginia Gori, who wrote:

I recently read your contribution to the Stan wiki page on prior choice recommendations, suggesting to ensure that the ratio of the standard deviations of the posterior and the prior(s) is at least 0.1 to assess how informative priors are.

I found it very useful, and would like to use it in a publication. Searching online, I could only find this criterion in the Stan manual. I wonder if there’s a peer-reviewed publication on this I should reference.

I have no peer-reviewed publication, or even any clear justification of the idea, nor have I seen it in the literature. But it could be there.

So this post serves several functions:

– It’s something that Gori can point to as a reference, if the wiki isn’t enough.

– It’s a call for people (You! Blog readers and commenters!) to point us to any relevant literature, including ideally some already-written paper by somebody else proposing the above idea.

– It’s a call for people (You! Blog readers and commenters!) to suggest some ideas for how to write up the above idea in a sensible way so we can have an arXiv paper on the topic.

19 thoughts on “For each parameter (or other qoi), compare the posterior sd to the prior sd. If the posterior sd for any parameter (or qoi) is more than 0.1 times the prior sd, then print out a note: ‘The prior distribution for this parameter is informative.’”

  1. So by increasing the prior sd, you can get rid of the note/warning. Folks might interpret that to mean that flat priors are non-informative and “safe”. That’s not what you want, right?

    • The general notion of prior sensitivity is the derivative of some posterior estimate (expectation, quantile, etc.) w.r.t. some prior parameter. I’d recommend the work of Ryan Giordano on prior sensitivities and its relation to variational inference and Laplace approximations: http://www.jmlr.org/papers/volume19/17-670/17-670.pdf
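      For intuition, here is a minimal finite-difference sketch of that kind of sensitivity, assuming a conjugate normal model with known data sd so the posterior mean has a closed form (a stand-in example, not the method of the linked paper):

      # Posterior mean for y ~ normal(mu, sigma) with sigma known and
      # prior mu ~ normal(mu0, tau0); standard precision-weighted formula.
      post_mean <- function(y, mu0, tau0, sigma = 1) {
        prec <- 1 / tau0^2 + length(y) / sigma^2
        (mu0 / tau0^2 + sum(y) / sigma^2) / prec
      }

      y <- rnorm(10)
      eps <- 1e-6
      # Finite-difference estimate of d(posterior mean) / d(prior mean) at mu0 = 0, tau0 = 1;
      # for this model it equals 1 / (1 + N), i.e. about 0.09 for N = 10.
      (post_mean(y, eps, 1) - post_mean(y, -eps, 1)) / (2 * eps)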

      Coincidentally, the paper Jonathan (another one) cites is about DSGE models. We’re about to start working on those more intensely in Stan now that we have the basic algebraic and differential equation solvers in place. Next up is the combination differential algebraic equation solver, which I think Yi Zhang has finished coding. Charles Margossian has been working with Shoshana Vasserman specifically on economic equilibrium models and Charles is busy exploring better solvers (probably requiring higher-order derivatives somewhere). Then we add stochasticity (that goes well beyond my stat theory and applied math level) and perhaps even PDE solvers (which go beyond my applied math level). It’s going to take a while to find good algorithms for all this stuff that we can differentiate through.

      Essentially, the way Stan handles HMC is by calculating sensitivities (derivatives) of the posterior to all of the parameters (not just the prior parameters).

  2. This reminds me of the visualization paper of Gabry et al. in that it’s looking at prior predictive behavior to see if it’s reasonable.

    Andrew’s proposed notion of informativeness is a bit odd from a Bayesian perspective in that it’s not about what I know about the problem domain, but about the size of the data. I think Andrew’s asking something like when the data measurably overwhelms the prior, or when the “data speaks for itself,” or when one might avoid the criticism of “putting one’s thumb on the scale.”

    The metric we’re looking for would measure concentration and compatibility of posterior with prior. That needs to account for the kind of sensitivity you get with incompatible priors. Sensitivity here means the derivative of posterior mean(s) w.r.t. prior parameter(s).

    The following little experiment was inspired by a visualization of sensitivity Duco Veen was showing me when he was here to illustrate incompatible priors and their effect on concentration of measure.

    I simulated 10 observations

    y[1], ..., y[10] ~ normal(0, 1), 
    

    and then used a model

    y ~ normal(mu, sigma), 
    

    with priors

    sigma ~ normal(0, 10) 
    mu ~ normal(mu_mu, sigma_mu)
    

    Here’s a quick table; it should really be a heat map with a line at the informativeness boundary (actually multiple heat maps with varying sizes of data).

    prior               posterior (approx)   informative
    ----------------    -------------------  -----------
    normal(0, 0.1)      normal( 0,    0.09)            *
    normal(0, 1)        normal(-0.02, 0.34)            *
    normal(0, 2)        normal(-0.02, 0.35)            *
    normal(0, 3)        normal(-0.02, 0.36)            *
    normal(0, 4)        normal(-0.02, 0.36)
    normal(0, 10)       normal(-0.02, 0.36)
    normal(0, 100)      normal(-0.02, 0.36)
    
    normal(0, 4)        normal(-0.02, 0.36)
    normal(1, 4)        normal(-0.02, 0.36)
    normal(5, 4)        normal( 0.02, 0.37)
    normal(10, 4)       normal( 0.06, 0.37)
    normal(15, 4)       normal( 0.10, 0.38)
    normal(20, 4)       normal( 0.15, 0.41)             *
    normal(100, 4)      normal(94,    4)                *
    

    The posterior isn’t very sensitive to the prior scale between normal(0, 1) and normal(0, 100), but the prior only stops being flagged as informative at normal(0, 4).

    On the other hand, the posterior is fairly sensitive to the prior location between normal(0, 4) and normal(15, 4), but the prior is only flagged as informative at normal(20, 4).

    When do we consider a prior not compatible with the data? With normal(10, 4), the posterior mean sits about 2.5 prior standard deviations from the prior mean, but with normal(20, 4) it’s about 5, which is getting out there, whereas with normal(100, 4) we’re clearly beyond any notion of prior compatibility.

    Methodology Section

    Here’s the Stan code.

    data {
      real mu_mu;               // prior mean for mu
      real<lower=0> sigma_mu;   // prior sd for mu
      int<lower=0> N;
      vector[N] y;
    }
    parameters {
      real mu;
      real<lower=0> sigma;      // constrained positive so the scale is valid
    }
    model {
      y ~ normal(mu, sigma);
      mu ~ normal(mu_mu, sigma_mu);
      sigma ~ normal(0, 10);    // half-normal, given the lower bound on sigma
    }
    

    Here’s the driver code.

    library(rstan)
    model <- stan_model('normal_sens.stan')
    N <- 10
    y <- rnorm(N)   # simulated data: y ~ normal(0, 1)

    # Fit the model under a normal(mu_mu, sigma_mu) prior on mu and print the
    # posterior summary for mu and sigma.
    prior_test <- function(model, y, sigma_mu, mu_mu) {
      data <- list(N = length(y), y = y, mu_mu = mu_mu, sigma_mu = sigma_mu)
      fit <- sampling(model, data, iter = 1e5, refresh = 0)
      print("", quote = FALSE)
      print(sprintf("sigma_mu = %10.5f  mu_mu = %10.5f", sigma_mu, mu_mu), quote = FALSE)
      print(fit, prob = c(), digits = 4, pars = c("mu", "sigma"))
    }

    # Sweep the prior scale with the prior mean fixed at 0.
    for (log10_sigma_mu in -2:2)
      prior_test(model, y, mu_mu = 0, sigma_mu = 10^log10_sigma_mu)

    # Sweep the prior mean with the prior scale fixed at 4.
    for (log10_mu_mu in 0:4)
      prior_test(model, y, mu_mu = 10^log10_mu_mu, sigma_mu = 4)
    

    I should've used seeds, but in lieu of that, here's my simulated y,

    > y
     [1] -2.0875598  1.2460557  0.4382849  0.4323869  0.4206121 -0.3414313
     [7] -0.1056257 -0.1234214  0.8128717 -0.9181756
    
  3. I agree with Bob that this seems a measure of the amount of data compared to the amount of “prior data,” but it ignores whether the “prior information” confirms or contradicts the information provided by the data. It could be argued that the overlap of prior and posterior is in some sense relevant to the discussion of how “informative” the prior actually was.

    Using the typical Gaussian example (with location mu unknown and scale sigma=1), a prior Normal(0, 10) would never be informative, because even with a single data point X the posterior would have standard deviation 0.995 (i.e. 9.95% of the prior standard deviation). The same is true for any normal prior with standard deviation larger than 10.

    If the prior is Normal(0,1), which is equivalent to the information carried by a single prior data point X=0, then the prior would be informative if the data consist of fewer than 99 points. For N=99, the standard deviation of the posterior would be 0.1, exactly 0.1 times the standard deviation of the prior. And the estimate for mu would be 99 times closer to the average of the data than to the center of the prior.

    In this normal example, the condition sd(posterior) > 0.1*sd(prior) means that unless the precision of the likelihood is at least 99 times the precision of the prior, the latter will be flagged as informative. You need roughly a hundred times (99 times, to be exact) as much data as the quantity of data implicit in the prior to say the prior is uninformative.
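    A quick numerical check of the arithmetic above, using the closed-form posterior sd for the conjugate normal model with known sigma=1 (a sketch; the function name is just for illustration):

    # Posterior sd for y[1..N] ~ normal(mu, sigma) with prior mu ~ normal(0, prior_sd).
    post_sd <- function(N, prior_sd = 1, sigma = 1) 1 / sqrt(1 / prior_sd^2 + N / sigma^2)
    post_sd(99)                 # 0.1: exactly 0.1 times the Normal(0, 1) prior sd
    post_sd(1, prior_sd = 10)   # 0.995: the Normal(0, 10) prior is never flagged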

    And the “informativeness” would be the same regardless of how far the prior is from the data. Consider the following examples (as before: sigma=1, N=99), all sitting at the suggested threshold of 0.1:

    a) Prior: Normal( 0 , 1 ) / Likelihood: Normal( 0.1 , 0.1 ) / Posterior: Normal( 0.1 , 0.1 )

    b) Prior: Normal( 0 , 1 ) / Likelihood: Normal( 10 , 0.1 ) / Posterior: Normal( 9.9 , 0.1 )

    c) Prior: Normal( 0 , 1 ) / Likelihood: Normal( 100 , 0.1 ) / Posterior: Normal( 99 , 0.1 )

    In the first case, the posterior is almost identical to the data and completely included within the prior.

    In the second case, the posterior is far from the prior but has substantial overlap with the likelihood (center displaced by one standard deviation).

    In the third case, the posterior is far from either the prior or the likelihood.

    • I am wondering: a) The prior can be informative both in its scale and in its shape. For example, an inverse-gamma(2, 1) prior and a half-normal(0, sigma) prior can have the same prior variance for the group variance tau squared, but the former excludes the mode around 0 and may therefore shrink the posterior sd. By a simple prior-posterior sd comparison, we would be more likely to say the half-normal prior is “more informative,” even though it is the one that allows small tau and potential bimodality?

      b) What is the prior sd in a hierarchical model? Say theta ~ N(mu, tau), and mu and tau have their own priors. One can simulate the prior sd for theta by simulating mu and tau first, then simulating theta and computing its sd. But that means that as the hierarchy goes deeper and deeper (say, another level of hierarchical priors on mu and tau), the prior will become weaker and weaker?
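      A minimal sketch of that simulation, with hypothetical hyperpriors mu ~ normal(0, 5) and tau ~ half-normal(0, 5) standing in for “their own priors”:

      S <- 1e5
      mu <- rnorm(S, 0, 5)
      tau <- abs(rnorm(S, 0, 5))   # half-normal draw for the group sd
      theta <- rnorm(S, mu, tau)   # prior draws of theta
      sd(theta)                    # Monte Carlo estimate of the implied prior sd of theta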

  4. This seems like Mike Evans’ stuff to some extent, e.g.:

    – measure data info by posterior relative to prior
    – if data info small (posterior similar to prior) then the estimate is driven by prior

    He also has some stuff on prior-data conflict

  5. Gori, if you do read the comments and do cite this, then please include an argument to justify the idea in your own case, and don’t just make it an appeal to authority.

    Andrew, please, please don’t put this into a publication anywhere. Even the capitulation to put it here as a reference, without qualification, isn’t a good idea. I see where you admit it’s untested but that’s the problem.

    I’m not arguing that the idea is bad in itself. However, Cohen’s effect sizes, .05, and others have all become unthinking “rules,” supported by references, as the thing to do. There’s very little support here for the idea and, of course, due to lack of exploration of it, no qualification of it. A better blog post for people to cite might be one that explores this as an idea: not focusing on the 0.1, but instead describing how one might go about arguing that a prior is informative for a particular dataset in general, with 0.1 as, at the very most, an example.

    (And I do like the idea of looking at the posterior and arguing the prior is informative. I don’t see anything wrong with admitting the placeholder or default prior affects things, or perhaps is similar to the actual distribution.)

  6. I wonder if the KL-Divergence between the prior and posterior could be used as a measure of how informative a prior is. Although, philosophically, how “informative” a prior is sounds a bit upside-down to me, as it’s the data that updates our prior into the posterior. We would expect to learn KL(posterior||prior) bits of information after updating the knowledge we had before seeing the data, but asking about how informative is a prior seems to flip this to KL(prior||posterior).
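    For the normal case both directions have closed forms; here is a small sketch using one prior/posterior pair from the table in an earlier comment (the formula is the standard KL divergence between two univariate normals, in nats):

    # KL( N(mu1, sd1) || N(mu2, sd2) ) for univariate normals.
    kl_normal <- function(mu1, sd1, mu2, sd2) {
      log(sd2 / sd1) + (sd1^2 + (mu1 - mu2)^2) / (2 * sd2^2) - 0.5
    }
    kl_normal(-0.02, 0.36, 0, 4)   # KL(posterior || prior): what the data taught us
    kl_normal(0, 4, -0.02, 0.36)   # KL(prior || posterior): the flipped direction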

  7. Let’s think about what it means for a prior to inform us about a parameter. The prior specifies a region of space in which we think the parameter value lives. In addition to the concept of a binary membership (in or out of the set) it provides a “weight” to each segment of space, specifying a degree of membership.

    If the posterior distribution fails to fall in the high probability region for the prior, then this prior is mis-informing the posterior (or the model itself is wrong).

    If the posterior distribution falls in the high probability region for the prior, then the prior is giving accurate information… but how much? A decent measure would be to take the 95% posterior interval for the parameter and calculate how much probability that region contains under the prior. If the ratio of the probabilities PriorP(95% posterior interval)/0.95 is small, then the prior is diffuse with respect to the posterior. If the ratio is large, then the prior is concentrated with respect to the posterior. This is a continuous measure of the diffuseness, so I would recommend calculating this quantity and simply reporting it (a small sketch of the computation follows the examples below), such as:

    The ratio of the prior probability to the posterior probability for the 95% posterior region is 0.28 indicating prior information that is compatible with the posterior inference.

    or

    The ratio of the prior probability to the posterior probability for the 95% posterior region is 0.0021 indicating that the prior primarily suggested other regions of space than the region picked out by the posterior.

    Note how this wording works fine whether the prior is over-diffuse, or the prior is concentrated but in a region different from the posterior.
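    Here is a minimal sketch of that computation, assuming a vector of posterior draws and a normal prior; the function name and the example inputs are just for illustration:

    # PriorP(central 95% posterior interval) / 0.95, for a normal prior.
    prior_coverage_ratio <- function(post_draws, prior_mean, prior_sd) {
      interval <- quantile(post_draws, c(0.025, 0.975))
      prior_p <- pnorm(interval[2], prior_mean, prior_sd) -
                 pnorm(interval[1], prior_mean, prior_sd)
      unname(prior_p / 0.95)
    }
    # Hypothetical usage, e.g. with draws of mu and the normal(0, 4) prior from the table above:
    # prior_coverage_ratio(rstan::extract(fit)$mu, prior_mean = 0, prior_sd = 4)
    prior_coverage_ratio(rnorm(1e5, -0.02, 0.36), 0, 4)   # diffuse prior: ratio well below 1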

    • Interesting quantification of a check that I normally just do graphically. Splitting hairs for a minute here, you write “If the posterior distribution fails to fall in the high probability region for the prior, then this prior is mis-informing the posterior (or the model itself is wrong).” A third possibility is that you have a (probably small) noisy sample that is jerking your posterior around. The model itself is fine (including prior) but you have a bad roll of the dice, so to speak. Of course, one can argue that this could be resolved by a “better” model, but in practice we don’t always have a reasonable way to get there.

      • If you have a truly small sample, like 1-5 data points this could be the case, but the most likely issue for a somewhat larger sample is that your model is misinformed about the size of the model + measurement errors, maybe due to a biased (low) prior for that size of error.

      • Another way this could happen is if you have a kind of clustered data process that you aren’t modeling, for example if your data comes in batches, each with a common error:

        y = xbatch[i] + err

        and xbatch is constant for a while and then changes… With a large sample you can take this as y = xbar + err2, where xbar is the mean of xbatch and err2 incorporates both err and the variation in xbatch.

        But if you only have 1 or 2 values for xbatch in your small data sample… then you will go far wrong underestimating the variation in your data.
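        A toy simulation of this, with made-up numbers: with a single observed batch, sd(y) reflects only the within-batch error and badly understates the variation you would see across many batches.

        set.seed(1)
        per_batch <- 100
        one_batch  <- rnorm(1, 0, 5) + rnorm(per_batch, 0, 1)          # one value of xbatch
        many_batch <- rep(rnorm(1000, 0, 5), each = per_batch) +
                      rnorm(1000 * per_batch, 0, 1)                    # many batches
        sd(one_batch)    # ~1: only the measurement error err is visible
        sd(many_batch)   # ~5.1 = sqrt(5^2 + 1^2): batch variation now shows up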
