y = xbatch[i] + err

and xbatch is constant for a while and then changes… With a large sample you can treat this as y = xbar + err2, where xbar is the mean of xbatch and err2 incorporates both err and the variation in xbatch.

But if you only have 1 or 2 values for xbatch in your small data sample… then you will go far wrong underestimating the variation in your data.
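The underestimated-variation point is easy to see in simulation. A minimal sketch (scales assumed for illustration) comparing many batches against a single batch:

```python
import numpy as np

rng = np.random.default_rng(0)
sd_batch, sd_err = 2.0, 1.0   # assumed illustrative scales
n = 1000

# Many batches: every observation gets its own batch effect, so the
# sample sd of y reflects both components, sqrt(sd_batch^2 + sd_err^2).
many = rng.normal(0, sd_batch, n) + rng.normal(0, sd_err, n)

# One batch: the batch effect is a single shared constant, so the
# sample sd only reflects sd_err -- the batch variation is invisible.
one = rng.normal(0, sd_batch) + rng.normal(0, sd_err, n)

print(np.std(many))  # near sqrt(5) ~ 2.24
print(np.std(one))   # near 1.0
```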

If the posterior distribution fails to fall in the high-probability region of the prior, then the prior is mis-informing the posterior (or the model itself is wrong).

If the posterior distribution falls in the high-probability region of the prior, then the prior is giving accurate information… but how much? A decent measure would be to take the 95% posterior interval for the parameter and calculate how much probability that region contains under the prior. If the ratio PriorP(95% posterior interval)/0.95 is small, the prior is diffuse relative to the posterior; if it is large, the prior is concentrated relative to the posterior. This is a continuous measure of diffuseness, so I would recommend calculating it and simply reporting it, for example:

The ratio of the prior probability to the posterior probability for the 95% posterior region is 0.28 indicating prior information that is compatible with the posterior inference.

or

The ratio of the prior probability to the posterior probability for the 95% posterior region is 0.0021, indicating that the prior primarily suggested regions of parameter space other than the region picked out by the posterior.

Note how this wording works whether the prior is over-diffuse or the prior is concentrated in a region different from the posterior.
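The proposed ratio is straightforward to compute. A sketch with scipy, using a hypothetical normal prior and posterior (values chosen so the ratio lands near the first example above):

```python
from scipy import stats

prior = stats.norm(0, 1)          # hypothetical prior
posterior = stats.norm(0.5, 0.2)  # hypothetical posterior

# 95% central posterior interval, and the prior probability it contains.
lo, hi = posterior.ppf(0.025), posterior.ppf(0.975)
ratio = (prior.cdf(hi) - prior.cdf(lo)) / 0.95
print(ratio)  # roughly 0.285: a diffuse but compatible prior
```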

Justin

http://www.statisticool.com

*I asterisk this because it can happen even in seemingly low-dimensional problems with non-linearities or other complexities in their dynamics.

Andrew, please, please don’t put this into a publication anywhere. Even the concession of putting it here as a reference, without qualification, isn’t a good idea. I see that you admit it’s untested, but that’s the problem.

I’m not arguing that the idea is bad in itself. However, Cohen’s effect sizes, .05, and others have all become unthinking “rules”, supported by references, as the thing to do. There’s very little support here for the idea and, due to the lack of exploration, no qualification of it. A better blog post for people to cite might explore this as an idea: not focusing on the 0.1 but describing how one might argue, in general, that a prior is informative for a particular dataset, with 0.1 at most as an example.

(And I do like the idea of looking at the posterior and arguing the prior is informative. I don’t see anything wrong with admitting the placeholder or default prior affects things, or perhaps is similar to the actual distribution.)

– measure data info by posterior relative to prior

– if data info small (posterior similar to prior) then the estimate is driven by prior

He also has some stuff on prior-data conflict

b) What is the prior sd in a hierarchical model? Say theta ~ N(mu, tau), where mu and tau have their own priors. One can simulate the prior sd for theta by first simulating tau and mu and then simulating theta and computing its sd. But does this mean that as the hierarchy goes deeper and deeper (say, further hierarchical priors on mu and tau), the prior becomes weaker and weaker?
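That simulation takes a few lines. A sketch with assumed hyperpriors mu ~ N(0, 1) and tau ~ half-N(0, 1): the marginal prior variance of theta is Var(mu) + E[tau²], so each added level adds variance, and yes, the marginal prior on theta widens (weakens) as the hierarchy deepens.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Hyperpriors (assumed for illustration): mu ~ N(0, 1), tau ~ |N(0, 1)|.
mu = rng.normal(0, 1, n)
tau = np.abs(rng.normal(0, 1, n))
theta = rng.normal(mu, tau)

# Marginal prior variance: Var(mu) + E[tau^2] = 1 + 1 = 2.
print(np.std(theta))  # near sqrt(2) ~ 1.414
```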

Using the typical Gaussian example (location mu unknown, scale sigma=1), a prior Normal(0, 10) would never be informative, because even with a single data point X the posterior would have standard deviation 0.995 (i.e. 9.95% of the prior standard deviation). The same is true for any normal prior with standard deviation larger than 10.

If the prior is Normal(0, 1), which carries the same information as a single prior data point X=0, then the prior would be informative whenever the data consist of fewer than 99 points. For N=99, the standard deviation of the posterior would be 0.1, exactly 0.1 times the standard deviation of the prior, and the estimate for mu would be 99 times closer to the average of the data than to the center of the prior.

In this normal example, the condition sd(posterior) > 0.1*sd(prior) means that unless the precision of the likelihood is at least 99 times the precision of the prior, the latter will be informative. You need roughly one hundred times more data than the quantity of data implicit in the prior to say the prior is uninformative.

And the “informativeness” would be the same regardless of how far the prior is from the data. Consider the following examples (as before: sigma=1, N=99), all sitting at the suggested threshold of 0.1:

a) Prior: Normal( 0 , 1 ) / Likelihood: Normal( 0.1 , 0.1 ) / Posterior: Normal( 0.1 , 0.1 )

b) Prior: Normal( 0 , 1 ) / Likelihood: Normal( 10 , 0.1 ) / Posterior: Normal( 9.9 , 0.1 )

c) Prior: Normal( 0 , 1 ) / Likelihood: Normal( 100 , 0.1 ) / Posterior: Normal( 99 , 0.1 )

In the first case, the posterior is almost identical to the data and completely included within the prior.

In the second case, the posterior is far from the prior but has substantial overlap with the likelihood (center displaced by one standard deviation).

In the third case, the posterior is far from either the prior or the likelihood.
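All three cases follow from the usual precision-weighted conjugate-normal update; a quick check (likelihood sd 1/sqrt(99) ≈ 0.1 for N = 99, sigma = 1):

```python
import numpy as np

def posterior(prior_mu, prior_sd, lik_mu, lik_sd):
    """Conjugate-normal update: precisions add, means are
    precision-weighted."""
    prec = 1 / prior_sd**2 + 1 / lik_sd**2
    mean = (prior_mu / prior_sd**2 + lik_mu / lik_sd**2) / prec
    return mean, 1 / np.sqrt(prec)

lik_sd = 1 / np.sqrt(99)  # N = 99 points with sigma = 1
for lik_mu in (0.1, 10, 100):  # cases a), b), c)
    print(posterior(0, 1, lik_mu, lik_sd))
# case b), for instance, gives mean 9.9 and sd 0.1
```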

In our 1996 paper we made some prior sensitivity graphs using a method I call “static sensitivity analysis.” In this 2017 post I connected this idea to the work of Giordano et al. See also this followup from 2018.

Coincidentally, the paper Jonathan (another one) cites is about DSGE models. We’re about to start working on those more intensely in Stan now that we have the basic algebraic and differential equation solvers in place. Next up is the combination differential algebraic equation solver, which I think Yi Zhang has finished coding. Charles Margossian has been working with Shoshana Vasserman specifically on economic equilibrium models and Charles is busy exploring better solvers (probably requiring higher-order derivatives somewhere). Then we add stochasticity (that goes well beyond my stat theory and applied math level) and perhaps even PDE solvers (which go beyond my applied math level). It’s going to take a while to find good algorithms for all this stuff that we can differentiate through.

Essentially, the way Stan handles HMC is by calculating sensitivities (derivatives) of the posterior with respect to all of the parameters (not just the prior parameters).

Andrew’s proposed notion of informativeness is a bit odd from a Bayesian perspective in that it’s not about what I know about the problem domain, but about the size of the data. I think Andrew’s asking something like when the data measurably overwhelms the prior, or when the “data speaks for itself”, or when one might avoid the criticism of “putting one’s thumb on the scale”.

The metric we’re looking for would measure concentration and compatibility of posterior with prior. That needs to account for the kind of sensitivity you get with incompatible priors. Sensitivity here means the derivative of posterior mean(s) w.r.t. prior parameter(s).
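For the conjugate-normal case that sensitivity can be written down or finite-differenced directly. A sketch, with the data summary (ybar, n, sigma) assumed for illustration:

```python
# Finite-difference sensitivity of the conjugate-normal posterior mean
# to the prior location mu_mu (data values assumed for illustration).
def post_mean(mu_mu, sigma_mu, ybar=0.0, n=10, sigma=1.0):
    prec = 1 / sigma_mu**2 + n / sigma**2
    return (mu_mu / sigma_mu**2 + n * ybar / sigma**2) / prec

eps = 1e-6
sens = (post_mean(eps, 4.0) - post_mean(0.0, 4.0)) / eps
print(sens)  # the weight on the prior: (1/16) / (1/16 + 10), about 0.006
```

The sensitivity is just the precision weight the prior receives, so it shrinks as the data grow or the prior widens.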

The following little experiment was inspired by a visualization of sensitivity Duco Veen was showing me when he was here to illustrate incompatible priors and their effect on concentration of measure.

I simulate 10 observations

y[1], ..., y[10] ~ normal(0, 1),

and then used a model

y ~ normal(mu, sigma),

with priors

sigma ~ normal(0, 10)
mu ~ normal(mu_mu, sigma_mu)

Here’s a quick table; it should really be a heat map with a line at the informativeness boundary (actually multiple heat maps with varying sizes of data).

prior             posterior (approx)     informative
----------------  -------------------    -----------
normal(0, 0.1)    normal( 0,    0.09)    *
normal(0, 1)      normal(-0.02, 0.34)    *
normal(0, 2)      normal(-0.02, 0.35)    *
normal(0, 3)      normal(-0.02, 0.36)    *
normal(0, 4)      normal(-0.02, 0.36)
normal(0, 10)     normal(-0.02, 0.36)
normal(0, 100)    normal(-0.02, 0.36)
normal(0, 4)      normal(-0.02, 0.36)
normal(1, 4)      normal(-0.02, 0.36)
normal(5, 4)      normal( 0.02, 0.37)
normal(10, 4)     normal( 0.06, 0.37)
normal(15, 5)     normal( 0.10, 0.38)
normal(20, 4)     normal( 0.15, 0.41)    *
normal(100, 4)    normal(94,    4)       *

The posterior isn’t very sensitive to the prior scale between normal(0, 1) and normal(0, 100), but the prior only stops being considered informative at normal(0, 4).

On the other hand, the posterior is pretty sensitive to the prior location between normal(0, 4) and normal(15, 4), but the prior is only considered informative again at normal(20, 4).

When do we consider a prior not compatible with the data? With normal(10, 4), the posterior mean has a z-score of 2.5, but with normal(20, 4), it’s a z-score of 5, which is getting out there, whereas clearly with normal(100, 4), we’re beyond any notion of prior compatibility.
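Those z-scores are just the distance from the posterior mean to the prior mean in prior sd units, with the posterior means read off the table above:

```python
# z-score of the posterior mean under the prior: a crude
# prior-data compatibility check.  Posterior means 0.06 and 0.15
# are taken from the table above.
zs = [abs(0.06 - 10) / 4, abs(0.15 - 20) / 4]
print([round(z, 1) for z in zs])  # [2.5, 5.0]
```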

Here’s the Stan code.

data {
  real mu_mu;
  real sigma_mu;
  int N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower = 0> sigma;
}
model {
  y ~ normal(mu, sigma);
  mu ~ normal(mu_mu, sigma_mu);
  sigma ~ normal(0, 10);
}

Here’s the driver code.

library(rstan)
model <- stan_model('normal_sens.stan')
N <- 10
y <- rnorm(N)
prior_test <- function(model, y, sigma_mu, mu_mu) {
  data <- list(N = length(y), y = y, mu_mu = mu_mu, sigma_mu = sigma_mu)
  fit <- sampling(model, data, iter = 1e5, refresh = 0)
  print("", quote = FALSE)
  print(sprintf("sigma_mu = %10.5f  mu_mu = %10.5f", sigma_mu, mu_mu),
        quote = FALSE)
  print(fit, probs = c(), digits = 4, pars = c("mu", "sigma"))
}
for (log10_sigma_mu in -2:2)
  prior_test(model, y, mu_mu = 0, sigma_mu = 10^log10_sigma_mu)
for (log10_mu_mu in 0:4)
  prior_test(model, y, mu_mu = 10^log10_mu_mu, sigma_mu = 4)

I should've used seeds, but in lieu of that, here's my simulated y,

> y
 [1] -2.0875598  1.2460557  0.4382849  0.4323869  0.4206121 -0.3414313
 [7] -0.1056257 -0.1234214  0.8128717 -0.9181756

PS… it was the first thing that came up when I googled “testing for the informativeness of a prior”