## “Do you have any recommendations for useful priors when datasets are small?”

A statistician who works in the pharmaceutical industry writes:

I just read your paper (with Dan Simpson and Mike Betancourt) “The Prior Can Often Only Be Understood in the Context of the Likelihood” and I find it refreshing to read that “the practical utility of a prior distribution within a given analysis then depends critically on both how it interacts with the assumed probability model for the data in the context of the actual data that are observed.” I also welcome your comment about the importance of the “data generating mechanism” because, for me, it is akin to selecting the “appropriate” distribution for a given response. I always make the point to the people I’m working with that we need to consider the clinical, scientific, physical and engineering principles governing the underlying phenomenon that generates the data; e.g., forces are positive quantities, particles are counts, yield is bounded between 0 and 1.

You also talk about the “big data, small signal revolution.” In industry, however, we face the opposite problem: our datasets are usually quite small. We may have a new product, for which we want to make some claims, and we may have only 4 observations. I do not consider myself a Bayesian, but I do believe that Bayesian methods can be very helpful in industrial situations. I also read your Prior Choice Recommendations [see also discussion here - AG] but did not find anything specific about small sample sizes. Do you have any recommendations for useful priors when datasets are small?

When datasets are small, and when data are noisier, that’s when priors are more important. When in doubt, I think the way to explore anything in statistics, including priors, is through fake data simulation, which in this case will give you a sense of what is implied, in terms of potential patterns in data, from any particular set of prior assumptions. Typically we set priors to be too weak, and this can be seen in replicated data that include extreme and implausible results.
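As a rough illustration of that last point, here is a minimal prior predictive sketch in Python. The model (four normal observations), both sets of priors, and the plausibility bound of 10 are assumptions made up for this example, not anything from the correspondent's problem.

```python
import random

def prior_predictive(mu_sd, sigma_scale, n_obs=4, n_sims=2000, seed=1):
    """Draw n_sims fake datasets of size n_obs from the prior predictive.

    Hypothetical model: mu ~ Normal(0, mu_sd),
    sigma ~ HalfNormal(sigma_scale), y_i ~ Normal(mu, sigma).
    """
    rng = random.Random(seed)
    datasets = []
    for _ in range(n_sims):
        mu = rng.gauss(0.0, mu_sd)
        sigma = abs(rng.gauss(0.0, sigma_scale))
        datasets.append([rng.gauss(mu, sigma) for _ in range(n_obs)])
    return datasets

def share_implausible(datasets, bound):
    """Fraction of fake datasets containing any |y| larger than bound."""
    hits = sum(1 for ys in datasets if max(abs(y) for y in ys) > bound)
    return hits / len(datasets)

# Suppose subject-matter knowledge says |y| > 10 essentially never happens.
BOUND = 10.0
weak = prior_predictive(mu_sd=100.0, sigma_scale=100.0)
tight = prior_predictive(mu_sd=1.0, sigma_scale=1.0)
print("weak priors, share of implausible datasets: ", share_implausible(weak, BOUND))
print("tight priors, share of implausible datasets:", share_implausible(tight, BOUND))
```

If most replicated datasets land outside what you would ever expect to observe, the prior is weaker than your actual knowledge.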

1. jd says:

Regarding fake data simulation for looking at priors, I was going through the code that Jonah Gabry posts here https://github.com/jgabry/bayes-vis-paper/blob/master/bayes-vis.R#L158 for the “Visualization in Bayesian workflow” paper that y’all wrote. When doing the simulation of fake data, it appears that the graphs are made by simulating the parameters (the taus, the betas, and sigma) from the priors a single time (for example, beta0 <- rnorm(1, 0, 100)), and then generating all the fake data by including rnorm(Nsim, mean=0, sd=sigma) in the generative model. I was wondering if this whole process (i.e., generating model parameters from priors) should actually be done many times, because each time you simulate new model parameters, you get completely different data, especially if you are simulating from vague priors. In case I haven’t explained this well: each time you run the code from line 158 to line 184, you get a different graph because different model parameters are simulated from the priors. Shouldn’t all of these data be included in the graphs to get an idea of just what these priors might produce?

• Andrew says:

Jd:

Yes, you’re right. The general idea is described here. But even simulating once from the model can give a lot of insight in many examples. So we generally recommend simulating once, as a way to understand the model and capture gross problems. You can simulate many times for more systematic checks.
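To make the once-versus-many distinction concrete, here is a small Python sketch. It is not the code from the linked R script: the intercept-only model and the half-normal prior on sigma are simplifying assumptions, and only the normal(0, 100) prior on beta0 is taken from the comment above.

```python
import random
import statistics

def simulate_once(rng, n_sim=100):
    """One prior draw of the parameters, then n_sim fake observations.

    Follows the pattern beta0 <- rnorm(1, 0, 100) and then
    rnorm(Nsim, mean = 0, sd = sigma); the prior on sigma is assumed.
    """
    beta0 = rng.gauss(0.0, 100.0)
    sigma = abs(rng.gauss(0.0, 10.0))
    return [beta0 + rng.gauss(0.0, sigma) for _ in range(n_sim)]

rng = random.Random(2024)

# A single simulation: often enough to spot gross problems.
one = simulate_once(rng)
print("single run, mean of fake data:", round(statistics.mean(one), 1))

# Many simulations: each run redraws the parameters from the prior,
# so the spread across runs reflects the prior itself.
means = [statistics.mean(simulate_once(rng)) for _ in range(500)]
print("sd of dataset means across runs:", round(statistics.pstdev(means), 1))
```

The single run shows what one plausible world looks like under the prior; the repeated runs show how different those worlds can be from each other.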

• MT says:

I’m a fan of using brms (as an api to backend stan) for quick prototyping of prior predictive simulations

here’s a note where a simple brms option toggle lets you flip from prior to posterior predictive draws

https://ucla.box.com/v/brms-prior-post-pred

the beginning is a toy example whereas page 11 and 12 show a more relevant example

2. Justin Smith says:

Are there examples of the likelihood swamping the prior when n is small? If not, I’d be worried with small n that Bayesian could become something like a ‘Drake Equation’ type of thing, where one can get about any posterior output based on their prior inputs. In that case, I’d want to see more defense of the agreed upon prior and a sensitivity analysis on the priors.

(similar thing applies to frequentist approaches. I’m not sure any method is too great for small n)

• Hjulmen Wulmen says:

The flatter the prior is relative to the likelihood function, the less of an effect it will have. With a completely uniform prior the only thing affecting the posterior distribution is the likelihood function. The less data there is, the flatter the likelihood function will be, diminishing its chances of swamping any prior that isn’t extremely flat.
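A quick conjugate example may help here. This sketch uses a binomial likelihood with Beta priors; the two priors and the data (an observed proportion of 0.5 at n = 4 and at n = 400) are made up for illustration.

```python
def posterior_mean(successes, n, a, b):
    """Posterior mean of theta under a Beta(a, b) prior and binomial data."""
    return (a + successes) / (a + b + n)

priors = {
    "flat Beta(1, 1)": (1.0, 1.0),
    "informative Beta(2, 18)": (2.0, 18.0),  # prior mean 0.1
}

for n in (4, 400):
    successes = n // 2  # observed proportion is 0.5 in both cases
    for name, (a, b) in priors.items():
        print(n, name, round(posterior_mean(successes, n, a, b), 3))
```

With n = 4 the two priors give posterior means of 0.5 and about 0.167; with n = 400 they give 0.5 and about 0.481, so the likelihood has nearly swamped the informative prior.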

But I don’t really get why the likelihood not swamping the prior should be worrying. If there isn’t enough data to inform us, isn’t it only natural that it shouldn’t affect our prior attitudes if that’s all we have? Of course then the inferences are contingent on the–shudder–subjective prior, but I reckon this is exactly why they should be made explicit. I’m not sure what “agreed upon” priors would be, since they depend on the situation (e.g., prior information on the specific subject) and on the model specification/parameterization.

I think it is important to remember that any statistical modeling–be it frequentist or Bayesian–is contingent upon layers of subjective choices, and there is nothing inherently wrong with that, as long as this is recognized and made as explicit as practically possible. You could be as worried about how one can get any p-value they ever want by defining a null model that just gives them the results they want–in my view, there is nothing inherently more “objective” in the null models as they are currently used.

• In a Bayesian setting, uniformity isn’t as easy as one might think, as it is going to depend on the scale of the variable.

For example, suppose we have a uniform prior on a chance of success parameter, e.g.,

$latex \theta \sim \mathsf{Uniform}(0, 1).$

Then if we transform $latex \theta$ to the log odds scale,

$latex \phi = \mathrm{logit}(\theta) = \log \frac{\theta}{1 - \theta},$

we wind up with a non-uniform prior for the log odds,

$latex \phi \sim \mathsf{Logistic}(0, 1).$
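This can be checked by simulation. The sketch below (plain Python, with an arbitrary seed and sample size) draws uniform values, applies the logit, and compares the empirical distribution to the standard logistic CDF, $latex 1 / (1 + e^{-\phi})$.

```python
import math
import random

rng = random.Random(0)

def logit(theta):
    return math.log(theta / (1.0 - theta))

def logistic_cdf(phi):
    """CDF of the standard Logistic(0, 1) distribution."""
    return 1.0 / (1.0 + math.exp(-phi))

# Draw theta uniformly on (0, 1) and move to the log-odds scale.
draws = []
while len(draws) < 20000:
    theta = rng.random()
    if 0.0 < theta < 1.0:  # rng.random() can return exactly 0.0
        draws.append(logit(theta))

# Empirical P(phi <= q) next to the logistic CDF at a few points.
for q in (-2.0, 0.0, 2.0):
    empirical = sum(1 for phi in draws if phi <= q) / len(draws)
    print(q, round(empirical, 3), round(logistic_cdf(q), 3))
```

The two columns should agree up to simulation noise, and the draws pile up near zero on the log-odds scale instead of spreading out evenly.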

You also have to be careful with positive constrained parameters where those wide proper priors aren’t uniform. They have a tendency to pull posterior mean estimates away from zero. So the intuitions from penalized maximum likelihood (where flat priors do very little) don’t carry over to Bayesian posterior means, which try to blend the prior and the data. If the prior says the value might be 1000, that’ll be taken into account in the posterior.

For example, consider this Stan program which uses BUGS-style wide proper priors on the mean and scale parameter of a normal distribution. There are three observations, -1, 0, and +1, coded in by hand.

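parameters {
  real<lower=0> sigma;  // scale parameter, constrained to be positive
  real mu;
}
model {
  sigma ~ inv_gamma(1e-4, 1e-4);  // BUGS-style wide proper prior
  mu ~ normal(0, 1e4);            // BUGS-style wide proper prior
  [ -1, 0.0, 1.0 ] ~ normal(mu, sigma);  // three observations, coded by hand
}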

      mean se_mean  sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
sigma  1.7       0 2.1  0.5  0.8  1.2  1.8   6.1  4529    1
mu     0.0       0 1.4 -2.5 -0.5  0.0  0.5   2.4  4773    1


The posterior mean for sigma is 1.7 and the posterior 95% interval is (0.5, 6.1). The posterior median is still only 1.2. The maximum likelihood estimate (no priors) for the sd is 0.82, the square root of the variance estimate 0.67 (divides by N); the sample standard deviation based on the unbiased variance estimate is 1.0 (divides by N – 1). These are not what you get from the Bayesian posterior with a wide proper prior such as $latex \mathsf{InvGamma}(0.0001, 0.0001)$.
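For reference, the non-Bayesian point estimates for the three observations can be computed directly. This is a small Python sketch using only the standard library; it reports both the variance and its square root, since the two are easy to mix up.

```python
import math
import statistics

y = [-1.0, 0.0, 1.0]

var_mle = statistics.pvariance(y)      # divides by N
var_unbiased = statistics.variance(y)  # divides by N - 1

print("variance, divide by N:    ", round(var_mle, 2))
print("sd, divide by N:          ", round(math.sqrt(var_mle), 2))
print("variance, divide by N - 1:", round(var_unbiased, 2))
print("sd, divide by N - 1:      ", round(math.sqrt(var_unbiased), 2))
```

The divide-by-N variance is 2/3, so the corresponding sd is about 0.82; the divide-by-(N - 1) variance and sd are both 1.0. Either way, these sit below the posterior median of 1.2 and well below the posterior mean of 1.7.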

• Justin Smith says:

“But I don’t really get why likelihood not swamping the prior should be worrying. If there isn’t enough data to inform us, isn’t it it only natural that it shouldn’t affect our prior attitudes if that’s all we have? Of course then the inferences are contingent on the–shudder–subjective prior, but I reckon this is exactly why they should be made explicit.”

Because, IMO, actual data from a well-designed experiment is more important than prior attitudes/beliefs that can dramatically affect an outcome.

“I think it is important to remember that any statistical modeling–be it frequentist or Bayesian–is contingent upon layers of subjective choices,…”

I agree, but also believe that priors with hyperparameters can add even more subjectivity, more levers to pull to ‘hack’.

• Chris Wilson says:

“Because, IMO, actual data from a well-designed experiment is more important than prior attitudes/beliefs that can dramatically affect an outcome.”
Yes, but you are specifically worried about the situation of small-n data, no? The safeguard here *is* the prior – at least an explicitly specified one. One goal for prior specification is regularization, which is almost always a good thing in order to avoid being jerked around by noisy, small-sample data.

“I agree, but also believe that priors with hyperparameters can add even more subjectivity, more levers to pull to ‘hack’”

Depends on how you look at it. The maximum likelihood estimate is (more or less, usually) equivalent to a Bayesian MAP with a flat prior. What that implies is that all values of theta are equally plausible, apart from the information in your small (potentially noisy) sample. This is a disastrous assumption in many cases.
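Here is a small sketch of that equivalence for a made-up case: 4 successes in 4 trials. The Beta(3, 3) prior is just one hypothetical regularizing choice, and the posterior mode is found by brute-force grid search so that no closed-form formula has to be trusted.

```python
import math

def log_posterior(theta, successes, n, a, b):
    """Unnormalized log posterior: binomial likelihood times Beta(a, b) prior."""
    return ((successes + a - 1.0) * math.log(theta)
            + (n - successes + b - 1.0) * math.log(1.0 - theta))

def grid_map(successes, n, a, b, points=9999):
    """Posterior mode found by search over an interior grid on (0, 1)."""
    grid = [(i + 1) / (points + 1) for i in range(points)]
    return max(grid, key=lambda t: log_posterior(t, successes, n, a, b))

successes, n = 4, 4
print("MLE:                  ", successes / n)
print("MAP, flat Beta(1, 1): ", grid_map(successes, n, 1.0, 1.0))
print("MAP, Beta(3, 3) prior:", grid_map(successes, n, 3.0, 3.0))
```

With the flat prior the mode runs to the edge of the grid, matching the MLE of 1.0, which amounts to claiming the product never fails on the strength of four observations. The Beta(3, 3) prior pulls the estimate back to about 0.75.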

In some sense, this whole discussion about Bayes is like the trolley problem in moral philosophy. Choosing *NOT* to pull the lever is still a choice, and the consequences can be very bad.

• Hjulmen Wulmen says:

No one is arguing against data informing our beliefs. As others have pointed out to you, if there is only a limited amount of data, it is important to understand its limitations: if there isn’t enough data to inform us, then that should be admitted and made explicit. There is no way around that.

• A useful prior is one that incorporates real information. So yes, you should justify in some way why you chose your prior, and what information you used. And fake data simulation is a good way to tune the prior, because often we know more about what the consequences of the prior should be for data than we do about the prior directly.

• Chris Wilson says:

Justin, for simple cases like estimating the mean of a normal distribution, you can readily check from the analytical expressions how fast the posterior converges to the “true” value (and to the MLE, for that matter). So yes, Bayesian posterior inference is always relative to the prior base measure. Sensitivity to prior specification can be checked, and is a good way of working with these situations IMO. Think of it as “if I go into this analysis thinking that theta is around mu, with a plausible range of (mu - e, mu + e), then here is the inference for theta with these small n data”.
I believe Andrew has said in various places that the “best” statistical method is the one that best utilizes the most information. There is no magic in the Bayesian formalism from that point of view – it is just a coherent, comprehensive approach that solves a lot of problems, but still subject to GIGO like everything else…
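The normal-mean case mentioned above is easy to sketch. The code below uses the standard conjugate result for a normal mean with known data sd; the sample mean of 2.0, the data sd of 1.0, and the three candidate priors are all hypothetical numbers chosen for illustration.

```python
import math

def normal_posterior(ybar, n, sigma, prior_mean, prior_sd):
    """Posterior mean and sd for a normal mean with known data sd sigma."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = n / sigma ** 2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * ybar) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)

# Sensitivity check: same data summary, three priors centered at 0
# with increasingly wide plausible ranges.
priors = [(0.0, 0.5), (0.0, 2.0), (0.0, 10.0)]
for n in (4, 40, 400):
    for prior_mean, prior_sd in priors:
        mean, sd = normal_posterior(2.0, n, 1.0, prior_mean, prior_sd)
        print(n, prior_sd, round(mean, 3), round(sd, 3))
```

At n = 4 the tightest prior pulls the posterior mean to 1.0, halfway between the prior mean and the sample mean; by n = 400 all three priors give a posterior mean within about 0.02 of the sample mean.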

• Anoneuoid says:

“I’d be worried with small n that Bayesian could become something like a ‘Drake Equation’ type of thing, where one can get about any posterior output based on their prior inputs…I’m not sure any method is too great for small n”

Any method that accurately conveys the uncertainty is great. There is nothing wrong happening in your Drake equation example.

3. Dalton says:

I love and heavily utilize the practice of simulating data in my analytical workflow. But I really hate the term “fake data simulation.” “Fake” is simply too loaded a term. The data generated through simulation of the data generating process we hypothesize are not fake. They are simulated. There is substance to them, even if they are not “real”.

• Andrew says:

Dalton,

We sometimes use the term “pretend data.”

• Dalton says:

Why not just “simulated data” and “data simulation”? “Fake” and “pretend” both have a connotation of being fanciful. As opposed to “simulated” which implies there is more of a process to generating the data than just “pretending” or “faking” it.

Let me put it this way: I’d be much more comfortable getting on a flight with a first-time pilot who trained on a flight simulator rather than a first-time pilot who just pretended they were flying a bunch of times.

• Dalton says:

“Fake data simulation” is also a bit redundant. Is there such a thing as “real data simulation”?

Unfortunately, we do know there is such a thing as “fake data collection” and that it differs substantially in ethics and effort from the process of “real data collection.”

• I agree and avoid the term “fake data”. I’m writing a whole book about simulation and I’m never going to use the term.

I wonder if Andrew’s just trying to make it super clear to everyone it’s not “real” data.

• Keith O’Rourke says:

I repeatedly found that with non-statisticians at some point you need to stress generating “fake data” or “pretend data” for them to get what was being done and why. The word simulation just did not do it on its own.

p.s. in clinical research simulated patients are actors pretending to be real patients – truly fake patients.