There are two anti-patterns^{*} for prior specification in Stan programs that can be traced directly to idioms developed for BUGS. One is the diffuse inverse gamma prior that Andrew’s already written about at length. The second is interval-based priors. Which brings us to today’s post.

**Interval priors**

An interval prior is something like this in Stan (and in standard mathematical notation):

sigma ~ uniform(0.1, 2);

In Stan, such a prior presupposes that the parameter `sigma` is declared with the same bounds:

real<lower=0.1, upper=2> sigma;

We see a lot of examples where users either don’t know or don’t remember to constrain `sigma`. It’s impossible to infer bounds in general in Stan because of its underlying Turing-complete imperative programming language component: the bounds might be computed in a function call. BUGS’s restriction to directed graphical models lets it infer the bounds (at runtime).

**Computational problems**

Suppose the true value for `sigma` lies somewhere near one of the boundaries. The boundaries are mapped to the unconstrained scale using a log-odds (aka logit) transform [thanks to Vadim Kantorov for the correction to the first version]:

sigma = 0.1 + 1.9 * inv_logit(sigma_free)
sigma_free = logit((sigma - 0.1) / 1.9)

where `logit(u) = log(u / (1 - u))` and `inv_logit` is its inverse. Stan actually works on the unconstrained space, sampling `sigma_free` and producing `sigma` by the inverse transform based on the declared constraints.
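The transform is easy to check numerically. Here is a minimal Python sketch of the constraining and unconstraining maps for the (0.1, 2) bounds above (an illustration, not Stan’s actual implementation; the function names are made up):

```python
import math

def logit(u):
    return math.log(u / (1.0 - u))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def unconstrain(sigma, lo=0.1, hi=2.0):
    # sigma_free = logit((sigma - lo) / (hi - lo))
    return logit((sigma - lo) / (hi - lo))

def constrain(sigma_free, lo=0.1, hi=2.0):
    # sigma = lo + (hi - lo) * inv_logit(sigma_free)
    return lo + (hi - lo) * inv_logit(sigma_free)

# the round trip recovers sigma in the interior of the interval
assert abs(constrain(unconstrain(0.5)) - 0.5) < 1e-12

# sigma_free diverges as sigma approaches the upper boundary
for sigma in (1.9, 1.99, 1.999, 1.9999):
    print(f"sigma = {sigma}: sigma_free = {unconstrain(sigma):.2f}")
```

Each extra decimal of closeness to the boundary adds roughly log(10) to `sigma_free`, so the unconstrained values march off toward infinity.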

When `sigma` approaches one of the boundaries, `sigma_free` moves toward positive or negative infinity.

This leads to computational difficulty in setting step sizes if the possible values include both values near the boundary and values even a little bit away from the boundary. We need very large step sizes on the unconstrained scale to move near the boundary and relatively small step sizes to move elsewhere. Euclidean Hamiltonian Monte Carlo, as used in Stan, fixes a single step size (relaxing that assumption is very challenging, and jittering the step size a bit rarely helps).
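One way to see the step size problem concretely: a fixed step on the unconstrained scale corresponds to wildly different moves on the constrained scale depending on how close `sigma` is to the boundary. A small Python sketch, again using the (0.1, 2) bounds (illustrative helper names, not Stan code):

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def constrain(sigma_free, lo=0.1, hi=2.0):
    # map from the unconstrained scale back to (lo, hi)
    return lo + (hi - lo) * inv_logit(sigma_free)

# a unit step on the unconstrained scale moves sigma by very
# different amounts depending on the starting point
for x in (0.0, 3.0, 6.0, 9.0):
    move = constrain(x + 1.0) - constrain(x)
    print(f"sigma_free = {x}: unit step moves sigma by {move:.5f}")
```

Near the middle of the interval a unit step moves `sigma` by nearly half the interval; deep in the tail the same step moves it by a tiny fraction of that, which is exactly what a single fixed step size cannot accommodate.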

**Statistical problems**

The more worrying problem is statistical. These interval constraints are often imposed on the assumption that the model author knows the value lies in the interval. Suppose the true value lies near the upper end of the interval (0.1, 2). Then any posterior mass that would fall above 2 if the prior were uniform on, say, (0.1, 5) gets pushed below 2. This reduces the posterior uncertainty and biases the mean estimate downward relative to the wider prior.

**Illustration**

A simple Stan program exemplifies the problem:

data {
  real L;
  real<lower=L> U;
}
parameters {
  real<lower=L, upper=U> y;
}
model {
  y ~ normal(0, 1);
}

The parameter `y` is given lower-bound and upper-bound constraints where it is declared in the parameters block, then given a standard normal distribution in the model block. Without the interval constraint, `y` would clearly have a standard normal posterior.

Now let’s fit it with a very wide interval for bounds, say (-100, 100).

> fit1 <- stan("interval.stan", data = list(L = -100, U = 100), iter = 10000)
> probs <- c(pnorm(-2:0))
> probs
[1] 0.02275013 0.15865525 0.50000000
> print(fit1, probs = probs)
  mean se_mean   sd 2.275013% 15.86553%  50% n_eff Rhat
y 0.00    0.01 0.99     -2.00     -1.00 0.00  7071    1

The `pnorm()` function is the cumulative distribution function for the standard normal (location zero, unit scale). So the values of `probs` are the cumulative probabilities at -2, -1, and 0 (roughly 0.023, 0.16, and 0.50).
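For readers without R handy, the same probabilities can be computed in Python with the error function (a sketch added for illustration, not part of the original session):

```python
import math

def pnorm(x):
    # standard normal CDF, analogous to R's pnorm()
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

probs = [pnorm(x) for x in (-2, -1, 0)]
print(probs)  # approximately [0.02275, 0.15866, 0.5]
```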

The posterior appears to be standard normal, with Stan recovering the quantiles corresponding to -2, -1, and 0 values in the true distribution to within two decimal places (about the accuracy we expect here given the standard error report of 0.01).

What happens if we instead provide an interval prior with tighter bounds that are asymmetric around the mean of zero, say uniform(-1, 2)? Let’s see.

> fit2 <- stan("interval.stan", data = list(L = -1, U = 2), iter = 10000)
> print(fit2, probs = probs)
  mean se_mean   sd 2.275013% 15.86553%  50% n_eff Rhat
y 0.23    0.01 0.73     -0.93     -0.56 0.17  7224    1

Now the posterior mean is estimated as 0.23 and the posterior median as 0.17. That’s not standard normal. What happened? The posterior is not only no longer symmetric (mean and median differ), it’s no longer centered around 0.

Even though we know the “true” posterior mean would be zero without the constraint, adding an interval constraint (-1, 2) modifies the posterior so that it is not symmetric, has a higher mean, and a lower standard deviation.

If we had chosen (-1, 1), the posterior would be symmetric, the posterior mean would still be zero, but the posterior standard deviation would be lower than with the (-100, 100) uniform prior.
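These MCMC results can be checked against the closed-form moments of a truncated standard normal: with Z = Phi(b) - Phi(a), the mean is (phi(a) - phi(b)) / Z and the variance is 1 + (a phi(a) - b phi(b)) / Z - mean^2. A quick Python check (assuming nothing beyond the standard formulas; the function names are made up):

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncated_moments(a, b):
    # closed-form mean and sd of a standard normal truncated to (a, b)
    Z = Phi(b) - Phi(a)
    mean = (phi(a) - phi(b)) / Z
    var = 1.0 + (a * phi(a) - b * phi(b)) / Z - mean * mean
    return mean, math.sqrt(var)

for a, b in [(-100, 100), (-1, 2), (-1, 1)]:
    mean, sd = truncated_moments(a, b)
    print(f"({a}, {b}): mean = {mean:.2f}, sd = {sd:.2f}")
```

This reproduces the fitted posterior mean of 0.23 for the (-1, 2) interval exactly, with an sd near 0.72, consistent with the simulated 0.73 up to MCMC error, and confirms that (-1, 1) keeps a zero mean while shrinking the sd.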

**The take home message**

The whole posterior matters when computing the posterior mean, the posterior variance, and posterior intervals. Imposing narrow uniform priors on an interval can bias estimates relative to wider interval priors.

The lesson is that uniform priors are dangerous if any posterior mass would extend past the boundaries if a wider uniform interval were used.

If you want a wide uniform prior, you can just use an improper uniform prior in Stan (as long as the posterior is proper).

If you think diffuse inverse gamma priors are the answer, that was the other anti-pattern I alluded to earlier. It’s described in Andrew’s paper Prior distributions for variance parameters in hierarchical models (published as a comment on another paper!) and in *BDA3*.

**But wait, there’s more**

If you want more advice from the Stan dev team on priors, check out our wiki page.

Or you can wait a few years for Andrew and Aki to consolidate it all into *BDA4*.

^{*} The Wikipedia page on anti-patterns requires “two key elements” of an anti-pattern:

- A commonly used process, structure, or pattern of action that despite initially appearing to be an appropriate and effective response to a problem, has more bad consequences than good ones.
- Another solution exists that is documented, repeatable, and proven to be effective.

Check and check.

YESSSSSSSSSSSSS!!!!! We had an example of this in an early version of the PC prior paper. The right hand side of Figure 2 here https://arxiv.org/pdf/1403.4630v1.pdf shows the uniform priors on the intervals [2,20], [2,50], and [2,100] converted to a prior parameterized by “distance to normality” (zero corresponds to a normal distribution, ie nu=infinity). You can see because of the weird structure of the parameter space that as you broaden out the interval (to nominally make it “less informative”), you are actually making it have more mass on approximately normal models.

You do occasionally get modelling situations where a parameter is non-negative on physical grounds, and where there’s appreciable prior probability both that it’s effectively zero and that it’s a bit away from zero, but not that it’s very large.

The one I’ve come across is source apportionment for particulate air pollution, where your data Y are mass concentrations of p specific ‘species’ (in the simplest cases, elements, so micrograms of arsenic in particles per cubic metre of air) and your model is that Y~FG+e where F are percentage concentrations of each species in k sources (eg, % arsenic in wood smoke) and G are mass concentrations of the sources (micrograms of wood smoke per cubic metre), and e are measurement errors with approximately known distribution. The constraint that all the entries of F and G are non-negative is important; it’s where any identification in the model comes from. And some species are essentially absent from some sources, and some sources are essentially absent at some times, which is again critical to F and G being even partly identifiable.

I don’t think there’s a simple solution here, but fortunately it’s a fairly unusual problem.

The scale of a normal distribution is a typical example. The example you give has the mass clustering around zero. That’s indeed a challenging situation that Andrew considers in his series of papers on priors (like the zero-avoiding prior papers with Vince Dorie for MLE and his own papers on Bayesian priors). If I recall, the trick is to get priors that avoid zero, but don’t converge to zero too quickly—that is, they can be consistent with small values.

If you believe the mass can be exactly zero, you’ll have a slightly different computational problem. At the boundaries, derivatives are going to be infinite or not-a-number, and the proposal will thus be rejected.

I don’t like the zero-avoiding priors as priors (I’m fine with them as penalties, which is how they were used in the original paper).

+1

This type of situation also happens when you’re modelling a sparse signal. In that case the local variance parameter (which is a nuisance parameter) is often very nearly zero (if the signal is sparse, this is true for most values) but sometimes either intermediate or large. These problems have a lot of computational issues, but I guess that some of them will spring from that positivity preserving transformation (a log transform iirc).

I know there are variants of HMC for bounded parameter spaces (that do sensible things like reflecting the trajectory as it hits the boundary), which might do better for these types of situations.

Just wanted to say that I really appreciate these instructional blog posts, great work!

If you believe that a parameter is constrained to an interval, and equally likely to be anywhere in that interval, then I see nothing wrong with using a uniform prior. Indeed, it would be wrong not to use the uniform prior.

However, there may be a problem with people using uniform priors when they don’t actually believe them, because… Well, I don’t know why. We’re talking here about a one-dimensional parameter. If necessary, you can just sketch a density function by hand, scan your drawing, and after a bit of smoothing and normalization, use this completely custom representation of your prior belief. So why would you use something drastically wrong…?

As far as the computational issue, I think keeping the original parameterization and bouncing off the boundaries might work better than a logistic transformation. Supposing you are transforming, though, you probably should do more than “jitter the step size a bit”. You need to randomly (or systematically) vary the step size by a lot. If you randomly choose amongst 10 step sizes, the penalty is at most a factor of 10, which is probably better than fixing a single (not always appropriate) step size. But you could reduce this penalty a lot using the “short-cut” method described in section 5.6 of my Hamiltonian MCMC review paper, http://www.mcmchandbook.net/HandbookChapter5.pdf

Radford:

Thanks for commenting! I have three responses:

1. I don’t generally think that a model is about belief, but, sure, if a uniform distribution is reasonable, I have no problem with it. For example, there are a lot of problems for which it can be reasonable to assign a uniform(0,1) prior distribution to a parameter that represents a probability. My problem (and Bob’s) is with uniform distributions applied when there is no logical or physical constraint. For example, if you have a parameter that you want to say is not going to be too large in absolute value, I don’t recommend a uniform(-10,10) prior; instead I’d prefer something like normal(0,5), because soft constraints generally make more statistical sense and also make the computation smoother (recall the folk theorem).

2. Another problem can arise because other aspects of a model are misspecified. You might think a certain coefficient just has to be positive, but it could be negative in the data because of a hidden interaction. For example, if you raise the price on an item, sales should go down, but price can be correlated with quality, and if you have observational data and quality is not appropriately accounted for in the model, you can end up with the “wrong” sign of the coefficient of price on sales. For this sort of reason I’m wary of hard constraints.

3. With HMC, it can make sense to bounce off boundaries. We didn’t put this bouncing into Stan because it complicates the algorithm, especially in multidimensional problems.

For the most part I think people treat priors (and even model specification) as hyperparameters that need to be tuned in order to get something done in a reasonable time. They are commonly left to sane defaults, at least at first, then become subject to the dark arts. The uniform prior is just a “sane default”.

The main problem I see on the Stan forums is the use of these priors as defaults with a guess about the bounds that isn’t generous enough to do what they intend. I suspect modelers are bringing intuitions formed from maximum likelihood to Bayesian stats. With an MLE, as long as the interval constraint contains the value you’d get without it, there’s no effect. I just wanted to point out this isn’t true in a Bayesian setting.

The deeper issue is that someone’s prior knowledge is rarely in the form of a uniform interval constraint. In my experience, scientists usually know the rough range in which parameters are going to fall and say things like they expect a value to be between -5 and 5. When prodded, they’ll usually say they think it’ll be around 1 and would be very surprised if it were 10. Then they’ll put down a uniform(-5, 5) prior, and it winds up having an effect other than what they intended, namely producing a different estimate than a uniform(-10, 10) prior would.

Bouncing might work better. We haven’t tried it in Stan. A simple hand-tooled comparison probably wouldn’t be too hard (where you hand code which parameters have which bounds in the solver rather than plumbing that through the whole framework). We also haven’t tried other alternative parameterizations. These would make great computational projects.

We also haven’t tried any kind of step-size variation other than jitter, which didn’t help with anything we tried. With Stan, a step size of 1/x does x times as much work to get to a U-turn point (at least until it hits the maximum tree depth, at which point you begin losing the advantage of HMC as you haven’t moved as far as you could have from the starting point).

Hello,

In the equation “sigma_free = 0.1 + 1.9 * logit(sigma), where logit(u) = log(u / (1 – u))”, if you set sigma := 2, you get logit(sigma) = log(2 / (1 – 2)), which blows up the log.

I should’ve mentioned that. It should return positive infinity, but will return not-a-number as written. Our actual transforms are better at this, I think. In any case, getting the “correct” value of positive infinity won’t help: the round trip works out OK, but then the derivatives blow up. So you can’t actually get estimates right on the boundary other than through rounding or overflow/underflow.
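The derivative blow-up is easy to see numerically. Here is a small Python sketch of d sigma_free / d sigma for the (0.1, 2) bounds from the post (an illustration with a made-up function name, not Stan’s actual code):

```python
import math

def d_sigma_free_d_sigma(sigma, lo=0.1, hi=2.0):
    # derivative of logit((sigma - lo) / (hi - lo)) with respect to sigma:
    # 1 / (u * (1 - u) * (hi - lo)), written as a sum of reciprocals
    u = (sigma - lo) / (hi - lo)
    return (1.0 / u + 1.0 / (1.0 - u)) / (hi - lo)

# the derivative grows without bound as sigma approaches the boundary
for sigma in (1.0, 1.9, 1.99, 1.999):
    print(f"sigma = {sigma}: d sigma_free / d sigma = "
          f"{d_sigma_free_d_sigma(sigma):.1f}")
```

Each step closer to the boundary multiplies the derivative by roughly 10, which is why gradient-based samplers struggle right at the edge.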

Thanks for your comment. I was wondering if there was a typo in this expression. Should it not have been:

sigma = 0.1 + 1.9 * logistic(sigma_free)

sigma_free = logit((sigma – 0.1) / 1.9)

?

Thanks. I’ll fix it. If there’s a direction, sign, or numerator/denominator error to make, be sure I’ll make it.

Very interesting post!

I’ve been researching using Bayes to infer parameters for models of composite materials with hierarchical structure (we call these models multiscale instead of multilevel) from experimental data. There are a number of differences between this and many data applications since we have deterministic relationships between scales derived from physical principles as well as a pretty good understanding of the ballpark values of our parameters, but uniform priors over a specified interval tend to work pretty well if chosen carefully.

The standard practice for this type of inference is to use optimization to match the model to experimental data, where the bounds for each parameter are pre-specified, so the analogous prior is uniform over an interval. The optimization approach is riddled with pitfalls, since experimental data is expensive and therefore often limited, and Bayes resolves a lot of these pitfalls quite nicely. With uniform priors, however, an incorrect specification of the bounds remains a potential issue; maybe normal priors would be a more robust choice.

I didn’t read this post before responding to Radford, but should have, since I think you summarized what I was trying to say—most science and engineering starts out with some knowledge of basic scales of answers expected, but not necessarily a strong constraint on the specific value.

If you choose the bounds carefully, it shouldn’t matter to optimization. If you don’t, you should get some signal that the system’s trying to fit at a boundary (a no-no with optimization as things like the Laplace approximation for uncertainty break down at boundaries).

We’ve found normal priors to be more robust. They can also help guide computation slightly if the data isn’t very informative about values. If you go to the technically more robust priors like low-degree-of-freedom student-t, then you can get strong tail effects when the data isn’t informative, especially if you go all the way to a Cauchy. If you make the interval priors wide enough, they won’t have much effect at all.