## How about zero-excluding priors for hierarchical variance parameters to improve computation for full Bayesian inference?

So. For a while now we’ve moved away from the uniform (or, worse, inverse-gamma!) prior distributions for hierarchical variance parameters. We’ve done half-Cauchy, folded t, and other options; now we’re favoring unit half-normal.

We also have boundary-avoiding priors for point estimates, so that in 8-schools-type problems, the posterior mode won’t be zero. Something like the gamma(2) family keeps the mode away from zero while allowing it to be arbitrarily close to zero, depending on the likelihood. See here for details; I love this stuff.
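To make the boundary-avoiding behavior concrete, here is a minimal Python sketch of a gamma prior with shape 2 on the hierarchical scale parameter, written `tau` here. The rate value and the function names are hypothetical, picked only for illustration. The density is proportional to `tau * exp(-rate * tau)`, so it is exactly zero at `tau = 0` but rises linearly from there, which is why the mode can end up arbitrarily close to zero if the likelihood pulls it that way.

```python
import math

def gamma2_log_density(tau, rate):
    """Log density of a gamma(shape=2, rate) distribution at tau > 0."""
    # p(tau) = rate^2 * tau * exp(-rate * tau)
    return 2.0 * math.log(rate) + math.log(tau) - rate * tau

def gamma2_mode(rate):
    """Mode of gamma(shape=2, rate): (shape - 1) / rate = 1 / rate."""
    return 1.0 / rate

rate = 0.1  # hypothetical weak rate, chosen only for illustration

# The log density drops without bound as tau approaches zero,
# so a posterior mode at exactly zero is ruled out.
for tau in (1e-6, 1e-3, 1.0, 10.0):
    print(f"tau = {tau:g}: log prior = {gamma2_log_density(tau, rate):.3f}")

print("prior mode:", gamma2_mode(rate))
```

The penalty near zero is only logarithmic, so a strong enough likelihood can still push the posterior mode very close to the boundary.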

Until just now, I thought of these as two separate problems: you can use the zero-avoiding prior if your goal is point estimation, but if you want to use full Bayes there’s no need for zero-avoidance because you’re averaging over the posterior distribution anyway.

But then we were talking about the well-known problem of parameterization in hierarchical models (see for example section 5.2 of this paper, or, for a more modern, Stan-centered take, this discussion from Mike Betancourt). Sampling can be slow if you use the standard (“centered”) parameterization and there’s a lot of posterior mass near zero, in which case you’ll want to switch to the noncentered parameterization (as here, or you can just check the Stan manual).
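For readers who haven’t seen the two parameterizations side by side, here is a rough Python sketch (not Stan code, and not taken from the linked references). It assumes a simple normal hierarchical model with group effects `theta`, population mean `mu`, and scale `tau`; all names and numbers are hypothetical. The centered version works with `theta` directly, while the noncentered version works with standardized effects `eta` and recovers `theta = mu + tau * eta`.

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of normal(mean, sd) at x."""
    z = (x - mean) / sd
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * z * z

def centered_logp(theta, mu, tau):
    """Centered: theta_j ~ normal(mu, tau), so the density depends on tau."""
    return sum(normal_logpdf(t, mu, tau) for t in theta)

def noncentered_logp(eta):
    """Noncentered: eta_j ~ normal(0, 1), with no dependence on tau."""
    return sum(normal_logpdf(e, 0.0, 1.0) for e in eta)

def to_theta(eta, mu, tau):
    """Deterministic transform back to the group-level effects."""
    return [mu + tau * e for e in eta]

mu, tau = 4.0, 0.05            # hypothetical values; tau near zero is the hard case
eta = [-1.2, 0.3, 0.8, 2.0]    # hypothetical standardized effects
theta = to_theta(eta, mu, tau)

# The two log densities differ only by the Jacobian term len(eta) * log(tau).
# That term couples theta to tau in the centered form when tau is small.
diff = noncentered_logp(eta) - centered_logp(theta, mu, tau)
print(diff, len(eta) * math.log(tau))
```

The two forms describe the same model; what changes is the geometry the sampler has to move through, since in the centered form the spread of `theta` shrinks along with `tau`.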

The difficulty is that, depending on the data, sometimes the centered parameterization is slow and other times the noncentered parameterization is slow.

So, what about using a zero-excluding prior, even for full Bayes, just to get rid of the need for the noncentered parameterization? You can only really get this to work if you exclude zero for real (a weak gamma(2) prior won’t be enough). But in lots of problems, I’d guess most problems, we’d know enough about scale to do this. We’re in folk theorem territory.
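One way to put a number on “excluding zero for real” is to compare how much prior mass different choices leave below some small threshold. The Python sketch below does this for a unit half-normal, a gamma with shape 2, and a more concentrated gamma. The threshold `eps`, the rate values, and the gamma(10, 9) choice are all assumptions made for illustration, and the gamma is taken in the shape/rate convention. This only describes the priors; it says nothing by itself about how fast any sampler would run.

```python
import math

def half_normal_cdf(t, sigma=1.0):
    """P(tau <= t) for tau ~ half-normal(sigma)."""
    return math.erf(t / (sigma * math.sqrt(2.0)))

def gamma_cdf_int_shape(t, shape, rate):
    """P(tau <= t) for tau ~ gamma(shape, rate) with integer shape (Erlang form)."""
    bt = rate * t
    tail = sum(math.exp(-bt) * bt**k / math.factorial(k) for k in range(shape))
    return 1.0 - tail

eps = 0.1  # hypothetical threshold for "effectively zero" on the scale of tau

priors = {
    "half-normal(1)": half_normal_cdf(eps),
    "gamma(2, 1)": gamma_cdf_int_shape(eps, shape=2, rate=1.0),
    "gamma(10, 9)": gamma_cdf_int_shape(eps, shape=10, rate=9.0),  # mode at 1
}
for name, mass in priors.items():
    print(f"{name}: P(tau < {eps}) = {mass:.2e}")
```

Under these assumed settings the half-normal leaves the most mass near zero, the gamma(2) less, and the concentrated gamma far less, which is the sense in which a weak gamma(2) is boundary-avoiding but not really zero-excluding.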

It’s almost like we’re moving toward some discreteness in our modeling and computing, where each variance parameter is either being zeroed out or is being modeled as distinct from zero. So, instead of trying to get Stan (or whatever you’re using) to cover the entire space, you split up the problem externally into “variance = 0” and “variance excluded from 0” components.

Gotta think more about this one. Interesting how this one example keeps revealing further insights. God is in every leaf of every tree.

P.S. Dan Simpson responds.

### 5 Comments

1. Jonathan (another one) says:

> So, instead of trying to get Stan (or whatever you’re using) to cover the entire space, you split up the problem externally into “variance = 0” and “variance excluded from 0” components

So in other words, sometimes the point null is correct?

2. pwyll says:

It sounds like you’re really not a fan of using an inverse-gamma prior for variance. Is there a writeup you can point me to that goes over the reasons? Thanks!

• pwyll says:

Never mind, I see you’ve linked to resources in the post. Will dig into them now…

3. Dan Simpson says:

There’s a pretty easy simulation study to be done here. I’m moderately sure the answer is that they give worse predictions and are hard to calibrate.

4. I’ve been using gamma(N,(N-1)/x) priors for positive quantities a fair amount. This puts maximum density at x, and as you increase N it becomes more and more concentrated around x. N is typically chosen somewhere between 3 and 20, depending on how much information you have about how close to x the parameter is.
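The gamma(N,(N-1)/x) suggestion in the last comment can be checked with the standard gamma formulas. The Python sketch below assumes the second argument is a rate (as in Stan’s `gamma` distribution); the value of `x` and the function name are hypothetical. With shape N and rate (N-1)/x, the mode is (N-1)/rate = x, the mean is N/rate, and the standard deviation is sqrt(N)/rate.

```python
import math

def gamma_summary(n, x):
    """Mode, mean, and sd of gamma(shape=n, rate=(n - 1) / x), for n > 1."""
    rate = (n - 1) / x
    mode = (n - 1) / rate        # equals x by construction
    mean = n / rate
    sd = math.sqrt(n) / rate
    return mode, mean, sd

x = 2.0  # hypothetical prior guess for the parameter
for n in (3, 10, 20):
    mode, mean, sd = gamma_summary(n, x)
    print(f"N = {n}: mode = {mode:.3f}, mean = {mean:.3f}, sd = {sd:.3f}")
```

The mode stays at x while the standard deviation shrinks as N grows, which matches the comment’s description of the prior concentrating around x.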