Zero-excluding priors are probably a bad idea for hierarchical variance parameters

(This is Dan, but in quick mode)

I was on the subway when I saw Andrew’s last post, and the suggestion in it doesn’t strike me as a particularly great idea.

So let’s take a look at the suggestion for 8 schools using a centred parameterization.  This is not as comprehensive as doing a proper simulation study, but the results are sufficiently alarming to put the brakes on Andrew’s proposal. Also, I can do this quickly (with a little help from Jonah Gabry).

We know that this will lead to divergences under any prior that puts a decent amount of mass near zero, so Andrew’s suggestion was to use a boundary-avoiding prior. In particular, a Gamma(2, $latex \beta$) prior, which has mean $latex 2\beta^{-1}$ and variance $latex 2\beta^{-2}$. So let’s make a table!
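
For concreteness, here’s a minimal sketch of the kind of model being fit, assuming the standard 8 schools data (the actual code is in the repo linked below):

```r
library(rstan)

# Centred 8 schools with a boundary-avoiding Gamma(2, beta) prior on tau.
# beta is passed in as data so the same compiled model can be reused for
# every row of the table below.
model_code <- "
data {
  int<lower=0> J;
  vector[J] y;
  vector<lower=0>[J] sigma;
  real<lower=0> beta;      // rate of the Gamma(2, beta) prior on tau
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta;         // centred: theta depends directly on mu and tau
}
model {
  tau ~ gamma(2, beta);    // density is zero at tau = 0 (boundary avoiding)
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}
"

schools <- list(J = 8,
                y = c(28, 8, -3, 7, -1, 1, 18, 12),
                sigma = c(15, 10, 16, 11, 9, 11, 10, 18))

fit <- stan(model_code = model_code, data = c(schools, list(beta = 10)))
```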

Here I’ve run RStan under its default settings, so it’s possible that some of the divergences are false alarms. However, you can see that you need to pull the prior quite far to the right in order to reduce them to double digits. A better workflow would look at pairs plots to see where the divergences occur (see Betancourt’s fabulous workflow tutorial), but I’m lazy. I’ve also included a column that warns if any chain isn’t sampling the energy distribution appropriately (a low BFMI warning).
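
Each row of the table can be produced along these lines (a sketch rather than the repo’s exact code; get_num_divergent() and get_low_bfmi_chains() are rstan’s built-in diagnostic helpers, and the defaults give 4 chains of 1000 post-warmup draws, hence the /4000):

```r
betas <- c(10, 5, 2, 1, 0.5, 0.2, 0.1, 0.05, 0.01)

rows <- t(sapply(betas, function(b) {
  fit <- stan(model_code = model_code, data = c(schools, list(beta = b)),
              refresh = 0)
  tau <- rstan::extract(fit, pars = "tau")$tau
  c(beta        = b,
    prior_mean  = 2 / b,                             # mean of Gamma(2, b)
    q01_prior   = qgamma(0.01, shape = 2, rate = b), # prior 0.01 quantile column
    divergences = rstan::get_num_divergent(fit),     # out of 4000 draws
    low_bfmi    = length(rstan::get_low_bfmi_chains(fit)) > 0,
    quantile(tau, probs = c(0.025, 0.5, 0.975)))     # tau 2.5% / 50% / 97.5%
}))
```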

beta | Prior mean | Prior 0.01 quantile | Divergences (/4000) | Low BFMI warning? | 2.5% for tau | 50% for tau | 97.5% for tau
10.00 | 0.20 | 0.01 | 751 | FALSE | 0.04 | 0.19 | 0.57
5.00 | 0.40 | 0.03 | 342 | FALSE | 0.11 | 0.34 | 1.05
2.00 | 1.00 | 0.07 | 168 | FALSE | 0.30 | 0.99 | 2.81
1.00 | 2.00 | 0.15 | 195 | FALSE | 0.68 | 1.76 | 5.23
0.50 | 4.00 | 0.30 | 412 | FALSE | 0.64 | 2.78 | 8.60
0.20 | 10.00 | 0.74 | 34 | FALSE | 1.51 | 5.32 | 14.71
0.10 | 20.00 | 1.49 | 40 | FALSE | 1.74 | 6.92 | 18.71
0.05 | 40.00 | 2.97 | 17 | FALSE | 2.21 | 8.45 | 23.41
0.01 | 200.00 | 14.86 | 13 | FALSE | 2.88 | 9.37 | 27.86

You can see from this that you can’t reliably remove the divergences, and that the inference for $latex \tau$ depends strongly on how hard the prior avoids the boundary.

But it’s actually worse than that: these priors heavily bias the group standard deviation! To see this, compare with the inference from two of the priors we normally use (which need to be fitted using a non-centred parameterization). If the prior for $latex \tau$ is a half-Cauchy with dispersion parameter 5, the (2.5%, 50%, 97.5%) posterior quantiles for $latex \tau$ are (0.12, 2.68, 11.90). If instead it’s a half-normal with standard deviation 5, the right tail shrinks a little to (0.13, 2.69, 9.23). None of the boundary-avoiding priors give estimates anything like this.
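
For reference, the non-centred version looks something like this (again a sketch under the same assumptions; replace the tau prior with normal(0, 5) for the half-normal comparison):

```r
# Non-centred: theta_raw is a priori independent of tau, which removes the
# funnel geometry that causes the divergences in the centred model.
model_code_ncp <- "
data {
  int<lower=0> J;
  vector[J] y;
  vector<lower=0>[J] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta_raw;
}
transformed parameters {
  vector[J] theta = mu + tau * theta_raw;
}
model {
  tau ~ cauchy(0, 5);       // half-Cauchy, thanks to the <lower=0> constraint
  theta_raw ~ normal(0, 1);
  y ~ normal(theta, sigma);
}
"

fit_ncp <- stan(model_code = model_code_ncp, data = schools)
```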

So please don’t use boundary-avoiding priors on hierarchical variance parameters. They probably don’t work.

Git repo for the code is here.

Edit: At Andrew’s request, a stronger boundary-avoiding prior. This one is an inverse gamma, which has a very, very light left tail (the density goes to zero like $latex e^{-\beta/\tau}$ for small $latex \tau$) and a very heavy right tail (the density goes like $latex \tau^{-3}$ for large $latex \tau$). This is the same table as above. For want of a better option, I kept the shape parameter at 2, although I welcome other options.
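
In the centred model above, this is a one-line change to the prior (beta is again passed as data; with the shape fixed at 2 the prior mean is just beta):

```
model {
  tau ~ inv_gamma(2, beta);  // ~ exp(-beta/tau) near zero, ~ tau^(-3) in the tail
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}
```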

beta | Prior mean | Prior 0.01 quantile | Divergences (/4000) | Low BFMI warning? | 2.5% for tau | 50% for tau | 97.5% for tau
0.50 | 0.50 | 0.07 | 95 | TRUE | 0.09 | 0.32 | 1.91
1.00 | 1.00 | 0.15 | 913 | TRUE | 0.11 | 0.44 | 4.60
2.00 | 2.00 | 0.30 | 221 | FALSE | 0.29 | 1.10 | 5.17
5.00 | 5.00 | 0.74 | 64 | FALSE | 0.99 | 2.60 | 8.61
10.00 | 10.00 | 1.48 | 57 | FALSE | 1.41 | 4.17 | 12.38
20.00 | 20.00 | 2.96 | 4 | FALSE | 2.94 | 6.92 | 16.44
30.00 | 30.00 | 4.44 | 0 | FALSE | 4.08 | 8.86 | 19.99
40.00 | 40.00 | 5.92 | 0 | FALSE | 5.03 | 10.31 | 22.13
50.00 | 50.00 | 7.40 | 0 | FALSE | 5.88 | 11.56 | 24.98
100.00 | 100.00 | 14.79 | 0 | FALSE | 9.93 | 17.77 | 34.84

You can see that you need very strong boundary avoidance to kill off the divergences, and the estimates for $latex \tau$ still seem a bit off (if you compare the prior and posterior quantiles, you can see that the data want $latex \tau$ to be smaller).
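
If you want to check that prior-vs-posterior comparison yourself, exact inverse gamma quantiles are available in base R because the reciprocal of an inv_gamma(shape, beta) variate is Gamma(shape, rate = beta) distributed. The helper below is hypothetical, not from the repo:

```r
# p-quantile of tau ~ inv_gamma(shape, beta) (Stan's parameterization):
# 1/tau ~ Gamma(shape, rate = beta), so invert the (1 - p)-quantile.
inv_gamma_q <- function(p, shape, beta) 1 / qgamma(1 - p, shape, rate = beta)

# Prior (2.5%, 50%, 97.5%) for tau when beta = 20: roughly (3.6, 11.9, 82.6)
inv_gamma_q(c(0.025, 0.5, 0.975), shape = 2, beta = 20)
```

For $latex \beta = 20$, say, the prior median is about 11.9, against a posterior median of 6.92 in the table above.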

7 thoughts on “Zero-excluding priors are probably a bad idea for hierarchical variance parameters”

  1. Dan:

    We did use the zero-avoiding gamma(2) priors for point estimation, but here I was actually thinking of stronger priors, maybe even lognormal, that really do exclude zero for real. I’d discussed this with Jonah in the office but I guess that didn’t make it into my post.

    Also, I disagree that the priors “bias the group standard deviation.” If the prior information is real, the priors aren’t biasing anything! That’s Bayesian inference for you.

    • I can do it for the log-normal if you want! Any suggestions for a ladder of parameters?

      > Also, I disagree that the priors “bias the group standard deviation.” If the prior information is real, the priors aren’t biasing anything! That’s Bayesian inference for you.

      It gives you a massively different set of values, and I suspect that this will tank predictive performance. But I can’t check that without a proper simulation study and it’s a Friday afternoon.

      • Regarding bias, I guess it depends. If we actually have prior information justifying pulling tau (in this case) away from zero in the prior, then yeah, that’s just a prior. But if we don’t have that prior knowledge, then a prior like this seems like a trade-off between bias and ease of computation. Of course, we make those trade-offs all the time, but the beauty of the non-centered parameterization is that it sidesteps the trade-off in this case.

    • All of these have mode x, but they are tighter and tighter around x rather than having the mode pulled farther and farther to the right. This makes much better sense to me: the point is that you have some prior information that the parameter should be near some value x, and N becomes a “concentration” parameter saying how far away from x you’re willing to allow Stan to look.
