## Avoiding boundary estimates using a prior distribution as regularization

For awhile I’ve been fitting most of my multilevel models using lmer/glmer, which gives point estimates of the group-level variance parameters (maximum marginal likelihood estimate for lmer and an approximation for glmer). I’m usually satisfied with this–sure, point estimation understates the uncertainty in model fitting, but that’s typically the least of our worries.

Sometimes, though, lmer/glmer estimates group-level variances at 0 or estimates group-level correlation parameters at +/- 1. Typically, when this happens, it’s not that we’re so sure the variance is close to zero or that the correlation is close to 1 or -1; rather, the marginal likelihood does not provide a lot of information about these parameters of the group-level error distribution.

I don’t want point estimates on the boundary. I don’t want to say that the unexplained variance in some dimension is exactly zero.

One way to handle this problem is full Bayes: slap a prior on sigma, do your Gibbs and Metropolis, and summarize posterior inferences by simulation. That’s fine but it’s another bit of programming and computation. We wanted to be able to do it using the existing lmer/glmer (or, in Stata, gllamm) framework.

Yeojin Chung, Sophia Rabe-Hesketh, Jingchen Liu, Vince Dorie, and I came up with a solution. We compute a maximum penalized marginal likelihood estimate, the marginal posterior mode under a gamma (not inverse gamma; you know how I hate that) prior for the scale parameter. Our default choice is gamma (2, 1/A), where A is set to some large value such as 10 (rescaled relative to the regression predictors, as usual). The shape parameter of 2 gives a prior that is linear at 0: thus, the posterior mode can’t be 0 under any circumstances–but it can be as close to 0 as is required based on the likelihood. Some basic algebra shows that if the marginal maximum likelihood estimate is at 0, our posterior mode estimate will be about 1 standard error away from 0, which seems about right. The rate parameter, 1/A, is set to be small so that the prior will only pull the estimate down if it is unexpectedly large, as might occur if the number of groups is small.

Our prior distribution is weakly informative in that it does something to the estimate but is generic and doesn’t have much information as we would have in just about any particular example. I think that in many applied examples it would make sense to use stronger prior information. But I think that, as a default choice, the weakly informative gamma (2, 1/A) prior makes more sense than the uniform distribution (corresponding to the nonregularized estimate).

In the multivariate case, we use a Wishart (not inverse-Wishart), which when we choose the right parameters keeps the variances away from 0 and the correlation matrix positive definite.

One interesting feature of these models is that the weakly-informative regularizing prior we use for a point estimate is not the same as the weakly-informative prior that we would use for full Bayes (where there is no need to keep the posterior mode away from the boundary).

Here’s the abstract to our paper:

Variance parameters in mixed or multilevel models can be difficult to estimate, espe- cially when the number of groups is small. We propose a maximum penalized likelihood approach which is equivalent to estimating variance parameters by their marginal poste- rior mode, given a weakly informative prior distribution. By choosing the prior from the gamma family with at least 1 degree of freedom, we ensure that the prior density is zero at the boundary and thus the marginal posterior mode of the group-level variance will be positive. The use of a weakly informative prior allows us to stabilize our estimates while remaining faithful to the data.

Let me again thank my collaborators, Yeojin Chung, Sophia Rabe-Hesketh, Jingchen Liu, Vince Dorie. We’re now working on various generalizations, including multivariate distributions, generalized linear models, and hierarchical models for the variance parameters (going beyond the simple ideas in section 6 of my article from a few years ago).

1. ahuri says:

Hello,
probably a very naive question but can the Wishart prior exposed in your paper (p.22) considered weakly informative for full Bayesian inference too?
Cheers

2. Funny, I was thinking the same thing — a Gamma prior — for the same problem. I'm sure it was osmosis from the dangers of the InvGam(0.001, 0.001) problem, and the desire to pick a prior with zero density at zero. I'll consider myself scooped. =)

3. NMU says:

If you go fully Bayesian, what regularization prior do you like for correlation parameters? I'm doing regressions (not multilevel) estimates with AR(1) errors, and the posteriors I get sometimes put the most probability at +1, for problems where I'd expect the actual correlation to be maybe 0.75-0.9, or at most 0.95.

4. Andrew Gelman says:

Ahuri:

I like the scaled-inverse-Wishart family (as described in my book with Jennifer) for a weakly informative prior for full Bayes. We're also working on an eigenvalue-based parameterization.

AT:

More than once, I've written up (and even published) something that was first published only a couple years earlier. Sometimes an idea is just in the air. This is a funny one, though. Technologically speaking, the gamma prior for the hierarchical scale parameter could have been proposed decades ago. But back then we were all trying to be noninformative.

NMU:

For correlations of varying coefficients (after centering or approximately centering the coefficients), we've been talking about the scaled-inverse-Wishart. In our default parameterizations this will center the correlations around 0.

I don't have any immediate thoughts on a prior for autocorrelation parameters. But perhaps a reparameterization would make sense? I get worried about these high AR coefficients because they are typically tangled in the likelihood with any trend parameters in the model.

5. NMU says:

Thanks. I've suspected the high AR coefficients arise from confounding with the trend parameters, or else possibly misspecification of the AR model (seems normal and no obvious power at higher lags, but who knows). I've also suspected reparameterization would help, but nothing seems obvious to me.

6. Pablo Verde says:

Excellent article! Where I can get the R script for the meta-analysis of 8-school data?
I would like to re-do the example.
Thanks!