Ed Vul writes:

In the course of tinkering with someone else’s hairy dataset with a great many candidate explanatory variables (some of which are largely orthogonal factors, but the ones of most interest are competing “binning” schemes of the same latent elements). I wondered about the following “model selection” strategy, which you may have alluded to in your multiple comparisons paper:

Include all plausible factors/interactions as random intercept effects (i.e., (1|A), (1|A:B) in lmer parlance [that’s stan_lmer() now — ed.]). Since there are many competing, non-orthogonal binning schemes included all at once, the model would be overdetermined (singular) if they were all included as fixed effects [he means “non-varying coefficients.” Or maybe he means “varying coefficients estimated without regularization.” The “fixed”/”random” terminology is unclear. — ed.]. However, as random effects [“varying coefficients estimated using regularization” — ed.], we can rely on partial pooling and shrinkage to sort out among them, such that variance along factors that are not well supported by the data (or are explained away by other factors) shrinks to zero. This happens quite decisively in lmer, but a “regularizing” exponential prior on the variances in a fully bayesian model would achieve something similar, I think [easy to do in stan or rstanarm — ed.]. (A more sophisticated approach would be to put a common, pooling prior on the variances for all the individual factors…)

This approach seems to yield sensible results, but I am a bit concerned because I have never seen it used by others, so I am probably missing something. It may just be that it is rarely computationally practical to include all candidate factors/binning-schemes in such an “overcomplete” model. Or perhaps there is a compelling reason why explanatory variables of substantive interest should be treated as fixed, rather than random effects? Is there a fundamental problem with this approach that I am not thinking of? Or is this a well-known technique that I have simply never heard of?

My reply (in addition to the editorial comments inserted above):

This sounds fine. I think, though, you may be overestimating what lmer will do (perhaps my fault given that I featured lmer in my multilevel modeling book). The variance estimate from lmer can be noisy. But, sure, yes, partial pooling is the way to go, I think. Once you’ve fit the model I don’t really see the need to shrink small coefficients all the way to zero, but I guess you can if you want. Easiest I think is to fit the model in Stan (or rstanarm if you want to use lmer-style notation).

Also, the whole fixed/random thing is no big deal. You can allow any and all coefficients to vary; it’s just that if you don’t have a lot of data then it can be a good idea to put informative priors on some of the group-level variance parameters to control the amount of partial pooling. Putting in all these priors might sound kinda weird but that’s just cos we don’t have a lot of experience with such models. Once we have more examples of these (and once they’re in the new edition of my book with Jennifer), then it will be no big deal for you and others to follow this plan in your own models.

Informative priors on group-level variances makes a lot more sense than making a priori decisions to do no pooling, partial pooling, or complete pooling.

The stan_{g}lmer functions in the **rstanarm** R package use a Gamma (by default exponential) prior on the standard deviations of group specific terms like (1|A). But if you have (1|A) + (1|B) + … + (1|Z), you get 26 independent priors on the standard deviations rather than partial pooling. Ed Vul seems to be referring to something more like Andrew’s co-authored paper with Sophie Si ( http://www.stat.columbia.edu/~gelman/research/unpublished/modelweighting.pdf ).

However, if you do something like (1 + x1 + x2 | g), then there is a 3×3 covariance matrix to estimate. Stan developers have long encouraged decomposing a covariance matrix into a correlation matrix and standard deviations and more recently be encouraging people to decompose the correlation matrix into its Cholesky factors. But people are still putting independent (often half-Cauchy, which is dubious) priors on the standard deviations, whereas **rstanarm** has always put a Dirichlet priors on the proportions of the unknown trace of the covariance matrix. By setting the concentration hyperparameter of the Dirichlet distribution to some number greater than 1, you can encourage the variances to be similar to each other. The unknown trace is set equal to the size of the matrix multiplied by the square of a scale parameter, which has a Gamma prior.

Would you suggest something like a horseshoe prior for the group variances (e.g., beta_group[i] ~ Normal(0,lambda_group * tau_global), where tau and the lambdas have half-Cauchy or half-t distributions)?

No, but I can’t say we have any actual research about that. Heavy-tailed priors on variances or standard deviations seem to not work as well as heavy-tailed priors on coefficients.

From

https://www.washingtonpost.com/news/rampage/wp/2017/09/28/free-speech-and-good-vs-bad-polls/

Andrew Gelman, a professor of statistics and political science at Columbia University who is arguably the statistics field’s biggest public intellectual.

And I thought you were the biggest methodological terrorist.

Why not both?