## All the names for hierarchical and multilevel modeling

The title Data Analysis Using Regression and Multilevel/Hierarchical Models hints at the problem, which is that there are a lot of names for models with hierarchical structure.

Ways of saying “hierarchical model”

hierarchical model
a multilevel model with a single nested hierarchy (note my nod to Quine’s “Two Dogmas” with circular references)
multilevel model
a hierarchical model with multiple non-nested hierarchies
random effects model
Item-level parameters are often called “random effects”; reading all the ways the term is used on the Wikipedia page on random effects illustrates why Andrew dislikes the term so much (see also here; both links added by Andrew)—it means many different things to different communities.
mixed effects model
that’s a random effects model with some regular “fixed effect” regression thrown in; this is where lme4 is named after linear mixed effects and NONMEM after nonlinear mixed effects models.
empirical Bayes
Near and dear to Andrew’s heart, because regular Bayes just isn’t empirical enough. I jest—it’s because “empirical Bayes” means using maximum marginal likelihood to estimate priors from data (just like lme4 does).
regularized/penalized/shrunk regression
common approach in machine learning where held out data is used to “learn” the regularization parameters, which are typically framed as shrinkage or regularization scales in penalty terms rather than as priors
automatic relevance determination (ARD)
Radford Neal’s term in his thesis on Gaussian processes and now widely adopted in the GP literature
This one’s common in the machine-learning literature; I think it came from Hal Daumé III’s paper, “Frustratingly easy domain adaptation” in which he rediscovered the technique; he also calls logistic regression a “maximum entropy classifier”, like many people in natural language processing (and physics)
variance components model
I just learned this one on the Wikipedia page on random effects models
cross-sectional model
apparently a thing in econometrics for groups of points measured at the same time; opposite of a time-series in which measurements invove mutliple times
nested data model, split-plot design, random coefficient
The Wikipedia page on multilevel models listed all these.
iterated nested Laplace approximation (INLA), expectation maximization (EM), …
Popular algorithmic approaches that get confused with the modeling technique.

I’m guessing the readers of the blog will have more items to add to the list.

If you liked this post

You might like my earlier post, Logistic regression by any other name.

1. Some Dude says:

I’ve often felt there would be some service in doing something like the above for a variety of models: providing a rosetta stone to illustrate the different names for the same model across disciplines, alongside the different assumptions/data structures that are implied.

2. Adan Becerra says:

Is empirical bayes really referring to the type of model? I always thought of it as more of an estimation technique.

What’s similarity between all these names? Partial pooling. Maybe we should just call them models for pooled/longitudinal data.

3. Mikhail Shubin says:

There are also other meanings for “hierarchical model”

for example, “Statistics for Spatio-Temporal Data” (Cressie, Wikle, 2015) uses the term “hierarchical model” for something I would call Hidden Markov model

• Graham W. Wright says:

I’ve also seen the term “hierarchical regression” used to refer to a sort of stepwise model where coefficients are grouped into “blocks” and added to the model one at a time in the assumed order of causal priority. This of course has nothing to do with nested data or random effects or anything, but it’s just another way these terms get thrown around.

4. Mikhail Shubin says:

by the way, it is not so simple to find the definition of “Hierarchical model” as Andrew uses it.

Like, can you point to the single simple definition?

• Andrew says:

Mikhail:

Either: a model for data that are clustered into (possibly non-nested) groups.

Or: a model where the parameters are themselves batched into batches that are modeled.

Sorry about saying “batched into batches”; I can’t say “batched into groups” because I already used “groups” in the first definition. I try to be unambiguous: data are in groups, parameters are in batches.

• Mikhail Shubin says:

Well, I kinda understand what it mean (the term “Hierarchical model” is the most self-explanatory term in the list). But I was searching for some more mathematical definition. You know, the model can be called hierarchical if parameter vector X can be split into groups X_1, X_2 and so one, data Y can be split into X_1, X_2 and so one, such that P(Y_1|X_1), P(Y_2|X_2) … are separate models. Something like this.

I was searching for a definite definition of a “Hierarchical model” to refer when I was writing my thesis intro. But I found none. I guess if I would be writing another theoretical text, I would have to quote this comment.

• I took the easy way out and defined it circularly.

Technically, I think of a hierarchical model as one involving a hierarchical prior. A hierarchical prior is one we fit jointly with the item-level parameters. For example, in the simplest regression case,

alpha[1], …, alpha[N] ~ normal(mu, sigma)
mu ~ normal(0, 2)
sigma ~ normal(0, 2)

defines a joint density p(alpha, mu, sigma) for “item”-level parameters (alpha[1], …, alpha[N]) and “population”-level parameters (mu, sigma). I use scare quotes here, because these levels can keep nesting (students in classrooms in schools in districts in states in countries in …) so there’s no natural item-level or population level.

5. Adan Becerra says:

There’s also confusion regarding the word “nested.” For example in one paper I wrote something like “we fit a 3 level multilevel model because patients are nested within surgeons who are nested within hospitals”

Then a reviewer said that the data aren’t nested because surgeons can operate at multiple hospitals. We responded with saying that we used the MCMCglmm package which accounts for the cross classification when estimating the random effects and changed the word “nesting” to “clustering”.

Thoughts?

• Andrew says:

In our book, Jennifer and I would refer to that example as non-nested.

• Adan Becerra says:

Fair enough. Clustering makes more sense to me anyway.

How does Stan deal with cross classification? I am currently working on getting data ready to use Stan for the first time. Does it estimate one random effect for a surgeon who operates at multiple hospitals? MCMCglmm estimates one random effect for this surgeon, but I believe other software like SuperMix (which uses empirical bayes). I know I know, SuperMix is written in fortran but Im just curious.

• Adan Becerra says:

I meant to say other software like SuperMix estimates separate random effects for surgeon A at hospital A and surgeon A at hospital B*

• Andrew says:

Stan will do whatever you tell it to. You can fit a model in which parameters vary by surgeon and are preserved across hospitals, or a model where surgeon parameters can vary by hospital. You can also do fit such model using Stan from R using stan_glmer() in rstanarm. Or using brms.

• Adan Becerra says:

Cool. After many years of knowing that Stan is superior to MCMCglmm, I am now going to actually stop being lazy and learn how to do it in Stan.

Will send you paper when it is accepted.

• Andrew says:

Or when it’s rejected, whichever comes first.

• Adan Becerra says:

BAHAHAHA it will surely be rejected at least once before it gets accepted. Journals such as JAMA often don’t appreciate rigorous methods or wouldn’t appreciate why we have these conversations. Or maybe my paper really is not all that important… Regardless Ill be happy to send it to you!

• Ben Bolker says:

Making no claims for superiority of Stan vs MCMCglmm, just here to point out that MCMCglmm can also do whichever of those things you want. Don’t quote me on it, but I think random=~surgeon+hospital should give you a single value per surgeon, whereas random=~hospital+hospital:surgeon would give you “surgeons nested within hospitals”, i.e. a different value for the surgeon in each hospital.

• Adan Becerra says:

You’re probably right. I do the former which would estimate one ranef per surgeon even if they operate at multiple hospitals.

I use these models for profiling. If surgeons were to operate on avg patient what would be mortality prob across all surgeons

• Adan Becerra says:

Side note:

BYE BYE inverse wishart priors :)

• Ben Bolker says:

And … MCMCglmm can do “parameter-expanded” priors, which corresponds to half-t priors for standard deviation parameters (i.e. for scalar random effects terms) and something analogous (???) for vector-valued/multivariate random effects terms … see section 8.0.2 of the “CourseNotes” vignette …

• Adan Becerra says:

Yeah the default is inverse wishart. Can certainly do scaled f but takes some work. I’m just too lazy I want something EZ

• Adan Becerra says:

Done. That was relatively easy. I still need to do a lot of model diagnostic checking and will probably add more variables but I got the model fitting down. Relatively easy. One all nighter did it.

• Peter Chapman says:

The fact that you used the word nested incorrectly might indicate that you are confused. But it doesn’t permit you to assert that “ there is also confusion over the word “nested””. I am not confused.

• Adan Becerra says:

You are correct. I was confused and I was wrong.

Luckily peer review worked by identifying the mistake and allowed us to make the easy word change from patients nested within surgeons nested within hospitals to patients clustered within surgeons clustered within hospitals. And now its published.

https://www.ncbi.nlm.nih.gov/pubmed/31082909

Thank you for pointing this out. Science progresses

6. Thomas Lumley says:

On it’s own ‘cross-sectional time series model’ doesn’t imply hierarchical structure — they come in ‘fixed-effects’ and ‘random-effects’ versions, where the fixed-effects version estimates a separate intercept (or more parameters) for each series and the random-effects version models them as draws from a distribution.

• Adan Becerra says:

Thanks for the clarification. Ive seen econometricians use fixed effects model terminology but it resembled a multilevel model that allows intercepts to vary. At first I thought econ fixed effects model meant including the clustering variable as a parameter in the model. So for an example where patients are clustered within surgeons, I thought that the econ fixed effect model would include surgeon ID as a fixed parameter (which would estimate k-1 parameters where k is number of surgeons), but now I see what you mean… I think lol

• Thanks—I can’t even back out where I saw that term when I was looking for synonyms.

7. Andrew says:

Bob:

You write, “’empirical Bayes’ means using maximum marginal likelihood…” Not quite. Empirical Bayes has different meanings. One meaning of empirical Bayes is hierarchical modeling using point estimates for the hyperparameters (not necessarily marginal maximum likelihood, by the way; traditionally it’s been common to use moment-based estimates). In other settings, empirical Bayes refers to a more general conceptual framework in which intermediate-level parameters are considered as latent or missing data and averaged over, whereas hyperparameters are considered as fixed quantities to be conditioned on. (And then there are the people who say you can’t condition on a fixed quantity; don’t get me started on that.) In any case, you’re right that I don’t like the term “empirical Bayes” because it seems to imply that plain old “Bayes” is not empirical.

• OK. Sounds like yet another case of vague and ambiguous naming in the stats literature. The Wikipedia entry for the Empirical Bayes Method cites Chris Bishop’s book for the definition (their [1], my emphasis):

Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. Empirical Bayes, also known as maximum marginal likelihood,[1] represents one approach for setting hyperparameters.

I couldn’t follow the Casella (1985) paper or even the Efron and Morris (1975) paper as they never isolate the technique per se, but instead introduce it implicitly along with their particular application. I will never understand why statisticians can’t just define something clearly. They keep saying “empirical Bayes”, but never stop to define it other than by example.

8. Peter Chapman says:

Of all these names, the first I learned was “ components of variance” or “variance components”. If you look at any text book published prior to about 1960 this is what it was called. In those pre computer days books dealt mainly with balanced situations, and it was presented as a technique that enabled one to estimate the variance within each level of a hierarchy. It was/is very useful for understanding industrial processes for which the final product was too variable. It enabled one to gain an insight into those parts of the process that were causing this variability.

9. Terry says:

This is a very helpful post.

For quite a while, I thought “hierarchical models” were some exotic new technique, and that I had no hope of understanding them. Turns out that I already knew about them, just under a different name.

• Andrew says:

Terry:

Lots of treatments of hierarchical models, including my book with Jennifer, have oversold the idea by treating the methods as automatic and emphasizing that the group-level variance parameters can be estimated from data alone.

Our focus on this point was understandable as a reaction to earlier Bayesian work that considered the group-level variance parameters as part of the “prior” and thus required to be pre-specified in the analysis. I think we were making a valuable point to emphasize that these hyperparameters can, indeed, be estimated from the data.

But when the number of groups is not large, you can’t estimate the group-level variance parameter accurately from data alone, and it can help, for both computational and inferential reasons, to guide this inference using prior information. So now we’re moving toward a general recommendation to use informative priors when fitting multilevel models. That’s what we do in rstanarm, for example.

How did I realize the problem with purely data-based estimates of group-level variance parameters? I realized after seeing Jeff and Justin fit the same hierarchical regression model to a series of datasets (as part of an MRP project). We noticed the estimated group-level variance parameters jumping around from dataset to dataset, resulting in some substantively incoherent inferences.

Consider a simple case of a state-level variance parameter. When it’s estimated to be near 0, the state intercepts are partially pooled toward the national model, which might predict the state-level intercept based on Republican vote share in the previous election. When the state-level variance is estimated to be a high value, the state intercepts are not pooled, so these estimates will vary much more by state. Having vastly different amounts of pooling from dataset to dataset can wreak havoc on inferences over time. Ideally one would embed all this in a time series model, but in practice we’re often secret-weaponing it, fitting the model separately to each dataset, and we’ll want to regularize this estimate with an informative prior on the group-level variance.

The point is that if you fit a model, or a series of models, on just one dataset, everything can seem to make sense. But when you fit the model repeatedly to different data, the noisiness of the estimate can be more apparent.

Our perspective, and our practice, has changed. We oversimplified and oversold. And now I think we’re doing better.

P.S. OK, I guess this should be its blog post. It should appear in March or so.

• Terry: There are also philosophical views (the meta-theory of “hierarchical models”) – http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

10. Andrew Timm says:

My professor for a Multilevel Models class at NYU (Marc Scott) had a great wisecrack about this at the beginning of the semester. It was something like:

“Multilevel Models are like 2 or 3 hard concepts and lots of different names and notations for everything.”

We had a lot of fun with writing things several different ways as an exercise in translating between disciplines.

11. Ron says:

Domain adaptation is closely related to generalization of findings or transportability “a la Pearl”

12. Dave C. says:

Actuaries call it “Credibility Theory”