All the names for hierarchical and multilevel modeling

The title Data Analysis Using Regression and Multilevel/Hierarchical Models hints at the problem, which is that there are a lot of names for models with hierarchical structure.

Ways of saying “hierarchical model”

hierarchical model
a multilevel model with a single nested hierarchy (note my nod to Quine’s “Two Dogmas” with circular references)
multilevel model
a hierarchical model with multiple non-nested hierarchies
random effects model
Item-level parameters are often called “random effects”; reading all the ways the term is used on the Wikipedia page on random effects illustrates why Andrew dislikes the term so much (see also here; both links added by Andrew)—it means many different things to different communities.
mixed effects model
that’s a random effects model with some regular “fixed effect” regression thrown in; this is why lme4 is named after linear mixed effects and NONMEM after nonlinear mixed effects models.
empirical Bayes
Near and dear to Andrew’s heart, because regular Bayes just isn’t empirical enough. I jest—it’s because “empirical Bayes” means using maximum marginal likelihood to estimate priors from data (just like lme4 does).
regularized/penalized/shrunk regression
common approach in machine learning, where held-out data is used to “learn” the regularization parameters, which are typically framed as shrinkage or regularization scales in penalty terms rather than as priors
automatic relevance determination (ARD)
Radford Neal’s term in his thesis on Gaussian processes and now widely adopted in the GP literature
domain adaptation
This one’s common in the machine-learning literature; I think it came from Hal Daumé III’s paper, “Frustratingly easy domain adaptation” in which he rediscovered the technique; he also calls logistic regression a “maximum entropy classifier”, like many people in natural language processing (and physics)
variance components model
I just learned this one on the Wikipedia page on random effects models
cross-sectional (time-series) model
apparently a thing in econometrics
nested data model, split-plot design, random coefficient model
The Wikipedia page on multilevel models listed all these.
integrated nested Laplace approximation (INLA), expectation maximization (EM), …
Popular algorithmic approaches that get confused with the modeling technique.
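Whatever the name, the common thread is partial pooling: group-level estimates get pulled toward a population mean, with the amount of pooling governed by the group-level variance. Here’s a minimal simulation in Python (made-up numbers; the textbook normal-normal shrinkage formula with known variances, not any particular package’s estimator):

```python
import random
import statistics

random.seed(0)

# Hypothetical setup: 8 groups whose true means are drawn from a
# population distribution normal(mu, tau), with noisy observations.
mu, tau, sigma, n = 50.0, 5.0, 10.0, 20
group_means = [random.gauss(mu, tau) for _ in range(8)]
data = [[random.gauss(m, sigma) for _ in range(n)] for m in group_means]

# No pooling: each group's raw sample mean.
no_pool = [statistics.mean(y) for y in data]

# Complete pooling: one grand mean for everyone.
grand = statistics.mean(v for y in data for v in y)

# Partial pooling: a precision-weighted compromise between the two.
se2 = sigma**2 / n                 # sampling variance of a group mean
w = tau**2 / (tau**2 + se2)        # weight on the group's own data
partial = [w * m + (1 - w) * grand for m in no_pool]
```

Every partially pooled estimate lands between its group’s raw mean and the grand mean; that’s the behavior all the names on this list are describing.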

I’m guessing the readers of the blog will have more items to add to the list.

If you liked this post

You might like my earlier post, Logistic regression by any other name.

42 Comments

  1. Some Dude says:

    I’ve often felt there would be some service in doing something like the above for a variety of models: providing a rosetta stone to illustrate the different names for the same model across disciplines, alongside the different assumptions/data structures that are implied.

  2. Adan Becerra says:

Is empirical Bayes really referring to the type of model? I always thought of it as more of an estimation technique.

What’s the similarity between all these names? Partial pooling. Maybe we should just call them models for pooled/longitudinal data.

  3. Mikhail Shubin says:

    There are also other meanings for “hierarchical model”

for example, “Statistics for Spatio-Temporal Data” (Cressie and Wikle, 2015) uses the term “hierarchical model” for something I would call a hidden Markov model

    • Graham W. Wright says:

      I’ve also seen the term “hierarchical regression” used to refer to a sort of stepwise model where coefficients are grouped into “blocks” and added to the model one at a time in the assumed order of causal priority. This of course has nothing to do with nested data or random effects or anything, but it’s just another way these terms get thrown around.

  4. Mikhail Shubin says:

    by the way, it is not so simple to find the definition of “Hierarchical model” as Andrew uses it.

    Like, can you point to the single simple definition?

    • Andrew says:

      Mikhail:

      Either: a model for data that are clustered into (possibly non-nested) groups.

      Or: a model where the parameters are themselves batched into batches that are modeled.

      Sorry about saying “batched into batches”; I can’t say “batched into groups” because I already used “groups” in the first definition. I try to be unambiguous: data are in groups, parameters are in batches.

      • Mikhail Shubin says:

Well, I kinda understand what it means (the term “hierarchical model” is the most self-explanatory term in the list). But I was searching for a more mathematical definition. You know: the model can be called hierarchical if the parameter vector X can be split into groups X_1, X_2, and so on, and the data Y can be split into Y_1, Y_2, and so on, such that P(Y_1|X_1), P(Y_2|X_2), … are separate models. Something like that.

I was searching for a definitive definition of a “hierarchical model” to cite when I was writing my thesis intro, but I found none. I guess if I were writing another theoretical text, I would have to quote this comment.

        • I took the easy way out and defined it circularly.

          Technically, I think of a hierarchical model as one involving a hierarchical prior. A hierarchical prior is one we fit jointly with the item-level parameters. For example, in the simplest regression case,

          alpha[1], …, alpha[N] ~ normal(mu, sigma)
          mu ~ normal(0, 2)
          sigma ~ normal(0, 2)

          defines a joint density p(alpha, mu, sigma) for “item”-level parameters (alpha[1], …, alpha[N]) and “population”-level parameters (mu, sigma). I use scare quotes here, because these levels can keep nesting (students in classrooms in schools in districts in states in countries in …) so there’s no natural item-level or population level.
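In code, that joint density is just a sum of log densities. A minimal Python sketch (my own, not from any package; I’m constraining sigma to be positive and dropping the half-normal truncation constant):

```python
import math

def normal_lpdf(x, loc, scale):
    """Log density of normal(loc, scale) evaluated at x."""
    return (-0.5 * ((x - loc) / scale) ** 2
            - math.log(scale) - 0.5 * math.log(2 * math.pi))

def log_joint(alpha, mu, sigma):
    """Log of p(alpha, mu, sigma) for the model above (up to a constant)."""
    if sigma <= 0:
        return float("-inf")                # sigma is constrained positive
    lp = normal_lpdf(mu, 0.0, 2.0)          # population level: mu ~ normal(0, 2)
    lp += normal_lpdf(sigma, 0.0, 2.0)      # sigma ~ normal(0, 2), truncated at 0
    lp += sum(normal_lpdf(a, mu, sigma) for a in alpha)  # item level
    return lp
```

Fitting the hierarchical prior “jointly with the item-level parameters” just means sampling or optimizing this whole function over (alpha, mu, sigma) at once.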

  5. Adan Becerra says:

    There’s also confusion regarding the word “nested.” For example in one paper I wrote something like “we fit a 3 level multilevel model because patients are nested within surgeons who are nested within hospitals”

Then a reviewer said that the data aren’t nested because surgeons can operate at multiple hospitals. We responded by saying that we used the MCMCglmm package, which accounts for the cross-classification when estimating the random effects, and changed the word “nesting” to “clustering”.

    Thoughts?

    • Andrew says:

      Adan:

      In our book, Jennifer and I would refer to that example as non-nested.

      • Adan Becerra says:

        Fair enough. Clustering makes more sense to me anyway.

How does Stan deal with cross-classification? I am currently getting data ready to use Stan for the first time. Does it estimate one random effect for a surgeon who operates at multiple hospitals? MCMCglmm estimates one random effect for this surgeon, but I’m not sure about other software like SuperMix (which uses empirical Bayes). I know, I know, SuperMix is written in Fortran, but I’m just curious.

    • Peter Chapman says:

The fact that you used the word “nested” incorrectly might indicate that you are confused. But it doesn’t permit you to assert that “there is also confusion regarding the word ‘nested’.” I am not confused.

  6. Thomas Lumley says:

On its own, ‘cross-sectional time series model’ doesn’t imply hierarchical structure — they come in ‘fixed-effects’ and ‘random-effects’ versions, where the fixed-effects version estimates a separate intercept (or more parameters) for each series and the random-effects version models them as draws from a distribution.

  7. Andrew says:

    Bob:

    You write, “’empirical Bayes’ means using maximum marginal likelihood…” Not quite. Empirical Bayes has different meanings. One meaning of empirical Bayes is hierarchical modeling using point estimates for the hyperparameters (not necessarily marginal maximum likelihood, by the way; traditionally it’s been common to use moment-based estimates). In other settings, empirical Bayes refers to a more general conceptual framework in which intermediate-level parameters are considered as latent or missing data and averaged over, whereas hyperparameters are considered as fixed quantities to be conditioned on. (And then there are the people who say you can’t condition on a fixed quantity; don’t get me started on that.) In any case, you’re right that I don’t like the term “empirical Bayes” because it seems to imply that plain old “Bayes” is not empirical.

    • OK. Sounds like yet another case of vague and ambiguous naming in the stats literature. The Wikipedia entry for the Empirical Bayes Method cites Chris Bishop’s book for the definition (their [1], my emphasis):

      Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. Empirical Bayes, also known as maximum marginal likelihood,[1] represents one approach for setting hyperparameters.

      I couldn’t follow the Casella (1985) paper or even the Efron and Morris (1975) paper as they never isolate the technique per se, but instead introduce it implicitly along with their particular application. I will never understand why statisticians can’t just define something clearly. They keep saying “empirical Bayes”, but never stop to define it other than by example.
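For what it’s worth, the normal-normal case makes the recipe concrete: estimate the prior’s hyperparameters from the marginal distribution of the data (here by moments, one of the options Andrew mentions, rather than by marginal maximum likelihood), then do ordinary Bayes with the estimated prior. A hypothetical sketch:

```python
import statistics

def empirical_bayes(y, s):
    """Moment-based empirical Bayes for y[j] ~ normal(theta[j], s),
    theta[j] ~ normal(mu, tau), with known sampling sd s.

    Marginally, y[j] ~ normal(mu, sqrt(tau^2 + s^2)), so we estimate
    mu and tau from the observed mean and variance of y, then shrink."""
    mu_hat = statistics.mean(y)
    marg_var = statistics.pvariance(y)       # estimates tau^2 + s^2
    tau2_hat = max(marg_var - s**2, 0.0)     # truncate at zero
    w = tau2_hat / (tau2_hat + s**2)         # shrinkage weight
    return [w * yj + (1 - w) * mu_hat for yj in y]
```

When the observed spread is no bigger than the sampling noise, tau-hat hits zero and everything pools completely to the mean; with precise data, the estimates are left nearly alone. That’s the whole technique, stripped of any particular application.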

  8. Peter Chapman says:

Of all these names, the first I learned was “components of variance” or “variance components”. If you look at any textbook published prior to about 1960, this is what it was called. In those pre-computer days, books dealt mainly with balanced situations, and it was presented as a technique that enabled one to estimate the variance within each level of a hierarchy. It was/is very useful for understanding industrial processes for which the final product was too variable. It enabled one to gain insight into those parts of the process that were causing this variability.

  9. Terry says:

    This is a very helpful post.

    For quite a while, I thought “hierarchical models” were some exotic new technique, and that I had no hope of understanding them. Turns out that I already knew about them, just under a different name.

    • Andrew says:

      Terry:

      Lots of treatments of hierarchical models, including my book with Jennifer, have oversold the idea by treating the methods as automatic and emphasizing that the group-level variance parameters can be estimated from data alone.

      Our focus on this point was understandable as a reaction to earlier Bayesian work that considered the group-level variance parameters as part of the “prior” and thus required to be pre-specified in the analysis. I think we were making a valuable point to emphasize that these hyperparameters can, indeed, be estimated from the data.

      But when the number of groups is not large, you can’t estimate the group-level variance parameter accurately from data alone, and it can help, for both computational and inferential reasons, to guide this inference using prior information. So now we’re moving toward a general recommendation to use informative priors when fitting multilevel models. That’s what we do in rstanarm, for example.

      How did I realize the problem with purely data-based estimates of group-level variance parameters? I realized after seeing Jeff and Justin fit the same hierarchical regression model to a series of datasets (as part of an MRP project). We noticed the estimated group-level variance parameters jumping around from dataset to dataset, resulting in some substantively incoherent inferences.

      Consider a simple case of a state-level variance parameter. When it’s estimated to be near 0, the state intercepts are partially pooled toward the national model, which might predict the state-level intercept based on Republican vote share in the previous election. When the state-level variance is estimated to be a high value, the state intercepts are not pooled, so these estimates will vary much more by state. Having vastly different amounts of pooling from dataset to dataset can wreak havoc on inferences over time. Ideally one would embed all this in a time series model, but in practice we’re often secret-weaponing it, fitting the model separately to each dataset, and we’ll want to regularize this estimate with an informative prior on the group-level variance.
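To make that pooling behavior concrete: in the normal-normal case, the weight a state’s own raw estimate gets is tau^2 / (tau^2 + se^2), so a group-level sd estimated near zero means near-complete pooling toward the national model, and a large estimate means essentially no pooling. A toy sketch with hypothetical numbers:

```python
def pooling_weight(tau, se):
    """Weight on a state's own raw estimate under normal-normal partial pooling."""
    return tau**2 / (tau**2 + se**2)

se = 0.5  # hypothetical standard error of one state's raw intercept estimate

# Group-level sd estimated near 0: state intercepts shrink almost
# entirely to the national model's prediction.
low = pooling_weight(0.01, se)

# Group-level sd estimated large: almost no pooling; estimates
# vary freely from state to state.
high = pooling_weight(10.0, se)
```

With a noisy tau estimate jumping between those regimes from dataset to dataset, the amount of pooling swings from nearly complete to nearly none, which is exactly the incoherence an informative prior on tau damps down.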

      The point is that if you fit a model, or a series of models, on just one dataset, everything can seem to make sense. But when you fit the model repeatedly to different data, the noisiness of the estimate can be more apparent.

      Our perspective, and our practice, has changed. We oversimplified and oversold. And now I think we’re doing better.

P.S. OK, I guess this should be its own blog post. It should appear in March or so.

    • Terry: There are also philosophical views (the meta-theory of “hierarchical models”) – http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

  10. Andrew Timm says:

    My professor for a Multilevel Models class at NYU (Marc Scott) had a great wisecrack about this at the beginning of the semester. It was something like:

    “Multilevel Models are like 2 or 3 hard concepts and lots of different names and notations for everything.”

    We had a lot of fun with writing things several different ways as an exercise in translating between disciplines.

  11. Ron says:

    Domain adaptation is closely related to generalization of findings or transportability “a la Pearl”

  12. Dave C. says:

    Actuaries call it “Credibility Theory”
