The title *Data Analysis Using Regression and Multilevel/Hierarchical Models* hints at the problem, which is that there are a lot of names for models with hierarchical structure.

**Ways of saying “hierarchical model”**

*hierarchical model*- a multilevel model with a single nested hierarchy (note my nod to Quine’s “Two Dogmas” with circular references)
*multilevel model*- a hierarchical model with multiple non-nested hierarchies
*random effects model*- Item-level parameters are often called “random effects”; reading all the ways the term is used on the Wikipedia page on random effects illustrates why Andrew dislikes the term so much (see also here; both links added by Andrew)—it means many different things to different communities.
*mixed effects model*- that’s a random effects model with some regular “fixed effect” regression thrown in; this is where lme4 is named after linear mixed effects and NONMEM after nonlinear mixed effects models.
*empirical Bayes*- Near and dear to Andrew’s heart, because regular Bayes just isn’t empirical enough. I jest—it’s because “empirical Bayes” means using maximum marginal likelihood to estimate priors from data (just like lme4 does).
*regularized/penalized/shrunk regression*- common approach in machine learning where held out data is used to “learn” the regularization parameters, which are typically framed as shrinkage or regularization scales in penalty terms rather than as priors
*automatic relevance determination (ARD)*- Radford Neal’s term in his thesis on Gaussian processes and now widely adopted in the GP literature
*domain adaptation*- This one’s common in the machine-learning literature; I think it came from Hal Daumé III’s paper, “Frustratingly easy domain adaptation” in which he rediscovered the technique; he also calls logistic regression a “maximum entropy classifier”, like many people in natural language processing (and physics)
*variance components model*- I just learned this one on the Wikipedia page on random effects models
*cross-sectional model*- apparently a thing in econometrics for groups of points measured at the same time; opposite of a time-series in which measurements invove mutliple times
*nested data model, split-plot design, random coefficient*- The Wikipedia page on multilevel models listed all these.
*iterated nested Laplace approximation (INLA), expectation maximization (EM), …*- Popular algorithmic approaches that get confused with the modeling technique.

I’m guessing the readers of the blog will have more items to add to the list.

**If you liked this post**

You might like my earlier post, Logistic regression by any other name.

I’ve often felt there would be some service in doing something like the above for a variety of models: providing a rosetta stone to illustrate the different names for the same model across disciplines, alongside the different assumptions/data structures that are implied.

Is empirical bayes really referring to the type of model? I always thought of it as more of an estimation technique.

What’s similarity between all these names? Partial pooling. Maybe we should just call them models for pooled/longitudinal data.

Empirical Bayes is just penalized max marginal likelihood. So you have to specify which variables to marginalize, but otherwise it’s just (the first step of) an estimation technique.

Bob:

You write: “Empirical Bayes is just penalized max marginal likelihood.” Sometimes, but not always; see discussion here.

There are also other meanings for “hierarchical model”

for example, “Statistics for Spatio-Temporal Data” (Cressie, Wikle, 2015) uses the term “hierarchical model” for something I would call Hidden Markov model

I’ve also seen the term “hierarchical regression” used to refer to a sort of stepwise model where coefficients are grouped into “blocks” and added to the model one at a time in the assumed order of causal priority. This of course has nothing to do with nested data or random effects or anything, but it’s just another way these terms get thrown around.

by the way, it is not so simple to find the definition of “Hierarchical model” as Andrew uses it.

Like, can you point to the single simple definition?

Mikhail:

Either: a model for data that are clustered into (possibly non-nested) groups.

Or: a model where the parameters are themselves batched into batches that are modeled.

Sorry about saying “batched into batches”; I can’t say “batched into groups” because I already used “groups” in the first definition. I try to be unambiguous: data are in groups, parameters are in batches.

Well, I kinda understand what it mean (the term “Hierarchical model” is the most self-explanatory term in the list). But I was searching for some more mathematical definition. You know, the model can be called hierarchical if parameter vector X can be split into groups X_1, X_2 and so one, data Y can be split into X_1, X_2 and so one, such that P(Y_1|X_1), P(Y_2|X_2) … are separate models. Something like this.

I was searching for a definite definition of a “Hierarchical model” to refer when I was writing my thesis intro. But I found none. I guess if I would be writing another theoretical text, I would have to quote this comment.

I took the easy way out and defined it circularly.

Technically, I think of a hierarchical model as one involving a hierarchical prior. A hierarchical prior is one we fit jointly with the item-level parameters. For example, in the simplest regression case,

alpha[1], …, alpha[N] ~ normal(mu, sigma)

mu ~ normal(0, 2)

sigma ~ normal(0, 2)

defines a joint density p(alpha, mu, sigma) for “item”-level parameters (alpha[1], …, alpha[N]) and “population”-level parameters (mu, sigma). I use scare quotes here, because these levels can keep nesting (students in classrooms in schools in districts in states in countries in …) so there’s no natural item-level or population level.

There’s also confusion regarding the word “nested.” For example in one paper I wrote something like “we fit a 3 level multilevel model because patients are nested within surgeons who are nested within hospitals”

Then a reviewer said that the data aren’t nested because surgeons can operate at multiple hospitals. We responded with saying that we used the MCMCglmm package which accounts for the cross classification when estimating the random effects and changed the word “nesting” to “clustering”.

Thoughts?

Adan:

In our book, Jennifer and I would refer to that example as non-nested.

Fair enough. Clustering makes more sense to me anyway.

How does Stan deal with cross classification? I am currently working on getting data ready to use Stan for the first time. Does it estimate one random effect for a surgeon who operates at multiple hospitals? MCMCglmm estimates one random effect for this surgeon, but I believe other software like SuperMix (which uses empirical bayes). I know I know, SuperMix is written in fortran but Im just curious.

I meant to say other software like SuperMix estimates separate random effects for surgeon A at hospital A and surgeon A at hospital B*

Adan:

Stan will do whatever you tell it to. You can fit a model in which parameters vary by surgeon and are preserved across hospitals, or a model where surgeon parameters can vary by hospital. You can also do fit such model using Stan from R using stan_glmer() in rstanarm. Or using brms.

Cool. After many years of knowing that Stan is superior to MCMCglmm, I am now going to actually stop being lazy and learn how to do it in Stan.

Will send you paper when it is accepted.

-Adan

Or when it’s rejected, whichever comes first.

BAHAHAHA it will surely be rejected at least once before it gets accepted. Journals such as JAMA often don’t appreciate rigorous methods or wouldn’t appreciate why we have these conversations. Or maybe my paper really is not all that important… Regardless Ill be happy to send it to you!

Making no claims for superiority of Stan vs MCMCglmm, just here to point out that MCMCglmm can also do whichever of those things you want. Don’t quote me on it, but I think random=~surgeon+hospital should give you a single value per surgeon, whereas random=~hospital+hospital:surgeon would give you “surgeons nested within hospitals”, i.e. a different value for the surgeon in each hospital.

You’re probably right. I do the former which would estimate one ranef per surgeon even if they operate at multiple hospitals.

I use these models for profiling. If surgeons were to operate on avg patient what would be mortality prob across all surgeons

Side note:

BYE BYE inverse wishart priors :)

And … MCMCglmm can do “parameter-expanded” priors, which corresponds to half-t priors for standard deviation parameters (i.e. for scalar random effects terms) and something analogous (???) for vector-valued/multivariate random effects terms … see section 8.0.2 of the “CourseNotes” vignette …

Yeah the default is inverse wishart. Can certainly do scaled f but takes some work. I’m just too lazy I want something EZ

Done. That was relatively easy. I still need to do a lot of model diagnostic checking and will probably add more variables but I got the model fitting down. Relatively easy. One all nighter did it.

-Adan

Adan:

Perhaps even especially for clinical research, you might wish to go heavy on the principled Bayesian workflow.

My into for those who tend to be “lazy” – my talk at https://www.youtube.com/watch?v=I7AVP9BCm1g&feature=youtu.be (also access to code and slides) – or more definitive stuff by Michael Betancourt at https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html

And you are only expecting one rejection ;-)

Keith:

Wow, thank you so much. That’s vbery nice of you, I appreciate it. I will certainly go through the details and willing to learn the workflow.

The fact that you used the word nested incorrectly might indicate that you are confused. But it doesn’t permit you to assert that “ there is also confusion over the word “nested””. I am not confused.

You are correct. I was confused and I was wrong.

Luckily peer review worked by identifying the mistake and allowed us to make the easy word change from patients nested within surgeons nested within hospitals to patients clustered within surgeons clustered within hospitals. And now its published.

https://www.ncbi.nlm.nih.gov/pubmed/31082909

Thank you for pointing this out. Science progresses

On it’s own ‘cross-sectional time series model’ doesn’t imply hierarchical structure — they come in ‘fixed-effects’ and ‘random-effects’ versions, where the fixed-effects version estimates a separate intercept (or more parameters) for each series and the random-effects version models them as draws from a distribution.

Thanks for the clarification. Ive seen econometricians use fixed effects model terminology but it resembled a multilevel model that allows intercepts to vary. At first I thought econ fixed effects model meant including the clustering variable as a parameter in the model. So for an example where patients are clustered within surgeons, I thought that the econ fixed effect model would include surgeon ID as a fixed parameter (which would estimate k-1 parameters where k is number of surgeons), but now I see what you mean… I think lol

“How did I miss this” I hear people crying… including me. Andrew does a good job of clearing up the issue below: effects vary by group in both settings but no pooling versus pooling, or another way to think about it is no probability model vs. yes probability model

Again “how did I miss this”- I ask myself

https://statmodeling.stat.columbia.edu/2015/03/22/no-fixed-random/

Adan: Also this may be of interest https://statmodeling.stat.columbia.edu/2019/09/18/all-the-names-for-hierarchical-and-multilevel-modeling/#comment-1122518

Thanks—I can’t even back out where I saw that term when I was looking for synonyms.

Bob:

You write, “’empirical Bayes’ means using maximum marginal likelihood…” Not quite. Empirical Bayes has different meanings. One meaning of empirical Bayes is hierarchical modeling using point estimates for the hyperparameters (not necessarily marginal maximum likelihood, by the way; traditionally it’s been common to use moment-based estimates). In other settings, empirical Bayes refers to a more general conceptual framework in which intermediate-level parameters are considered as latent or missing data and averaged over, whereas hyperparameters are considered as fixed quantities to be conditioned on. (And then there are the people who say you can’t condition on a fixed quantity; don’t get me started on that.) In any case, you’re right that I don’t like the term “empirical Bayes” because it seems to imply that plain old “Bayes” is not empirical.

OK. Sounds like yet another case of vague and ambiguous naming in the stats literature. The Wikipedia entry for the Empirical Bayes Method cites Chris Bishop’s book for the definition (their [1], my emphasis):

I couldn’t follow the Casella (1985) paper or even the Efron and Morris (1975) paper as they never isolate the technique per se, but instead introduce it implicitly along with their particular application. I will never understand why statisticians can’t just define something clearly. They keep saying “empirical Bayes”, but never stop to define it other than by example.

Of all these names, the first I learned was “ components of variance” or “variance components”. If you look at any text book published prior to about 1960 this is what it was called. In those pre computer days books dealt mainly with balanced situations, and it was presented as a technique that enabled one to estimate the variance within each level of a hierarchy. It was/is very useful for understanding industrial processes for which the final product was too variable. It enabled one to gain an insight into those parts of the process that were causing this variability.

This is a very helpful post.

For quite a while, I thought “hierarchical models” were some exotic new technique, and that I had no hope of understanding them. Turns out that I already knew about them, just under a different name.

Terry:

Lots of treatments of hierarchical models, including my book with Jennifer, have oversold the idea by treating the methods as automatic and emphasizing that the group-level variance parameters can be estimated from data alone.

Our focus on this point was understandable as a reaction to earlier Bayesian work that considered the group-level variance parameters as part of the “prior” and thus required to be pre-specified in the analysis. I think we were making a valuable point to emphasize that these hyperparameters can, indeed, be estimated from the data.

But when the number of groups is not large, you can’t estimate the group-level variance parameter accurately from data alone, and it can help, for both computational and inferential reasons, to guide this inference using prior information. So now we’re moving toward a general recommendation to use informative priors when fitting multilevel models. That’s what we do in rstanarm, for example.

How did I realize the problem with purely data-based estimates of group-level variance parameters? I realized after seeing Jeff and Justin fit the same hierarchical regression model to a series of datasets (as part of an MRP project). We noticed the estimated group-level variance parameters jumping around from dataset to dataset, resulting in some substantively incoherent inferences.

Consider a simple case of a state-level variance parameter. When it’s estimated to be near 0, the state intercepts are partially pooled toward the national model, which might predict the state-level intercept based on Republican vote share in the previous election. When the state-level variance is estimated to be a high value, the state intercepts are not pooled, so these estimates will vary much more by state. Having vastly different amounts of pooling from dataset to dataset can wreak havoc on inferences over time. Ideally one would embed all this in a time series model, but in practice we’re often secret-weaponing it, fitting the model separately to each dataset, and we’ll want to regularize this estimate with an informative prior on the group-level variance.

The point is that if you fit a model, or a series of models, on just one dataset, everything can seem to make sense. But when you fit the model repeatedly to different data, the noisiness of the estimate can be more apparent.

Our perspective, and our practice, has changed. We oversimplified and oversold. And now I think we’re doing better.

P.S. OK, I guess this should be its blog post. It should appear in March or so.

Terry: There are also philosophical views (the meta-theory of “hierarchical models”) – http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

My professor for a Multilevel Models class at NYU (Marc Scott) had a great wisecrack about this at the beginning of the semester. It was something like:

“Multilevel Models are like 2 or 3 hard concepts and lots of different names and notations for everything.”

We had a lot of fun with writing things several different ways as an exercise in translating between disciplines.

Domain adaptation is closely related to generalization of findings or transportability “a la Pearl”

Actuaries call it “Credibility Theory”