Someone who wishes to remain anonymous writes:

Lacking proper experience with multilevel modeling, I have a question regarding a nationwide project on hospital mortality that I’ve recently come into contact with. The primary aim of the project is to benchmark hospital performance in terms of mortality (a binary outcome) while controlling for “case mix,” that is, the particular mix of patients (in terms of, e.g., age, gender, comorbidity, …) in each hospital. But it is not unthinkable that the project could lead to, e.g., early warning systems. In short:

The data:

– about 1e6 records for all patients for the last 10 years in all participating hospitals, including among other things the outcome, diagnosis, age, gender, comorbidity, treatment year, …

– there are two obvious higher-level groupings: hospitals (n ≈ 30) and diagnosis groups (≈ 300 groups, each containing at least about 1,000 observations)

The analysis at this point:

– a separate logistic regression for each diagnostic group

– LOOCV for variable selection for each regression

– Hospitals are not included as a factor in the analyses

The benchmark:

– Hospitals are benchmarked for outcomes in a particular diagnostic group using the appropriate regression model, applied to their own patients in the selected period (thus, the prediction is tailored to their patients), comparing observed and predicted mortality rates.

Having some background in statistics (in a previous life I have read your book on multilevel modeling), I wondered whether:

– multilevel analysis can add anything over and above the current approach. I can imagine there will not be much shrinkage given the large numbers in each group, but still, I have the nagging intuition that some effects (e.g., age) can be better gauged using all the data. Thus, a model with varying intercepts (for diagnostic groups) and at least some varying slopes (for example, age, age^2, comorbidity, …)

– excluding the hospitals makes sense? My first reaction was: use them in the regression, evaluate the effects, and, if required, make predictions without their coefficients.

My reply:

You could have a billion or a trillion observations but you’d still have only 10 years of data, so I’d expect this variation over time to drive your uncertainty. So, yeah, I’d put in varying intercepts for time, hospital, and their interaction, for sure. Probably some varying slopes would make sense too.

There are two things to worry about here: modeling and computation.

Setting computation aside for a moment, you’d definitely want to put in those varying intercepts as a start. One thing we haven’t worked out so well is workflow: what varying slopes to include, how much to build out your model before it becomes so large that it’s hard to understand, etc. It could also make sense to use informative priors for the group-level variances. And once you’ve fit your model, you can plot time series of those varying intercepts, for each hospital. Also plot those intercepts for the diagnostic groups.
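To make that structure concrete, here is a hypothetical simulation sketch, with made-up sizes and effect scales rather than the project’s actual model: binary mortality with varying intercepts for hospital, year, and their interaction, plus one case-mix slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, much smaller than the real data
n_hosp, n_year, n_pat = 30, 10, 5000

hosp = rng.integers(0, n_hosp, n_pat)
year = rng.integers(0, n_year, n_pat)
age = rng.normal(0, 1, n_pat)  # standardized age, a stand-in for case mix

# Varying intercepts for hospital, year, and their interaction,
# each drawn from its own group-level distribution
a_hosp = rng.normal(0, 0.3, n_hosp)
a_year = rng.normal(0, 0.2, n_year)
a_hy = rng.normal(0, 0.1, (n_hosp, n_year))

# Linear predictor on the logit scale, with one case-mix slope for age
logit_p = -3.0 + a_hosp[hosp] + a_year[year] + a_hy[hosp, year] + 0.5 * age
p = 1.0 / (1.0 + np.exp(-logit_p))
y = rng.binomial(1, p)  # binary mortality outcome
```

Fitting the model then means estimating the group-level scales (0.3, 0.2, 0.1 here) along with the intercepts themselves, which is what produces the shrinkage and the plottable per-hospital time series.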

The other thing is computing. Lots of issues here. Again, we need a clear workflow.

**P.S.** I wrote the above a few months ago. Since then, my colleagues and I wrote this article on Bayesian workflow.

One thing I’ve struggled with on the computational front, specifically working with multiomic data that might include thousands of individual genes or molecules whose effects I might wish to partially pool across: once I’ve specified my full desired hierarchical model, with hyperparameters on all the main effects and interactions of interest (across, say, genes, cells, tissues, sexes, timepoints, and individuals), the model might contain millions of nominal parameters being fitted to tens or hundreds of millions of observations. Even with various efficiency tricks and reparameterizations, within-chain parallelization, etc., this ends up being utterly intractable to fit in Stan. Assuming I’m otherwise philosophically OK with such a parameter-rich model, given all the shrinkage that should occur from the multilevel structure, what’s my recourse? (The more conventional solution being to fit millions of independent models, rely on the large sample sizes for enough power to survive multiple-comparisons adjustment, and then apply a significance filter :S)

One idea I’ve had has been empirical Bayes flavored — fit the full model to lots of random subsets of the data, say 100 genes at a time, and then characterize the distribution of estimated model hyperparameters. Then fit each independent model (gene-by-gene, say) given whatever informative hyperprior is induced, adding another level to the hierarchy. Would this be a reliable approximation to the full hierarchical model? Is there a better alternative approach?
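As a toy version of the idea (made-up numbers; a normal-normal model with the observation noise assumed known, so everything is closed-form, unlike the real Stan model): estimate the group-level scale from a subset of groups, then shrink every group independently given that estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy hierarchy: many "genes" (groups), a few noisy observations each
n_groups, n_per = 2000, 5
true_sigma_a = 1.0   # group-level sd, the hyperparameter we hope to recover
sigma_y = 2.0        # observation noise sd, assumed known for simplicity

a = rng.normal(0.0, true_sigma_a, n_groups)                # true group effects
y = a[:, None] + rng.normal(0.0, sigma_y, (n_groups, n_per))

# Step 1: estimate sigma_a from a random subset of groups, via the
# method of moments: var(ybar_i) = sigma_a^2 + sigma_y^2 / n_per
subset = rng.choice(n_groups, 100, replace=False)
ybar_sub = y[subset].mean(axis=1)
sigma_a_hat = np.sqrt(max(ybar_sub.var() - sigma_y**2 / n_per, 1e-6))

# Step 2: with sigma_a fixed at its estimate, each group's posterior mean
# is a closed-form shrinkage of its own sample mean, computable per group
ybar = y.mean(axis=1)
w = (n_per / sigma_y**2) / (n_per / sigma_y**2 + 1 / sigma_a_hat**2)
a_shrunk = w * ybar
```

In this conjugate toy the subset-then-shrink shortcut recovers most of the benefit of the joint fit; whether that carries over to the real nonlinear model is exactly the open question.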

(alternatively, I’ve also considered fitting a buncha sub-models to the full data, each with some subset of the focal parameters, and then trying to interpret focal quantities across posterior predictive simulation averaged across them, which I’d usually have implemented and fit anyway during iterative model construction, but that seems less satisfying. And then the last approach I’ve used is just to fit multiple chains for weeks at a time — but that sucks!)

> this ends up being utterly intractable to fit in Stan

Intractable because it is too slow? Or lots of divergences and hard to figure out what’s going on? Or too difficult to program?

Generally too slow to mix over the super-high-dimensional parameter space! Sometimes I get lots of divergent iterations, but then usually increasing the target acceptance rate and using a non-centered parameterization is enough to fix that. Otherwise it just takes too long to fit (that is, to pass MCMC diagnostics across multiple independent chains) when having to sample a million parameters at a time. I’m not sure what’s meant by programming difficulty, but I can specify & implement the model in Stan with fewer instances of the different categories; it’s just scaling it up to the full parameter space that makes it slow (like, getting 10,000 iterations takes months to finish, slow).
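For what it’s worth, the non-centered trick amounts to a deterministic shift-and-scale of a standard normal. A quick Monte Carlo sanity check (arbitrary mu and tau) that the two parameterizations define the same distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, tau, n = 1.5, 0.3, 200_000

# Centered: draw the group effect directly from its hierarchical prior
theta_centered = rng.normal(mu, tau, n)

# Non-centered: draw a standard normal and shift/scale it deterministically.
# Identical distribution, but the sampler explores an isotropic z-space,
# which sidesteps the funnel geometry when tau is poorly informed by the data.
z = rng.normal(0.0, 1.0, n)
theta_noncentered = mu + tau * z
```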

Nikolai:

You do not want to run for 10,000 iterations. Can you try constraining the model with strong priors? Another approach is to split and combine the data, as here: http://www.stat.columbia.edu/~gelman/research/published/ep_life_jmlr.pdf

Interesting, thank you! The Expectation Propagation approach described sounds promising, but I’ll have to give it a closer read to really understand all the fiddly bits (some parts are more familiar to me than others, e.g. in grad school I spent a lot of time exploring proposal distributions that respect the PSD constraint of high-dimensional correlation matrices, so the struggle in 5.3 echoes old frustrations haha).

Out of curiosity, is the desire to avoid just throwing longer chains at an inference problem motivated by concerns for efficiency, or something else? Using Met-Hastings in phylogenetics it was customary to just run longer and longer analyses, hundreds of millions of iterations long, when some complex posterior geometry proved especially unyielding. Algorithmic details notwithstanding, are there other concerns with that strategy when using HMC / NUTS? Specifically, the model that I’m hoping to fit is this one (https://pastebin.com/Wr32H1Z9), the sum of a logistic growth and decay function, except with varying intercepts (of dimension bajillion) on various parameters. In simulation, I find the (non-hierarchical) model fits super easily when presented with lots of data (like 128 whole observations… though with lots of non-identifiability across certain focal parameters, e.g. https://i.imgur.com/HVWW8Or.png), but with few data points (e.g. https://i.imgur.com/6e8NqOt.png — 90% HPDIs and posterior predictive intervals of a posterior fit plotted), I find that I’m often not passing even the most basic of diagnostics until hitting 100s of thousands of iterations (having set up a little while() loop to do some automated checking and increase adapt_delta, chain length, max_treedepth, etc. as warranted, depending on what runtime warnings get reported).

(this is all on lots of replicated simulated data to explore the statistical properties / viability of this model, before applying it to the big empirical data at hand — the hope being that the strong priors on the bottom-level parameters get learned in the course of fitting)

But the full hierarchical model won’t ever actually sample, because it seems like it’s too big (given prior experience working with smaller data in Stan). Even something like:

```
y ~ normal(mu, sigma)        // assume all y are mean-centered and rescaled to unit variance
mu = a_i + b_i * x           // ditto all x
a_i ~ normal(0, sigma_a)
b_i ~ normal(0, sigma_b)
sigma_a ~ half-cauchy(0,1)   // or I guess half-cauchy(1), depending on taste
sigma_b ~ half-cauchy(0,1)
```

Seems like it would never get off the ground if _i indexes, like, a million groups. Right? Thus my initial thinking that if in the simple linear model above, sigma_a and sigma_b can be reliably learned from just some small subset of _i, I could do away with the hierarchical structure and fit the wholly flat model in parallel on each group simultaneously. If all the a_i and b_i only affect each other through the sigmas, there’s no benefit to fitting them jointly?
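For instance, in a tiny conjugate version of the sketch above (intercepts only, made-up numbers, sigmas held fixed rather than estimated), the posterior factorizes over groups exactly, which is the intuition I’m leaning on:

```python
import numpy as np

# Intercept-only version of the model above, with sigma and sigma_a FIXED:
#   y_ij ~ normal(a_i, sigma),  a_i ~ normal(0, sigma_a)
# Conditional on the sigmas, the posterior factorizes over groups,
# p(a | y) = prod_i p(a_i | y_i), so per-group fits are exact, not approximate.

def posterior_ai(y_i, sigma, sigma_a):
    """Conjugate normal posterior (mean, sd) for one group's intercept."""
    prec = len(y_i) / sigma**2 + 1.0 / sigma_a**2
    return (y_i.sum() / sigma**2) / prec, np.sqrt(1.0 / prec)

rng = np.random.default_rng(3)
sigma, sigma_a = 1.0, 0.5
groups = [rng.normal(rng.normal(0.0, sigma_a), sigma, 20) for _ in range(4)]

# "Parallel" per-group posteriors: each call sees only that group's data
posts = [posterior_ai(g, sigma, sigma_a) for g in groups]
```

The catch, of course, is that the sigmas aren’t actually known, so the shortcut is only as good as the subset-based estimates of them.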

Yeah months is too slow.

> I’m not sure what’s meant by programming difficulty, but I can specify & implement the model

Just checking that you were able to write the model without too much difficulty (if it took months to write it out, for instance, that is too long).

> are there other concerns with that strategy when using HMC / NUTS

If it takes months to do a calculation, then that doesn’t leave much time to check that the calculation is for sure doing what you expect.

So what you said:

> this is all on lots of replicated simulated data to explore the statistical properties / viability of this model

Like you wanna be doing as much as that as possible regardless of what Stan’s Rhats/ESS estimates say (and you’d do it for individual fits and you’d do it for EP too) and if that takes months oof.

As far as crazy long chains, I don’t think there’s anything special about NUTS there. Just the same, generic, if-you’re-having-to-run-this-long-why-not-longer-maybe-you-haven’t-mixed concerns. Multiple-chains + Rhat is the workaround.

> sigma_a ~ half-cauchy(0,1)

> sigma_b ~ half-cauchy(0,1)

The half-cauchies can be hard to work with. Bob has a note here: https://statmodeling.stat.columbia.edu/2021/03/25/the-folk-theorem-revisited/#comment-1768023

> sigma_a and sigma_b can be reliably learned from just some small subset of _i, I could do away with the hierarchical structure and fit the wholly flat model in parallel on each group simultaneously

That seems nice cause it would be easy to code. Also you could pull estimates of sigma_a and sigma_b before the full fits mix and then use that for a full fit with fixed hierarchical parameters.

If the model is that simple, then if you have group members with lots of data and some without, then you might need to mix non-centered and centered parameterizations. Like 8-schools model if the 8 measurements were really precise is better done with a centered parameterization.

I was thinking more about my answer with the months and I think it’s kinda wrong.

Months is still probably bad, but I think here’s a better way to get at that answer.

Running a model (whatever that means, and whatever algorithms or whatever we’re using) produces two kinds of information (using information informally):

1. Information useful for directly improving the model

2. Information useful for the application

If you’re doing biology, the biology info is the second. In the process of doing this, you might find divergences, or other computational or statistical problems with your model. This is the first type of information.

Before you really trust the second type of information your model generates (application info), you need to spend some time generating the first type of information (modeling info). So even if we say “we can perfect our model with 10 units of information” and each model-fitting run generates 1 unit, then at three months per run that’s 10 × 3 months = 2.5 years before you even start generating the second kind of information, the stuff that matters for your application.

The second piece here is that the product of running your model is information — that information derives value from its use elsewhere. Even if you tell me, “my higher level decision is simple, we can simply wait 2.5 years for the answer and be done with it” I am skeptical.

I think the reality is that this higher level decision is likely to lead to iterations in the types of information you want to extract about the problem at hand (sending you all the way back to the beginning). So a decision process that can support this sort of delay in information would need to have a timeline in the decades probably.

—

And I’m exaggerating numbers a bit there. You already said:

> (this is all on lots of replicated simulated data to explore the statistical properties / viability of this model, before applying it to the big empirical data at hand — the hope being that the strong priors on the bottom-level parameters get learned in the course of fitting)

So you’re using smaller experiments to generate #1 type information which is good but I think the relationship between #2 and the next level of the problem will be similar to the relationship between #1 and #2, so a unit of computation that takes 3 months -> bad.

>If it takes months to do a calculation, then that doesn’t leave much time to check that the calculation is for sure doing what you expect.

Oh ya, I hear ya haha. Nothing worse than waiting ages for a model to fit only to see it fail. Usually I’ve been able to get around that by iteratively complexifying it bit by bit, so that I can troubleshoot nested models on data subsets that fit much faster.

>you wanna be doing as much as that as possible regardless of what Stan’s Rhats/ESS estimates say

Agreed! I was just doing the automated diagnostics for when the chains failed to converge / mix (these are all across 4 independent chains, by default), that I might rerun them longer or under different conditions.

RE: half-cauchies, ya, I’ve had trouble with them before, and will usually swap ’em out on scale parameters for a half-normal or exponential if I suspect they’re giving me trouble. The fat tails are a nice hedge but often not worth the trouble.
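To put numbers on those fat tails, here’s the upper-tail mass beyond x = 5 for the three priors mentioned (unit-scale versions, standard-library math only):

```python
import math

def halfcauchy_tail(x):   # P(X > x) for half-Cauchy(0, 1)
    return 1.0 - (2.0 / math.pi) * math.atan(x)

def halfnormal_tail(x):   # P(X > x) for half-normal(0, 1)
    return math.erfc(x / math.sqrt(2.0))

def exponential_tail(x):  # P(X > x) for Exponential(rate = 1)
    return math.exp(-x)

x = 5.0
tails = {
    "half-cauchy": halfcauchy_tail(x),   # ~0.126: over a tenth of the mass
    "exponential": exponential_tail(x),  # ~0.0067
    "half-normal": halfnormal_tail(x),   # ~6e-7
}
```

So a half-Cauchy(1) puts more than a tenth of its mass above 5, which is exactly the hedge (and the trouble) being traded away.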

> That seems nice cause it would be easy to code.

OK, so it sounds like there’s nothing terribly unprincipled that I’m missing wrt that sort of short-cutting! Thanks!

> Like 8-schools model if the 8 measurements were really precise is better done with a centered parameterization.

TBH I don’t have the best intuitions for when one is preferred over the other — usually I’ll just flip between them when having trouble, or else use the non-centered form when working with really crummy data. Should probably actually give that 2013 Betancourt and Girolami paper a read!

Agreed on your follow-up comment! And really, in this current case the waiting time between units of your first type of information would be measured in centuries or whatever haha, so that’s why I’m looking for cheap workarounds!