Mark White writes:

I have a question about using multilevel models for multiple comparisons, per your 2012 paper and many blog posts. I am in a situation where I run randomized experiments, and I also have a lot of additional demographic information about the participants. For the moment, let us assume that all of these are categorical demographic variables. I want to know not only whether there is an effect of the treatment over the control, but also for which groups there is an effect (positive or negative). I never get too granular, but I do look at intersections of two variables (e.g., Black men, younger married people, Republican women) as well as groups within a single variable (e.g., women, Republicans, married people).

The issue I’m running into is that I want to look at the effects for all of these groups, but I don’t want to get mired in Type I error and go chasing noise. (I know you reject the Type I error paradigm because a null of precisely zero is a straw-man argument, but clients and other stakeholders still want to be sure we aren’t reading too much into something that isn’t there.)

In the machine learning literature, there is a growing interest in causal inference and now a whole topic called “heterogeneous treatment effects.” In the general linear model world in which I was taught as a psychologist, this could also just be called “looking for interactions.” Many of these methods are promising, but I’m finding them difficult to implement in my scenario (I wrote a question here https://stats.stackexchange.com/questions/341402/a-few-questions-regarding-the-practice-of-heterogeneous-treatment-effect-analysi and posed a tailored question about one package to package creators directly here https://github.com/swager/grf/issues/238).

Turning back to multilevel models, it seems like I could do this in that framework. Basically, I just create a non-nested/crossed/whatever you’d like to call it model where people are nested in k groups, where k refers to how many demographic variables I have. I simulated data and fit a model here: https://gist.github.com/markhwhiteii/592d40f93b052663f240125fc9b8db99

The questions I have for you are the questions I pose at the bottom of that R script at the GitHub code snippet:

1. Is this a reasonable approach to examining “heterogeneous treatment effects” without getting bogged down by Type I error and multiple-comparison problems?

2. If it is, how can I get confidence intervals from the fitted model object using glmer? You do so in the 2012 paper, I believe.

3. More importantly, how can I look at the intersection between two groups? The code I sent in that GitHub snippet looks at effects for men, women, Blacks, Whites, millennials, etc. But I coded in an effect for Black men specifically. How could I use that fitted model object to examine the effect for Black men, White women, millennials with kids, etc.? And how would I calculate standard errors for these?

4. Would all of these things be easier to do in Stan? What would that Stan model look like? Since then I wouldn’t have to figure out how to calculate standard errors for everything, but just sample from the posterior.

My reply:

We’ve been talking about varying treatment effects for a long time. (“Heterogeneous” is jargon for “varying,” I think.)

From 2004: Treatment effects in before-after data.

From 2008: Estimating incumbency advantage and its variation, as an example of a before/after study.

From 2015: The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective.

From 2015: Hierarchical models for causal effects.

From 2015: The connection between varying treatment effects and the well-known optimism of published research findings.

From 2017: Let’s accept the idea that treatment effects vary—not as something special but just as a matter of course.

I definitely think hierarchical modeling is the way to go here. Think of it as a regression model, in which you’re modeling (predicting) treatment effects given pre-treatment predictors, so the treatment could be more effective for men than for women, or for young people than for old people, etc. You’ll end up with lots of predictors in this regression, and multilevel modeling is a way to control or regularize their coefficients.

In short, the key virtue of multilevel modeling (or some other regularization approach) here is that it allows you to include more predictors in your regression. Without regularization, your estimates would become too noisy, and you’d have to fit a cruder model, which wouldn’t allow you to study the variation that you care about.

The other thing is, yeah, forget type 1 error rates and all the rest. Abandon the idea that the goal of the statistical analysis is to get some sort of certainty. Instead, accept posterior ambiguity: don’t try to learn more from the data than you really can.

I’ll start with some models in lme4 (or rstanarm) notation. Suppose you have a treatment z and pre-treatment predictors x1 and x2. Then here are some models:

y ~ z + x1 + x2      # constant treatment effect
y ~ z + x1*z + x2*z  # treatment can vary by x1 and x2
y ~ z + x1*x2*z      # also include the interaction of x1 and x2

If you have predictors x3 and x4 with multiple levels:

y ~ z + x1 + x2 + (1 | x3) + (1 | x4)              # constant treatment effect
y ~ z + x1*z + x2*z + (1 + z | x3) + (1 + z | x4)  # varying treatment effect
y ~ z + x1*z + x2*z + (1 + z | x3:x4)              # includes an interaction (x3:x4 is lme4's syntax for the crossed grouping factor)

One thing we’re still struggling with is that there are all these possible models. Really we’d like to start and end with the full model, something like this, with all the interactions:

y ~ (1 + x1*x2*z | x3:x4)

But these models can be hard to handle. I think we need stronger priors, stronger than the current defaults in rstanarm. So for now I’d build up from the simple model, including interactions as appropriate.

In any case, you can get posterior uncertainties for whatever you want from stan_glmer() in rstanarm; simulations of all the parameters are directly accessible from the fitted object.
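For instance, a minimal sketch, assuming a data frame `dat` containing `y`, `z`, and `x1` through `x4` (names as in the models above):

```r
library(rstanarm)

# Varying treatment effect by x3 and x4, as in the second model above
fit <- stan_glmer(y ~ z + x1*z + x2*z + (1 + z | x3) + (1 + z | x4),
                  data = dat)

draws <- as.matrix(fit)              # posterior draws, one column per parameter
posterior_interval(fit, prob = 0.9)  # 90% posterior intervals for all parameters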

You can also aggregate however you want. It’s mathematically the same as Mister P; you’re just working with treatment effects rather than averages.
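As a hypothetical sketch of that aggregation: the treatment effect for a subgroup such as Black men is the overall slope on `z` plus that subgroup’s group-level deviations, computed draw by draw. Column names below follow rstanarm’s `b[z x3:level]` convention for group-level terms, but the level labels (`black`, `male`) are made up for illustration; check `colnames(as.matrix(fit))` for the actual names.

```r
draws <- as.matrix(fit)
eff <- draws[, "z"] +             # overall treatment coefficient
  draws[, "b[z x3:black]"] +      # deviation of the z slope for this race group
  draws[, "b[z x4:male]"]         # deviation of the z slope for this gender group
quantile(eff, c(0.05, 0.5, 0.95)) # posterior median and 90% interval
sd(eff)                           # plays the role of a standard error
```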

Great blog post. I really like rstanarm, but a mention of brms might be good here as well. The brms package is great for marginal-effects plots of interactions, which might be useful for part of this person’s question #3.

> The other thing is, yeah, forget type 1 error rates and all the rest. Abandon the idea that the goal of the statistical analysis is to get some sort of certainty. Instead, accept posterior ambiguity: don’t try to learn more from the data than you really can.

Measuring sensitivity and specificity is useful for binary decision problems like medical diagnostic testing. It’s not so useful for things like testing whether a regression coefficient is non-zero—as Andrew keeps pointing out, we don’t really expect any of these effects to be exactly zero. Instead of type-I error rates, can you just report posteriors for parameters like regression coefficients, along with some indication of whether they’d make much of a difference in practice? That way you can deal with sign, magnitude, and difference from zero in a unified framework. And it seems to be what people really want. That is, instead of using point estimates and reporting things like “two cups of coffee/day leads to 17 fewer days of life expectancy (all else being equal and average!)” that you see reported everywhere, including the popular press and clinical drug trials, you’ll at least get uncertainty in these predictions. The all-else-being-equal bit’s still a tough one, as it often mixes distinct subpopulation predictions.
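Concretely, from posterior draws of a fitted rstanarm model you can report sign and magnitude directly; a sketch, where the 0.1 threshold is purely illustrative and should be set by what matters in the application:

```r
draws <- as.matrix(fit)  # fitted rstanarm model from earlier
eff <- draws[, "z"]      # draws of the treatment coefficient
mean(eff > 0)            # posterior probability the effect is positive
mean(abs(eff) > 0.1)     # probability it is large enough to matter in practice
```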

One approach that you could try is the Piironen and Vehtari approach to variable selection. It’s like a souped-up L1 regularization that will effectively zero out parameters that aren’t useful, based on a clever extension of the horseshoe prior. Also, Jonah Gabry and Ben Goodrich developed useful default priors that are reasonably scaled as defaults; here’s the prior vignette from CRAN and here’s the general rstanarm::stan_glmer vignette.
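In rstanarm the regularized horseshoe is available via the `hs()` prior; a sketch, where the hyperparameter values are illustrative rather than recommendations (Piironen and Vehtari discuss how to choose the global scale from the expected number of relevant coefficients):

```r
library(rstanarm)

# Horseshoe prior on the (non-intercept) coefficients; shrinks unneeded
# interaction terms toward zero while leaving large effects mostly alone.
fit_hs <- stan_glm(y ~ z * (x1 + x2 + x3 + x4), data = dat,
                   prior = hs(df = 1, global_scale = 0.01),
                   family = gaussian())
```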

Good points.

Thanks for this!

Call me skeptical of the entire enterprise. What we want is a model that estimates effect parameters with a probability distribution that is a function of causal conditioning, P(Y|do(X)). What we have in exercises like this are models that give us consistent estimates only under probabilistic conditioning, P(Y|X). (I’m using Cosma Shalizi’s language here.) Sure, all the work in causal modeling shows us how to get consistent estimates under causal conditioning; what I’m skeptical of is that we can actually do this in practice. That is, we’re only going to approximate consistent estimates under fairly limited conditions. This is not a DAGs-versus-linear-models argument: I think DAGs are nice theoretical tools, but in practice they don’t help for questions like Mark White’s. But of course if everyone thought like me, whole fields of study would end today!

I didn’t mean to comment as Anonymous – a student came in and I hit send without filling in the boxes!

Thanks for the response! So assume I have `y`, the outcome; `z`, the treatment indicator; and `x1` to `x4`, some categorical demographic variables. I could do something like: `y ~ z + (1 + z | x1 + x2 + x3 + x4)`, but when there are only 2 or 3 levels for each demographic variable, I worry that we don’t have enough groups to estimate the variance of the treatment effect. So if `x1` was `f` or `m`, then we would be trying to estimate a mean and a variance from two groups, right?

It also looks like the `beanz` R package implements a lot of these ideas: https://cran.r-project.org/web/packages/beanz/beanz.pdf (based primarily on Jones HE, Ohlssen DI, Neuenschwander B, Racine A, Branson M (2011). “Bayesian models for subgroup analysis in clinical trials.” Clinical Trials, 8(2), 129–143.)