Mark Tuttle points us to this project by Martijn Schuemie and Patrick Ryan:

Large-Scale Population-Level Evidence Generation

Objective: Generate evidence for the comparative effectiveness of each pairwise comparison of depression treatments for a set of outcomes of interest.

Rationale: In current practice, most comparative effectiveness questions are answered individually, one study per question. This is problematic because of the slow pace at which evidence is generated, and because it invites reporting and publishing only those studies where the result is ‘statistically significant’, leading to an underestimation of the true number of tests performed when correcting for multiple testing. This process is known as publication bias. Moreover, these studies typically do not include the evidence needed to interpret the study results, such as empirical estimates of residual bias inherent to the study design and data used. A solution to these problems is to perform a large set of comparative effectiveness analyses in one study, where each analysis adheres to current best practices. One of these best practices that we’ll follow is to use large-scale propensity models to adjust for confounding. Another best practice that this study will follow is that each analysis will include a large set of negative and positive control outcomes (outcomes that are respectively not known or known to be caused by one exposure more than the other). In this study we would like to demonstrate the feasibility of generating population-level estimates at scale by focusing on one disease: depression. We perform every possible pairwise comparison between depression treatments for a large set of outcomes of interest. Most of these outcomes are generic safety outcomes, but some outcomes are related more specifically to the effectiveness of antidepressant treatment.
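The selective-reporting problem in that rationale is easy to see in a toy simulation (this is my sketch, not part of the study): if every true effect is null and only the “significant” comparisons get published, the published effect estimates are, by construction, biased away from zero.

```python
# Hypothetical simulation: 10,000 independent comparisons, all with a true
# effect of zero, but only "significant" results get reported.
import random
import statistics

random.seed(42)

N_STUDIES = 10_000
CRITICAL_Z = 1.96  # two-sided 5% threshold

estimates = [random.gauss(0.0, 1.0) for _ in range(N_STUDIES)]  # z-scored effects
published = [z for z in estimates if abs(z) > CRITICAL_Z]       # "significant" only

print(f"fraction published: {len(published) / N_STUDIES:.3f}")  # roughly 0.05
print(f"mean |effect|, all studies:       {statistics.mean(map(abs, estimates)):.2f}")
print(f"mean |effect|, published studies: {statistics.mean(map(abs, published)):.2f}")
```

Every published estimate here exceeds 1.96 in absolute value even though nothing is real, which is why running all the comparisons in one coordinated study, as the project proposes, matters.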

I don’t know anything about depression treatments, so all I have to offer is the general suggestion of analyzing all these comparisons together using a multilevel model, as discussed in this 2011 paper with Hill and Yajima. Then you get all the comparisons automatically.

> Then you get all the comparisons automatically.

I still haven’t quite figured out how to do this “automatically” in a Bayesian multilevel model. Imagine I have a treatment or control condition (X) and then a bunch of covariates (Z): gender (2 levels), education (5 levels), race (4 levels), political party (3 levels), income (5 levels).

Option 1: We could make a grouping variable of every combination, for 600 unique groups (2 × 5 × 4 × 3 × 5). The rstanarm::stan_lmer code would be: y ~ x + (1 + x | group). We could look at the random slope coefficient of x and see for which groups we are confident there is an effect vs. not, or do specific contrasts comparing groups. However: (a) this doesn’t lend itself to easily looking at marginal comparisons, such as men, black men, white women who are Democrats, or high-income people, and (b) the groups become very sparse: some might have only one person in them, making the calculation of a slope for that group impossible.
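The sparsity worry in (b) is concrete. Here is a quick sketch (levels and sample size are made up for illustration) that crosses the five covariates into one grouping factor and counts how many of the 600 cells end up empty or with a single person in a sample of 1,000:

```python
# Hypothetical sketch of Option 1's sparsity problem: cross all five
# covariates into a single grouping factor and tabulate cell sizes.
import itertools
import random
from collections import Counter

random.seed(1)

levels = {"gender": 2, "education": 5, "race": 4, "party": 3, "income": 5}
cells = list(itertools.product(*(range(k) for k in levels.values())))
assert len(cells) == 600  # 2 * 5 * 4 * 3 * 5

n = 1_000  # sample size, assumed uniform assignment for simplicity
counts = Counter(random.choice(cells) for _ in range(n))

empty = len(cells) - len(counts)
singletons = sum(1 for c in counts.values() if c == 1)
print(f"empty cells: {empty}, cells with exactly one person: {singletons}")
```

Real covariates are far from uniform, so in practice the imbalance is worse: a few cells are huge and many are empty, and a within-cell slope for x simply cannot be computed from one observation (partial pooling is what rescues those cells, but the per-cell information is still close to nil).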

Option 2: Crossed random factors design, which would be y ~ x + (1 + x | gender) + (1 + x | education) + (1 + x | race) + (1 + x | political_party) + (1 + x | income). However: (a) this doesn’t lend itself to finding more granular combinations, such as upper-class white women or lower-income college-educated men, and (b) some factors have so few levels that it becomes difficult to estimate a variance. For example, how do we estimate the variance of the intercept and slope when there are just two gender groups? I get divergent transitions in rstanarm doing this.
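Point (b) can be made quantitative with a small simulation (mine, not from the email): with only two levels, the natural estimate of the between-group variance has a standard error larger than the quantity being estimated, which is exactly the regime where samplers struggle without informative priors.

```python
# Hypothetical illustration: estimate a between-group variance from just
# two group intercepts, repeated many times, and look at how noisy it is.
import random
import statistics

random.seed(7)

TAU = 1.0       # true between-group standard deviation
REPS = 10_000

estimates = []
for _ in range(REPS):
    groups = [random.gauss(0.0, TAU) for _ in range(2)]  # e.g. two gender intercepts
    estimates.append(statistics.variance(groups))        # unbiased estimate of tau^2

print(f"mean estimate of tau^2: {statistics.mean(estimates):.2f}")   # near 1 on average
print(f"sd of the estimate:     {statistics.stdev(estimates):.2f}")  # exceeds the mean
```

With n = 2 the sampling distribution of the variance estimate is a scaled chi-squared with one degree of freedom: unbiased, but with standard deviation √2 times the true variance, so the data alone pin it down hardly at all.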

Mark:

I said it would be automatic, but not that it would be easy!

Seriously, the automaticity of the inferences is all conditional on the model, and you’re right that we can’t grab a model off the shelf for this problem. My recommendation would be something like your Option 2 but with some changes for computational efficiency. Just to start, I’d treat political party as a continuous predictor rather than as a factor, and I’d treat male/female as a linear predictor, not a factor with two levels. I might also want to use informative priors for the group-level variance parameters.
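To make the recoding suggestion concrete, here is one way it might look (the column names and numeric codings are my illustrative choices, not anything from the study): party becomes one continuous predictor and male/female one centered numeric column, while the factors with more levels stay as grouping variables.

```python
# Hypothetical recoding sketch: two-level and ordered factors become
# numeric predictors; only the larger factors remain as grouping variables.
PARTY_CODE = {"Democrat": -1.0, "Independent": 0.0, "Republican": 1.0}
GENDER_CODE = {"female": -0.5, "male": 0.5}  # centered, so the intercept sits at the average

def encode(row):
    """Turn one respondent's covariates into model inputs."""
    return {
        "party": PARTY_CODE[row["party"]],   # a single slope instead of a 3-level factor
        "male": GENDER_CODE[row["gender"]],  # a single linear term instead of 2 levels
        "education": row["education"],       # kept as grouping factors,
        "race": row["race"],                 # e.g. (1 + x | education) in the formula
        "income": row["income"],
    }

print(encode({"party": "Independent", "gender": "female",
              "education": "college", "race": "white", "income": "q3"}))
```

In the rstanarm formula this would mean moving gender and political_party out of the (1 + x | …) terms and into the fixed part of the model, sidestepping the two-level variance problem entirely.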

This question comes up a lot. To start with, I’d want to go through one example from beginning to end, which I haven’t done yet.

> To start with, I’d want to go through one example from beginning to end, which I haven’t done yet.

I would very much be interested in this!

Depression is a notoriously difficult condition to try to analyze across trials. There is huge trial-to-trial variation for the same drug and dose against placebo. The placebo response can be very large. And over time, the patient population in these trials has changed both diagnostically and with respect to history of disease. I would have suggested schizophrenia rather than depression if they want to examine mental health. Maybe a good concept but a lousy target population?