Skip to content

Fitting a hierarchical model without losing control

Tim Disher writes:

I have been asked to run some regularized regressions on a small N high p situation, which for the primary outcome has lead to more realistic coefficient estimates and better performance on cv (yay!). Rstanarm made this process very easy for me so I am grateful for it.

I have now been asked to run a similar regression on a set of exploratory analyses where authors are predicting the results of 4 subscales of the same psychological test. Given the small sample and opportunity for type M and S errors I had originally thought of trying to specify a multivariate normal model, but then remembered your paper on why we don’t usually worry about multiple comparisons.

I am new to translating written notation of multilevel models into R code, but I’m wondering if I’m understanding your eight schools with multiple outcomes example properly. Would the specification in lmer just be:

y ~ 1 + (1 + B1 + B2 | outcome)

Where outcome is my factor of subscales, y is the standardized test outcome, and B1 and B2 are standardized slopes I want to allow to vary by subgroup? This seems to make sense to me in that it’s coding my belief that the slopes between subgroups are similar (and thus hopefully pulling extreme estimates closer to the overall mean), but it seems too easy, so I figure I must be doing something wrong. The results also end up leading to switching signs in in the coefficients when compared against the no pooling results. Not sure whether to be excited about potentially avoiding a type S error, or scared that I’ve stuffed up the whole analysis!

My reply:

That looks almost right to me as a starting point, but one thing it’s missing is the idea that the 4 subscales could be correlated. Perhaps people with higher scores on subscale 1 also tend to have higher scores on subscale 2, for example?

How best to model the correlation? It depends on what these subscales are doing. Most general is a 4×4 covariance matrix (which, incidentally, allows the variances to be different for the different subscales, something not allowed in your model above), but something sort of item response model could make sense if you think all the subscales are measuring related things.

In any case, I guess you could start with the model above but then I’d move to fitting a multivariate-outcome model in Stan.

Finally, regarding the larger question of making sure that your model is doing what you think it’s supposed to be doing: I very much recommend fake data simulation. Set up your model, do a forward simulation and create fake data, then fit the model to your fake data and check that the results make sense and are consistent with what you were assuming.


  1. Tim says:

    Thanks Andrew,

    I wonder if you or any of your readers could identify any seminal texts/repositories for building up simulation skills?

  2. Ben Goodrich says:

    The stan_mvmer function in rstanarm can model up to 3 outcomes simultaneously.

Leave a Reply