A question on multilevel modeling with correlated outcomes

Someone named David who wants to keep his last name secret writes:

A pretty common type of analysis in cognitive neuroscience is “representational similarity analysis.” Essentially, this means trying to model the similarity function, s_ij, between the multivariate neural response (e.g., several fMRI voxels in a particular brain region) at two time points t_i and t_j. Many similarity measures are used but (Fisher-transformed) Pearson correlation is a pretty common one.

This kind of analysis is used all over cog neuro but in my field of memory people often use it to measure when/what aspects of a previous experience get reactivated and the effects of that on later behavior. We might present a bunch of images to someone and then, after a delay, see which ones they recognize as having previously seen. We can look at times during the delay period, t_d, and compare them to when a particular image i was presented (at t_i). We might then want to look at whether images that people later recognize are more strongly reactivated during t_d. We can compute the similarity between the delay period activity and the initial presentation of image i, s_id, and then fit a model s_id ~ recognized_i. Even better, since we often have several subjects and several sets of images per subject (aka trials), we could fit a mixed effects model, allowing the intercepts and slopes to vary by subject, trial, etc.

The issue with fitting a straight linear model is that these s_id values are not independent. Often, if s_id is high for one image i, then s_jd will be low for another image j, since the representations for the images are often unrelated to each other and time d is used for both similarity measures. This induces high correlations between the slopes and intercepts, making it hard to fit these models (lmer often results in singular fits, and rstanarm/brms take forever because of the high correlations).

Do you happen to know of a better way to fit this type of model? Maybe it would be possible in Stan to fit the whole correlation matrix (across all image presentation time points i, j… and also delay time d) as an outcome variable in a GLM? But that seems very complex and possibly hard to interpret.
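
The similarity measure described above — Fisher-transformed Pearson correlation between two multivoxel patterns — can be sketched in a few lines. (A minimal numpy sketch; the pattern vectors here are simulated stand-ins, not real fMRI data.)

```python
import numpy as np

def fisher_z_similarity(pattern_i, pattern_j):
    """Similarity between two multivoxel response patterns:
    Pearson correlation passed through the Fisher z-transform
    (arctanh), which makes the values roughly normally distributed."""
    r = np.corrcoef(pattern_i, pattern_j)[0, 1]
    return np.arctanh(r)

rng = np.random.default_rng(0)
pattern_i = rng.normal(size=50)                   # e.g., 50 voxels at time t_i
pattern_j = pattern_i + rng.normal(size=50)       # a noisier, related pattern at t_j
s_ij = fisher_z_similarity(pattern_i, pattern_j)  # one entry of the similarity matrix
```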

This is an example of a problem that I think I could definitely help solve if I were sitting next to the researcher, who could sketch some pictures, answer questions, make some plots, etc., but which I can’t really do much with from the above description because I’m just too unfamiliar with the details. The bit about rstanarm/brms taking forever makes me think there’s something wrong with how the model was set up—I’m thinking folk theorem here.

Anyway, I just wanted to share this with you, as it’s kind of interesting that there’s this large class of problems that I know I could help with, but I don’t know the exact steps of workflow that I would use. It’s kind of as if . . . I dunno, as if I’m an auto repairman and someone calls me up with a bunch of details about his car, and I can’t quite follow what he’s saying but I’m pretty sure I could fix the damn thing if he were to bring it into the garage. I can see how this can be frustrating to people. There are a lot of examples in our books, but unfortunately we don’t have examples for everything!

9 thoughts on “A question on multilevel modeling with correlated outcomes”

  1. I think the best analysis of this problem is due to Mingbo Cai & colleagues here

    https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006299

    It pretty much nails what you are asking about, afaict, though I don’t have experience with whether the solution (a big Bayesian multilevel model, of course, though I think they estimate it with some sort of approximate inference rather than sampling) is actually practical.

    • Thanks so much for the pointer! I knew of the Cai et al. paper but hadn’t read it in a while so definitely a great opportunity to go back… I definitely might be wrong but I think the problem Cai deals with is slightly different. My understanding is that Cai et al point out that any point estimate of trial activity will be by default more correlated with some trials than others and if this has any relationship with the RSA model, there will be a spurious result that “this brain signal represents your preferred description of the space.”

      The problem I mentioned I think is slightly different and doesn’t depend on misestimation of the trial response. A lot of memory research compares the same time point A (e.g., prior to retrieval or during some delay) to several others (e.g. encoding time points B and C). Even if you estimated the neural response at each time point perfectly, the difference in correlation(A, B) and correlation(A, C) (often a measure of the “reactivation of B at time point A”) will be constrained by the correlation(B, C). If we are just trying to fit a linear model where the outcome/y variable is correlations with A (e.g. correlation(A, i) ~ memory performance_i) it will be biased by not including information about correlations between all of the i’s. And even if we are fine with that bias, it makes fitting a multilevel model to these correlations hard because of the correlations in parameter estimates (although Ben Goodrich points out below that that may not be the case in the rstanarm/brms parameterization…).

      It is definitely possible that I’m totally wrong and this somehow reduces to the problem that Cai et al. were discussing so, if so, I’d be very glad to hear it!
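
      To make the constraint concrete: a 3×3 correlation matrix must be positive semidefinite, so once correlation(B, C) is fixed, the pair (correlation(A, B), correlation(A, C)) cannot vary freely. A quick numpy check with toy values (not real data):

```python
import numpy as np

def is_valid_corr(r_ab, r_ac, r_bc):
    """True if the three pairwise correlations can coexist, i.e.,
    the implied 3x3 correlation matrix is positive semidefinite."""
    R = np.array([[1.0,  r_ab, r_ac],
                  [r_ab, 1.0,  r_bc],
                  [r_ac, r_bc, 1.0]])
    return np.linalg.eigvalsh(R).min() >= -1e-9

# If B and C are nearly anticorrelated, A cannot correlate
# strongly with both of them at once.
print(is_valid_corr(0.8, 0.8, -0.9))   # impossible combination
print(is_valid_corr(0.8, -0.8, -0.9))  # possible combination
```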

  2. > rstanarm/brms takes forever because of the high correlations

    While rstanarm may take a long time to draw from the posterior distribution of some multilevel model, that is rarely because the slopes and intercepts are highly correlated, since rstanarm (and brms) use a noncentered parameterization for them.
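
    For readers who haven’t seen the trick: the noncentered parameterization samples a standard normal variate and then shifts and scales it, so the sampler explores a quantity that is a priori independent of the group-level mean and scale. A minimal numpy sketch of the reparameterization itself (toy numbers, not rstanarm’s internals):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, tau, n = 2.0, 0.5, 100_000  # toy group-level mean and scale

# Centered: sample the varying effect theta directly.
theta_centered = rng.normal(mu, tau, size=n)

# Noncentered: sample a standard normal z and shift/scale it.
# theta has the same distribution, but the sampler works with z,
# which is a priori independent of mu and tau.
z = rng.normal(0.0, 1.0, size=n)
theta_noncentered = mu + tau * z
```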

    > Maybe it would be possible in Stan to fit the whole correlation matrix (across all image presentation time points i, j… and also delay time d) as an outcome variable in a GLM?

    It wouldn’t be a GLM but it would be a model that can be done in Stan, although using the covariance matrix of the data with a Wishart likelihood or something would presumably be better.

    • Thanks so much for this, this was very helpful! I guess maybe the reason for the slow fit then has more to do with the misspecification than the correlation? Is there an example anywhere that you know of where this kind of Wishart model has been used?

      • It would be similar to structural equation modeling with latent variables where people model the covariance matrix of the data rather than the data vectors themselves. The scientific question is how to construct the expectation parameter in the Wishart distribution.
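
        As a sketch of that idea (toy numbers; the structured Sigma below stands in for whatever the expectation parameter would be): if S is the scatter matrix of n iid multivariate normal draws with covariance Sigma, then S follows a Wishart(n, Sigma) distribution, and competing covariance structures can be compared by their likelihood for S:

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(2)
p, n = 3, 200
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])

# Scatter matrix of n iid N(0, Sigma) draws: S ~ Wishart(n, Sigma).
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = X.T @ X

# The modeling question is how Sigma is parameterized; here we just
# compare the true structure against an independence model.
ll_true = wishart.logpdf(S, df=n, scale=Sigma)
ll_indep = wishart.logpdf(S, df=n, scale=np.eye(p))
```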

    • > It wouldn’t be a GLM but it would be a model that can be done in Stan, although using the covariance matrix of the data with a Wishart likelihood or something would presumably be better.

      Could you elaborate on that? It sounds like you’re describing a parameterization where you can just model the covariance matrix between coefficients and don’t have to compute their values at all? But I’m having trouble imagining how the algebra works out in a hierarchical model.

  3. > then s_jd will be low for another image j since the representations for the images are often unrelated to each other and time d is used for both similarity measures

    If we know why the activation for j is low when i is high — should that somehow be baked into the model, given that it explains a bunch of variance? Do we have groupings of images or something? Or are these groupings something that is trying to be learned?

    I also don’t know about this application and so could easily be very off base here or misunderstanding the setup.

  4. My first instinct is to say QR parameterization, though I’m not sure how much it helps when your beta “coefficients” are a matrix rather than a vector.
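
    For the vector-coefficient case, the QR trick looks like this (numpy sketch; in Stan one samples the coefficients on the Q scale and maps back through R):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 500, 4
X = rng.normal(size=(n, k))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=n)  # a nearly collinear column
beta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta + rng.normal(size=n)

# Thin QR decomposition: X = Q R, with orthonormal columns in Q.
Q, R = np.linalg.qr(X)

# Fit on the well-conditioned Q, then map the coefficients back.
beta_tilde, *_ = np.linalg.lstsq(Q, y, rcond=None)
beta_hat = np.linalg.solve(R, beta_tilde)

# Identical to fitting on X directly, but a sampler exploring
# beta_tilde faces far less correlation than one exploring beta.
beta_direct, *_ = np.linalg.lstsq(X, y, rcond=None)
```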

    My second thought is that it sounds like (though I could be wrong) you’re assuming uncorrelated noise between voxels. My guess is that error in one voxel is strongly correlated with error in another voxel, and you might want to use multivariate noise with a manually specified covariance matrix, where the error covariance between two voxels is determined by a Matérn kernel on the Euclidean distance between them, or something like that.
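
    A sketch of that suggestion, with a hypothetical Matérn-3/2 kernel over Euclidean voxel distances (the length scale and voxel grid here are made up for illustration):

```python
import numpy as np

def matern32_cov(coords, length_scale=2.0, sigma2=1.0):
    """Matérn-3/2 covariance between voxels as a function of their
    Euclidean distance: nearby voxels get strongly correlated noise,
    distant voxels nearly independent noise."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    a = np.sqrt(3.0) * d / length_scale
    return sigma2 * (1.0 + a) * np.exp(-a)

# Hypothetical voxel coordinates on a small 3-D grid.
coords = np.array([[x, y, z]
                   for x in range(3) for y in range(3) for z in range(2)],
                  dtype=float)
K = matern32_cov(coords)  # 18 x 18 error covariance matrix
```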

  5. A speculative guess: this is similar to a problem in reinforcement learning, the sparse-rewards problem. That is, when the output (a reward) is many steps away from the input, rewards that are closer receive a higher function score; so in games where rewards are many steps apart and have to be accumulated, you end up with incorrect correlations.

    The most common approach to the problem is reward shaping: latent rewards are placed between the input x and the sparse reward y, in order to find a fit that best accounts for the misattribution of the sparse reward. (This method is difficult and only works in some cases, but it is intuitive and helps to redefine the model and data set in terms of variables that have the necessary fit between them.)

    In any case, this paper presents a broad overview of some popular methods of how to deal with sparse rewards:

    https://medium.com/@m.k.daaboul/dealing-with-sparse-reward-environments-38c0489c844d
