
His varying slopes don’t seem to follow a normal distribution

Bruce Doré writes:

I have a question about multilevel modeling I’m hoping you can help with.

What should one do when random effects coefficients are clearly not normally distributed (i.e., coef(lmer(y~x+(x|id))) )? Is this a sign that the model should be changed? Or can you stick with this model and infer that the assumption of normally distributed coefficients is incorrect?

I’m seeing strongly leptokurtic random slopes in a context where I have substantive interest in the shape of this distribution. That is, it would be useful to know if there are more individuals with “extreme” and fewer with “moderate” slopes than you’d expect of a normal distribution.

My reply: You can fit a mixture model, or even better you can have a group-level predictor that breaks up your data appropriately. To put it another way: What are your groups? And which are the groups that have low slopes and which have high slopes? Or which have slopes near the middle of the distribution and which have extreme slopes? You could fit a mixture model where the variance varies, but I think you’d be better off with a model using group-level predictors. Also I recommend using Stan which is more flexible than lmer and gives you the full posterior distribution.

Doré then added:

My groups are different people reporting life satisfaction annually surrounding a stressful life event (divorce, bereavement, job loss). I take it that the kurtosis is a clue that there are unobserved person-level factors driving this slope variability? With my current data I don’t have any person-level predictors that could explain this variability, but certainly it would be good to try to find some.


  1. Dean Eckles says:

    Besides using a finite mixture, one could use an infinite mixture (e.g., a Dirichlet process mixture).

    Also, I’m looking forward to better support for discrete variables in Stan. Writing mixtures is currently quite cumbersome in Stan (compared with BUGS and JAGS).

  2. If you don’t have individual predictors that could help create an individual level model, the best you can do is create a frequency model for the size of the effects in the population where you have a representative sample.

    In Stan, you can provide a prior over the coefficients that is broad enough to cover the range of things that are reasonable, and then set each person’s coefficient to be from this common prior… Then you’ll have a posterior distribution over each person’s coefficient. But, pooling samples across all the people (integrating the probability over all the people) gives you a posterior distribution of coefficients for “a randomly selected person from my population” which need not at all be normally distributed, even if the prior over coefficients was, and even if the posterior over each person was.

    • In other words, this *is* a mixture model where you’re relying on the sampling of people to make the mixture have valid coverage for the population. If you know something about the individuals and can “poststratify,” you can assign each individual a poststratification weight. To get a sample from the corrected population, repeatedly choose a person with probability proportional to their poststratification weight, and then choose one posterior draw of that person’s coefficient. Repeat until you have a large sample of coefficients.
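
The resampling recipe in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `poststratified_draws`, the three people, their posterior draws, and their weights are all hypothetical, not values from Doré's data.

```python
import random

def poststratified_draws(posterior_draws, weights, n_draws, seed=0):
    """Pick a person by poststratification weight, then one of that person's draws."""
    rng = random.Random(seed)
    people = list(posterior_draws)
    person_weights = [weights[p] for p in people]
    out = []
    for _ in range(n_draws):
        # Step 1: choose a person with probability proportional to their weight.
        person = rng.choices(people, weights=person_weights, k=1)[0]
        # Step 2: choose one posterior draw of that person's coefficient.
        out.append(rng.choice(posterior_draws[person]))
    return out

# Hypothetical posterior draws per person and hypothetical weights.
posterior_draws = {"a": [9.8, 10.1, 10.3], "b": [11.7, 12.2], "c": [290.0, 310.0]}
weights = {"a": 0.5, "b": 0.4, "c": 0.1}
sample = poststratified_draws(posterior_draws, weights, n_draws=5000)
```

With equal weights this reduces to simply pooling everyone's posterior draws, which is the unweighted finite mixture described in the comment.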

    • Corey says:

      Right, so the issue here is that the coefficient population is being implicitly modelled as normal, the only uncertainty being in the overall mean and overall variance; we can never update out of the normal population model. This model forces the amount of shrinkage to depend on the ratio of the sampling variance and (estimated) population variance. (This is what lmer does, albeit in a non-fully-Bayes way.) If the underlying population is leptokurtic, then normal model shrinkage will over-shrink the tails and under-shrink the center. But we’re told that even with the over-shrinkage of the tails, the coefficient estimates are too leptokurtic to trust the normal population model — so yes, the model should be changed.
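
To make the shrinkage point concrete, here is a minimal sketch of the partial-pooling estimate under a normal population model. It assumes the population mean `mu` and standard deviation `tau` are known, which is a simplification (lmer estimates them), and the function name and the numbers in the usage lines are hypothetical.

```python
def shrunken_estimate(raw, se, mu, tau):
    """Posterior mean of one coefficient under a normal(mu, tau) population model.

    raw: the person's own estimate; se: its sampling standard error.
    """
    # Weight on the person's own estimate depends only on the variance ratio.
    weight = tau**2 / (tau**2 + se**2)
    return mu + weight * (raw - mu)

# Hypothetical numbers: the same fractional pull toward mu applies to a
# moderate slope and to an extreme one.
moderate = shrunken_estimate(raw=15.0, se=10.0, mu=12.0, tau=10.0)
extreme = shrunken_estimate(raw=300.0, se=10.0, mu=12.0, tau=10.0)
```

Because the weight does not depend on how far `raw` is from `mu`, an extreme slope is pulled in by the same fraction as a moderate one, which is the behavior a heavier-tailed population model would relax.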

      • In a Bayesian model where you say

        coefficient[i] ~ normal(0,1000);

        and then you have data that puts the likelihood of coefficient[1] concentrated around say 10, coefficient[2] concentrated around 12, coefficient[3] concentrated around 15, coefficient[4] concentrated around 300 (in a big tail)…

        the resulting mixture would be non-normal because although the plausibility you assign to coefficients[i] pre-data is normal, each one concentrates separately around whatever the data[i] suggests. It’s the prior that’s normal, and maybe it’s also the case that the individual posteriors are normal, but *not* the marginal posterior summed across the individual people.

        on the other hand, if you say

        coefficient[i] ~ normal(mu,sigma)

        and mu,sigma are themselves hyper-parameters, then you’re going to have potentially weird distributions over mu,sigma, and the continuous normal mixture you get there could be non-normal. For example, a t distribution is a continuous mixture of normals with an inverse-gamma distribution on the variance sigma^2, right?

        None of this is apropos of lmer, but in Bayesian calcs, the normality of the initial priors and of the individual likelihoods doesn’t imply normality of the finite mixture across the n coefficients. In fact, what you’re doing by fitting a Gaussian to each person and then marginalizing is, in essence, fitting a Gaussian mixture model to the full population.

        I think when you say the model should be changed you’re suggesting to move from lmer to the full type of model I’m talking about here, or are you suggesting something else, for example that my suggested model has problems at the prior or hyper-prior level?
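
The claim that a t distribution is a continuous mixture of normals can be checked by simulation. The sketch below assumes the standard construction in which the variance is inverse-gamma with shape and scale both nu/2 (equivalently, the precision is gamma with shape nu/2 and scale 2/nu); the function names and the choice nu = 20 are illustrative assumptions.

```python
import math
import random
import statistics

def t_via_scale_mixture(nu, n, seed=1):
    """Draw normals whose variance is inverse-gamma(nu/2, nu/2)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # gammavariate takes (shape, scale); this is the precision 1/sigma^2.
        precision = rng.gammavariate(nu / 2, 2 / nu)
        sigma = 1 / math.sqrt(precision)
        draws.append(rng.gauss(0.0, sigma))
    return draws

def excess_kurtosis(xs):
    """Sample excess kurtosis: fourth central moment over squared variance, minus 3."""
    m = statistics.fmean(xs)
    m2 = statistics.fmean([(x - m) ** 2 for x in xs])
    m4 = statistics.fmean([(x - m) ** 4 for x in xs])
    return m4 / m2**2 - 3

draws = t_via_scale_mixture(nu=20, n=100_000)
variance = statistics.pvariance(draws)
kurt = excess_kurtosis(draws)
```

For a t distribution with nu degrees of freedom the variance is nu/(nu-2) and the excess kurtosis is 6/(nu-4), so with nu = 20 the simulated values should land near 1.11 and 0.375, with positive excess kurtosis even though every individual draw is conditionally normal.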

        • Corey says:

          When Doré writes:

          I’m seeing strongly leptokurtic random slopes in a context where I have substantive interest in the shape of this distribution. That is, it would be useful to know if there are more individuals with “extreme” and fewer with “moderate” slopes than you’d expect of a normal distribution.

          I take it to mean that what he cares about is not just the values of the coefficients for those individuals in his sample but rather the distribution of the coefficients in the general population of people who undergo stressful life events. If so, what he wants is the posterior predictive distribution of a new coefficient. Letting the index i refer to such a coefficient, then

          coefficient[i] ~ normal(0,1000)

          prevents any learning at all and

          coefficient[i] ~ normal(mu,sigma)

          necessarily has 0 excess kurtosis.

          • Corey says:

            Ack, stupid brainfart:

            coefficient[i] ~ normal(mu,sigma)

            doesn’t have 0 excess kurtosis; forgot that it gets mixed over the posterior of the hyperparameters. Dumb dumb dumb.

          • Sorry, I was just writing the priors, since I have no knowledge of what his likelihood looks like, so

            coefficient[i] ~ normal(0,1000) is the prior, and then presumably there is some

            data[i] ~ somedistribution(f(coefficient[i]))

            which is where the learning comes in. Posterior distribution over the coefficient[i] will be concentrated around the result for that individual.

            Then, the marginal distribution of ALL the coefficients (for all i) is the distribution he’s interested in (i.e., the posterior predictive distribution of a new coefficient).

            and, yes, the normal(mu,sigma) mixes across the hyperparameters.

            • Ok, I’m now seeing a good point though. In the case where the prior is normal(0,1000) and the posterior is mixed over the N different subjects, you get a finite mixture model for the coefficients. If the number of subjects is small and the tails are heavy, you might not have a great model here: with each subject’s posterior sitting on its own, the mixture might look like a bunch of little discrete lumps.

              On the other hand,

              normal(mu,sigma) with broad priors on mu and sigma, after applying the likelihood to the data, will produce posterior distributions on mu,sigma which give a continuous mixture that is perfectly capable of having the appropriate heavy tail. Then if he uses an extra parameter “randomcoeff” and says:

              randomcoeff ~ normal(mu,sigma)

              and gets a sample of “randomcoeff” it will be a sample from the posterior predictive distribution for a randomly chosen person from the population using the continuous mixture model.
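
Outside of Stan, the same posterior predictive draw can be sketched from saved hyperparameter draws: for each posterior draw of (mu, sigma), draw one new coefficient. The function name and the four pairs of draws below are hypothetical placeholders, not output from any fitted model.

```python
import random

def predictive_coefficients(mu_draws, sigma_draws, seed=2):
    """One new-person coefficient per posterior draw of (mu, sigma)."""
    rng = random.Random(seed)
    # Pair each mu draw with the sigma draw from the same posterior iteration.
    return [rng.gauss(mu, sigma) for mu, sigma in zip(mu_draws, sigma_draws)]

# Hypothetical posterior draws of the hyperparameters.
mu_draws = [11.8, 12.4, 12.1, 11.9]
sigma_draws = [2.0, 9.5, 3.1, 2.4]
randomcoeff = predictive_coefficients(mu_draws, sigma_draws)
```

Keeping mu and sigma paired by iteration matters, because it preserves their posterior dependence when mixing over the hyperparameters.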

  3. I may be misinterpreting this, but it comes as no surprise to me that the slopes would be highly variable here and that their distribution would be strongly leptokurtic. It seems that the author is considering a wide range of stressful events at once. Some might have a long-term effect on life satisfaction; others, a large effect over a short time.

    A possible group-level predictor might be the effect of the event on one’s economic stability. That’s just a start. A job loss when you have plenty of money in the bank is different from such a loss when you’re steeped in debt.

    Another might be the nature of bereavement–in particular, its suddenness and unexpectedness.

    I could go on and on. It seems that the study may have overgeneralized stress and life satisfaction from the start. I may be wrong here; maybe the study does break down the various kinds of stressors and losses.

  4. Mark Benson says:

    I have a question as well. Something that seems very basic but that I could never find (maybe it does not exist?).
    I would like to have a cheatsheet to find a distribution.
    I am thinking of a decision tree.
    Something like this: Is your distribution discrete? => Yes => Are the events successes or failures, such as a coin toss? => Yes => Do you run a single experiment? => Yes => Your distribution is a Bernoulli… followed by an explanation of the distribution.
    It seems to me that something like that must already be out there somewhere, but I really can’t find it.
    Did you ever run into such a thing?

  5. Nick Menzies says:

    I am surprised not to see a suggestion to use a more flexible distribution for the random effects. Would another option be to assume the random effects were distributed Student’s t, with an uncertain hyperparameter for the df? I have seen this kind of model at the observation level but not at the group level before, so not sure if there is a catch.
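
As a rough illustration of why a Student’s t population model tolerates extreme slopes better than a normal one, the sketch below compares the two log densities at a point five scale units from the center. The function names are hypothetical, and nu is fixed at 3 here purely for illustration, whereas the comment proposes treating the degrees of freedom as an uncertain hyperparameter.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def student_t_logpdf(x, nu, mu, sigma):
    """Log density of a location-scale Student's t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

# A slope 5 scale units from the population center is far less
# surprising under t(3) than under the normal.
lp_normal = normal_logpdf(5.0, 0.0, 1.0)
lp_t = student_t_logpdf(5.0, 3.0, 0.0, 1.0)
```

As nu grows large the t log density approaches the normal one, so a model with an uncertain nu contains the normal population model as a limiting case.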
