
Using the aggregate of the outcome variable as a group-level predictor in a hierarchical model

When I was a kid I took a writing class, and one of the assignments was to write a 1-to-2 page story. I can’t remember what I wrote, but I do remember the following story from one of the other kids. In its entirety:

I snuck into this pay toilet and I can’t get out!

In the discussion period, the kid explained that his original idea was a story explaining the character’s situation, how he got into this predicament and how he got stuck. But then he (the author) realized that the one sentence captured the whole story, there was really no need to elaborate.

(To understand the above story, you have to know the following historical fact: Pay toilets in the U.S., decades ago, were not the high-security objects shown (for example) in the picture above. Rather, they were implemented via coin-operated locks on individual toilet stalls. So it really would be possible to sneak into certain pay toilets, if you were willing to crawl under the door or climb over it.)

Anyway, this is all preamble to a very short statistics story.

Jessica Smith wrote in with the following question:

In multilevel modeling, is it appropriate to aggregate the outcome variable and include it as a control variable at the contextual level? For example, if you’re predicting depression as a person-level outcome, is it appropriate to control for average neighborhood-level depression? If it is or isn’t appropriate, is there something I can cite along these lines?

My reply: In this case, I think you have to be careful. It is better to avoid using a variable to predict itself.

There were all sorts of things I could’ve said, about simultaneous-equation models and measurement-error models and latent variables and time sequences. But, upon reflection, it seemed to me that the two-sentence answer said it all.


  1. John Hall says:

    I think that’s an interesting point. I’ve been thinking a lot about it lately with applications to asset pricing models. For instance, if you use a hierarchical model for equities, your argument implies that you shouldn’t include the market index at the top level. This is in contrast to virtually all asset pricing models (like CAPM, Fama-French, APT), which regress individual equities against the market. What I haven’t figured out is how to account for the high-beta/low-beta aspect of things.
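For reference, the market-model regression the comment alludes to is just an OLS of one stock's returns on the index's returns. Below is a toy sketch with simulated daily returns; `true_beta` and all sizes are made-up illustration values, not anything from the thread:

```python
import numpy as np

rng = np.random.default_rng(5)
market = rng.normal(0.0, 0.01, size=250)               # simulated daily market returns
true_beta = 1.3                                        # made-up "true" exposure
stock = true_beta * market + rng.normal(0.0, 0.01, size=250)

# CAPM-style regression: OLS slope of the stock's returns on the market's
mc = market - market.mean()
beta_hat = (mc @ (stock - stock.mean())) / (mc @ mc)
print(round(beta_hat, 2))  # recovers a beta near 1.3
```

The hierarchical alternative would instead treat individual equities as partially pooled toward a market-level distribution, rather than conditioning on the index as a regressor.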

  2. konrad says:

    But that answer can be read as suggesting that neighbourhood-level effects should be _ignored_ – I take it that’s not what you would recommend? Sounds like a textbook example of a hierarchical model (random effects / multilevel / partial pooling – call it what you will).

  3. Hierarchical models can have this flavor to people unfamiliar with them. For example, in Stan notation:

    CountyAvgDep[c] ~ normal(StateDepLevel[state[c]], CountySigma[state[c]]);

    indivDepr[i] ~ normal(CountyAvgDep[county[i]], IndivSigma[county[i]]);

    where, say, StateDepLevel is data you have from state-level agencies, CountyAvgDep is data you have from county agencies, and indivDepr[i] is the measurements you’ve made on individual patients who live in county[i]….

    Someone else has aggregated individual depression scores for patients at the county level, and someone further has aggregated county-level scores by state… but those scores may or may not have included this patient. If we happen to know that there are no consistent differences between people who were and were not included, then exchangeability assumptions tell us that, whether or not this individual was included, their outcome can more or less be modeled as coming from the same distribution.

    I think aggregate measurements like this are perfectly reasonable ways to help home in on individual values. On the other hand, as soon as you know that ALL of your dataset and ONLY your dataset was included in the aggregate, you’ve got a problem with exchangeability. So if you generated CountyAvgDep by taking mean(indivDepr) over the individuals in your dataset who lived in that county, forget it.
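To make that last point concrete, here is a minimal simulation (made-up sizes, not from the comment) of the mechanical dependence you get when the county aggregate is computed from exactly the individuals you are modeling: even with pure noise and no county differences at all, each outcome is correlated with its own in-sample county mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n_counties, per_county = 50, 5
county = np.repeat(np.arange(n_counties), per_county)
indivDepr = rng.normal(size=n_counties * per_county)  # pure noise, no county effect

# CountyAvgDep built from the SAME individuals we are modeling:
county_means = np.array([indivDepr[county == c].mean() for c in range(n_counties)])
own_mean = county_means[county]

# each y_i appears inside its own predictor, so the correlation is mechanical
r = np.corrcoef(indivDepr, own_mean)[0, 1]
print(round(r, 2))  # roughly 1/sqrt(per_county) ≈ 0.45, from noise alone
```

An externally collected aggregate, or a leave-one-out mean, would not contain y_i and so would not produce this artifact.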

    • dab says:

      As someone who is not that familiar with multilevel models, please allow me to ask a (likely stupid) follow-up that is unrelated to the original question. What if, in your example, we were modeling something besides depression, like some measure of suicide risk or maybe alcoholism (I don’t know; I’m not a social scientist). Then, there is no longer the obvious problem of using the thing you’re trying to model as an input. Would it make sense in that case, if you only had individual level depression data, to aggregate that data to the county and state level and use those as separate predictors, along with the individual level depression score? (For example, one might hypothesize that not only an individual’s own level of depression but also the mean level of depression among those in his or her vicinity would be predictive of the response.) Or does the model only make sense if you seek out aggregate data from state and county agencies that may or may not include the individuals in your data set?

      • I think the answer to your question is: it depends on the structure of your model. It seems perfectly reasonable to me to do what you’re talking about in some contexts. In essence, the two variables contain different information. The individual measurement is about the state of mind of the individual, and the aggregated measurement is about spatial, regional, and cultural aspects of the environment. However, if you’re also including information about, say, the subject’s contacts, then the aggregated local statistics and the specific contact information may compete with each other for identifiability of the regional/social effects they each proxy for.

        I’m not a social scientist either, but regardless of the field, I like to think in terms of what information the variables give us. To the extent that two variables give us very similar information, they shouldn’t both be included in the model; to the extent that they give us different information, they can be. To give an extreme example, we shouldn’t use both the age of the subject in years and the age of the subject in days… they give precisely the same information, just rescaled.
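The extreme example is easy to check numerically (illustrative ages, not real data): the two encodings are perfectly collinear, so the second one adds no information.

```python
import numpy as np

rng = np.random.default_rng(4)
age_years = rng.uniform(18, 80, size=100)  # hypothetical subject ages
age_days = age_years * 365.25              # the same variable, rescaled

r = np.corrcoef(age_years, age_days)[0, 1]
print(round(r, 6))  # correlation of 1: no new information in the second column
```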

        For example, if your aggregate statistic is over, say, 4 subjects, then it is obviously much more strongly affected by the individual data of a given subject than if it’s the aggregate over 4000 subjects.
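As a quick sketch of that point (illustrative numbers only): a unit change in one subject's score moves the group mean by about 1/n, so the 4-subject aggregate is a thousand times more sensitive to any one individual than the 4000-subject aggregate.

```python
import numpy as np

rng = np.random.default_rng(1)
shifts = {}
for n in (4, 4000):
    y = rng.normal(size=n)
    perturbed = y.copy()
    perturbed[0] += 1.0                    # bump one subject's score by 1
    shifts[n] = perturbed.mean() - y.mean()

print(shifts)  # the mean moves by about 1/n: ~0.25 vs ~0.00025
```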

        • dab says:

          Thanks for your reply. Thinking of the information contained in the variables makes sense to me, and I have done aggregation tricks analogous to the example I gave in fitting plain vanilla regression models when it seemed to make sense from that standpoint. But your talk of “exchangeability” made me wonder if there was something else going on in multilevel models that I should worry about.

          • Exchangeability means that you don’t have information telling you that some subset of the population differs from another subset in a relevant way. So, for example, suppose your aggregate statistics are over a representative sample of the county population, but your data is strictly about immigrants from a specific region of the world, of whom there are relatively few in your county. Your data on 100 immigrants is most likely not exchangeable with data on 100 randomly selected people from the county.

            In a technical sense, exchangeability is about symmetry under reordering of your data. So if you have 1000 people in some order and you create a generative model using the distribution from the first 900, if the people are all exchangeable you expect the last 100 will look like they came from the same generative model, and this is true regardless of which order you put them in. If they don’t, then you should look for additional explanatory information about them to add to your generative model.
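That reordering check can be sketched in a few lines (a hypothetical Python illustration, not anything from the thread): fit a normal to the first 900 observations, then ask whether the last 100 look like draws from the same model, both under the original ordering and under a random shuffle.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=5.0, scale=2.0, size=1000)  # an exchangeable sample

def heldout_z(data):
    """Fit a normal to the first 900 values, z-score the mean of the last 100."""
    train, test = data[:900], data[900:]
    mu, sigma = train.mean(), train.std()
    return (test.mean() - mu) / (sigma / np.sqrt(len(test)))

z_original = heldout_z(y)
z_shuffled = heldout_z(rng.permutation(y))
# both z-scores stay small: the held-out block is typical under any ordering
print(round(z_original, 2), round(z_shuffled, 2))
```

If the data were not exchangeable (say, the last 100 came from a different subpopulation), the held-out z-score under the original ordering would be systematically large.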


    • “forget it” is perhaps too strong a wording, see discussion below about the role of aggregates in “adding information” to the model for individuals.

  4. Ed Freeman says:

    It’s a bad idea.

    Take this experiment: make a completely random categorical variable, with as many categories as you can stomach. Aggregate the outcome within each category and run the models. You’ll usually get significant results. Any time we can get significant results from random data, we’re doing something wrong.
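That experiment is easy to reproduce (a hypothetical sketch with made-up sizes; OLS is computed by hand to keep it self-contained): assign pure-noise outcomes to random groups, regress each outcome on its own group mean, and the t-statistic comes out "significant" even though there is no signal anywhere.

```python
import numpy as np

rng = np.random.default_rng(3)
n_groups, per_group = 50, 5
g = np.repeat(np.arange(n_groups), per_group)
y = rng.normal(size=n_groups * per_group)        # pure-noise outcome

# aggregate the outcome itself and use it as a group-level predictor
x = np.array([y[g == k].mean() for k in range(n_groups)])[g]

# OLS slope and t-statistic by hand (no real structure in the data)
xc, yc = x - x.mean(), y - y.mean()
beta = (xc @ yc) / (xc @ xc)
resid = yc - beta * xc
se = np.sqrt((resid @ resid) / (len(y) - 2) / (xc @ xc))
print(round(beta, 2), round(beta / se, 1))  # slope near 1, t well above 2
```

The slope lands near 1 by construction, since each observation contributes 1/per_group of its own value to its predictor.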

    • fred says:

      Any time we can get significant results from random data, we’re doing something wrong.

      Not any time, only 95% of the time!

  5. Chris Auld says:

    The answer is a big emphatic “no!”, at least if the goal is to estimate something that can be interpreted as causal (and probably in other contexts, too).

    See Manski (1993) on the “reflection problem,” as this issue is now known in the econometrics literature: