Answering a question about MRP with non-census variables

Eric Green writes:

After reading about Mister P on your blog for some time, I decided to learn the basics to tackle a few projects that involved non-representative samples. I wrote up a teaching example here that I hope is free of fatal flaws.

At this point I get the basic application of Mister P to adjust non-representative data to characterize the target population. But I’m wondering about how to proceed when the analytic goal shifts from characterizing the population to estimating associations.

For instance, in your recent paper, “Bayesian hierarchical weighting adjustment and survey inference”, you set up an example where the population data and the survey data both include the variables age, eth, edu and inc. The survey data also has outcome Y, and you proceed with an MRP example.

Here’s where I’m confused: What if your goal was to estimate the association between Y and another variable M in your survey dataset? Something like:

Y ~ 1 + M + (1 | age) + (1 | eth) + (1 | edu) + (1 | inc)

Without M in the population data, we can’t use the model fit to predict Y on the population summary data, and therefore can’t post-stratify.

This is going to be a dumb question, but if you wanted to look at Y ~ M there is no MRP equivalent to adjusting the estimates, right? I think you would just examine the association in the raw survey dataset (like all studies with convenience samples do) and then consider threats to external validity.

There are two challenges here.

1. If you want to adjust for a non-census variable such as M, you need somehow to estimate the distribution of M in the population. Actually, you need to estimate the joint distribution of (M, X), where X represents all the census variables you’re poststratifying on. You already have the distribution of X, so what you need is the distribution of M given X. You can estimate this from the survey you have (as Yu-Sung and I did in this project) or from other information. Ideally your analysis would account for the uncertainty in this estimated distribution. Alternatively, you can think of the problem as estimating the joint distribution of Y and M, given X, where you’re parameterizing this as M given X, and then Y given M and X.

2. If you just want the distribution of Y given M, then you’ll want to average over X. What this means is you’ll estimate the distribution of Y given M and X, and then poststratify on X. The difficulty here is that you’ll lose linearity (or whatever functional form you were assuming). If Y given M and X is a linear regression, and then you average over the distribution of X, the resulting E(Y|M) will not, in general, be a linear function of M. That’s just the way it goes. You can see this by sketching a scatterplot.
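
Here is a minimal sketch of both points, under stated assumptions. Every number in it is hypothetical, and the names (p_x, p_m_given_x, mean_y, and so on) are made up for illustration. It treats P(M | X) and E(Y | M, X) as known, whereas in practice you would estimate both from fitted models and carry their uncertainty along. The first function poststratifies over the joint distribution P(X) P(M | X). The second uses the identity E(Y | M = m) = sum over x of E(Y | M = m, X = x) P(X = x | M = m), and shows that a regression that is exactly linear in M within each cell of X can give an E(Y | M) that is not linear in M.

```python
# All numbers are hypothetical and chosen only for illustration.
p_x = {"a": 0.5, "b": 0.5}  # census: P(X = x)
p_m_given_x = {             # assumed estimate of P(M = m | X = x), e.g. from the survey
    "a": {0: 0.6, 1: 0.3, 2: 0.1},
    "b": {0: 0.1, 1: 0.6, 2: 0.3},
}
x_effect = {"a": 0.0, "b": 10.0}  # additive shift for each cell of X
slope = 1.0                       # common slope on M


def mean_y(m, x):
    """E(Y | M = m, X = x): linear in m with an additive X effect."""
    return slope * m + x_effect[x]


def population_mean():
    """Poststratify over the joint cells, weighting by P(x) * P(m | x)."""
    return sum(
        p_x[x] * p_m_given_x[x][m] * mean_y(m, x)
        for x in p_x
        for m in p_m_given_x[x]
    )


def mean_y_given_m(m):
    """E(Y | M = m), averaging E(Y | m, x) over P(x | m) via Bayes' rule."""
    joint = {x: p_x[x] * p_m_given_x[x][m] for x in p_x}
    p_m = sum(joint.values())
    return sum(joint[x] / p_m * mean_y(m, x) for x in p_x)


curve = [mean_y_given_m(m) for m in (0, 1, 2)]
# A linear function of m would have a second difference of zero.
second_difference = curve[2] - 2 * curve[1] + curve[0]
print(population_mean(), curve, second_difference)
```

In this toy setup the nonlinearity comes entirely from the fact that the mix of X changes with M: the within-cell slope is the same everywhere, but the cells get different weights at each value of M.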

6 thoughts on “Answering a question about MRP with non-census variables”

  1. > After reading about Mister P on your blog for some time, I decided to learn the basics

    I’m in exactly the same place. Thanks for publishing a nice tutorial; it was very helpful (I wouldn’t know whether it contains fatal flaws, though none jumped out at me).

  2. I think the question posed to Andrew is very reasonable and comes up all the time. Weird that there are no easy examples or tutorials for estimating a relationship between Y and M while accounting for known differences between a sample and the population.

    • Anon:

      Yeah, I’ve been thinking a lot about this problem, and it’s really hard. I think a big part of the difficulty is that it’s not completely clear what is meant by “accounting for known differences between a sample and the population.” You can have survey weights, and that’s fine, but survey weights don’t directly correspond to a sampling model. It’s a surprisingly tricky problem that I’m still struggling with.

  3. I have been dealing with this issue recently as well, so thanks a bunch for bringing it up. I was glad to see that I seem to be thinking about it the correct way. My motivation for including M was actually that I could not get a reasonable generating model for Y without first conditioning on M. More specifically, dispersion in Y was poorly approximated by a lognormal (or any other typical distribution choice) without including the predictor M (plus survey variables). The poor fit was quite clear in the posterior predictive checks. The textbook MRP prescription of conditioning on the survey variables alone didn’t seem to work so well in this case. It made me wonder what the implications might have been had I just used the survey weights in a design-based analysis. Whatever the case, I suspect there are numerous examples like this in practice.

    Maybe for better clarity, my brms model for implementing (1) in the original post looked something like:

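    library(brms)
    bf_y <- bf(y ~ M + survey_vars)  # survey_vars: placeholder for the poststratification variables
    bf_M <- bf(M ~ survey_vars)
    # residual correlation not modeled because y is conditioned on M;
    # survey_data is a placeholder name for the survey data frame
    model <- brm(bf_y + bf_M + set_rescor(rescor = FALSE), data = survey_data)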

    To poststratify, I have to predict M given the survey variables, then predict y from the predicted M and the survey variables. So I treated it something like a SEM, and for brms I found some helpful discussion here about propagating error: https://github.com/paul-buerkner/brms/issues/303
