Mister P can solve problems with survey weighting

It’s tough being a blogger who’s expected to respond immediately to topics in his area of expertise.

For example, here’s Scott “fraac” Adams on 8 Oct 2016, in a post titled “Why Does This Happen on My Vacation? (The Trump Tapes).” After some careful reflection, Adams wrote, “My prediction of a 98% chance of Trump winning stays the same.” And that was all before the second presidential debate, which “Trump won bigly. This one wasn’t close.” I don’t know what Trump’s chance of winning is now. Maybe 99%. Or 108%.

That’s fine. When Gregg Easterbrook made silly political prognostications, I was annoyed, because he purported to be a legitimate political writer. Adams has never claimed to be anything but an entertainer and, by golly, he continues to perform well in that regard. So, yes, I laugh at Adams, but I don’t see why he’d mind that. He is a humorist.

What interested me about Adams’s post of 8 Oct was not so much his particular opinions—Adams’s judgments on electoral politics are about as well-founded as my takes on cartooning—but rather his apparent attitude that he had a duty to his readers to share his thoughts, right away. The whole thing had a pleasantly retro feeling; it brought me back to the golden age of blogging, back around 2002 and the “warbloggers” who, whatever their qualifications, expressed such feelings of urgency about each political and military issue as it arose.

Anyway, that’s all background, and I thought of it all only because a similar thing happened to me today.

The real post starts here

Regular readers know that I’ve been taking a break from blogging—wow, it’s been over two months now—except for the occasional topical item that just can’t wait. And today something came up that just couldn’t wait.

Several people pointed me to this news article by Nate Cohn with the delightful title, “How One 19-Year-Old Illinois Man Is Distorting National Polling Averages”:

There is a 19-year-old black man in Illinois who has no idea of the role he is playing in this election. . . .

He’s a panelist on the U.S.C. Dornsife/Los Angeles Times Daybreak poll, which has emerged as the biggest polling outlier of the presidential campaign. Despite falling behind by double digits in some national surveys, Mr. Trump has generally led in the U.S.C./LAT poll. . . .

Our Trump-supporting friend in Illinois is a surprisingly big part of the reason. In some polls, he’s weighted as much as 30 times more than the average respondent . . . Alone, he has been enough to put Mr. Trump in double digits of support among black voters. . . .

Cohn gives a solid exposition of how this happens: when you do a survey, the sample won’t quite match the population, so survey researchers weight the data to correct for known differences between sample and population. In particular, young black men tend to be underrepresented in surveys compared to the general population, so the few respondents in this demographic group need to be correspondingly upweighted. If there’s just one guy in a cell, he might have to get a really big weight, and Cohn identifies this as a key problem with the adjustment: the survey is using weighting cells that are too small, hence the adjustments are very noisy. In this case, the noise manifests itself as big swings in the USC/LAT poll depending on whether or not this one man is in the sample.
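
To see the mechanics concretely, here’s a minimal sketch in Python with invented numbers (not the USC/LAT data), showing how a single respondent with a weight of 30 can move a 1000-person weighted estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 invented respondents; one carries a weight 30 times the average
y = rng.binomial(1, 0.45, size=1000)   # 1 = supports Trump
w = np.ones(1000)
y[0], w[0] = 1, 30.0                   # the heavily upweighted supporter

def weighted_share(y, w):
    return np.sum(w * y) / np.sum(w)

print("with him:   ", weighted_share(y, w))
print("without him:", weighted_share(y[1:], w[1:]))
# the two estimates differ by over a percentage point, from one person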

There’s also an issue of adjusting for recalled vote in the previous presidential election, but I’ll set that aside for now.

Here’s Cohn on the problems with the big survey weights:

In general, the choice in “trimming” weights [or using coarser weighting cells] is between bias and variance in the results of the poll. If you trim the weights [or use coarser weighting cells], your sample will be biased — it might not include enough of the voters who tend to be underrepresented. If you don’t trim the weights, a few heavily weighted respondents could have the power to sway the survey. . . .

By design, the U.S.C./LAT poll is stuck with the respondents it has. If it had a slightly too Republican sample from the start — and it seems it did, regardless of weighting — there was little it could do about it.
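
For concreteness, “trimming” here just means capping the largest weights. A minimal sketch, where the cap of 5 times the mean weight is an arbitrary illustration rather than any poll’s actual rule:

```python
import numpy as np

def trim_weights(w, cap=5.0):
    """Cap weights at `cap` times the mean weight, then renormalize
    so they sum to the original total."""
    w = np.asarray(w, dtype=float)
    capped = np.minimum(w, cap * w.mean())
    return capped * w.sum() / capped.sum()

w = np.ones(1000)
w[0] = 30.0                 # the heavily weighted respondent
print(trim_weights(w)[0])   # his weight drops to about 5
```

Capping the weight of 30 down to about 5 shrinks that one respondent’s influence roughly sixfold, but it also undoes part of the demographic correction: young black men go back to being underrepresented, which is the bias side of the tradeoff Cohn describes.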

Cohn’s analysis is fine as far as it goes, conditional on the assumption that survey researchers are required to use only classical weighting methods. But there is no such requirement! We can now use Mister P.

Here’s a recent article in the International Journal of Forecasting describing how we used MRP for the Xbox poll. Here’s a longer article in the American Journal of Political Science with more technical details. Here’s MRP in the New York Times back in 2009! And here’s MRP in a Nate Cohn article last month in the Times.

Mister P is not magic; of course if your survey has too many Clinton supporters or too many Trump supporters, compared to what you’d expect based on their demographics, then you’ll get the wrong answer. No way around that. But MRP will automatically give the appropriate weight to single observations.

Two issues arise. First, there’s setting up the regression model. The usual plan would be logistic regression with predictors for sex*ethnicity and age*education. We don’t usually see sex*ethnicity*age. This one guy in the survey would influence all these coefficients—but, again, it’s just one survey respondent, so the influence shouldn’t be large, especially assuming you use some sort of informative prior to avoid the blow-up you’d get if you had zero African-American Trump supporters in your sample. Second, there’s poststratification. There you’ll need some estimate of the demographic composition of the electorate. But you’d need such an estimate to do weighting, too, so I assume the survey organization’s already on top of this one.
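
To show the shape of the computation, here’s a minimal sketch with invented cells, invented electorate counts, and a crude shrinkage rule standing in for the multilevel logistic regression; a real analysis would fit the model described above (in Stan, say) and poststratify over a full Census table:

```python
import numpy as np

# hypothetical survey cells: (respondents, Trump supporters)
cells = {
    "young_black_men":   (1,   1),    # the one heavily influential guy
    "young_black_women": (14,  1),
    "older_black_men":   (25,  3),
    "white_non_college": (400, 230),
    "white_college":     (360, 150),
    "other":             (200, 90),
}
# hypothetical electorate sizes for the same cells (the poststratification table)
pop = {
    "young_black_men":   6.0e6, "young_black_women": 6.5e6,
    "older_black_men":   8.0e6, "white_non_college": 60e6,
    "white_college":     50e6,  "other":             30e6,
}

n = np.array([cells[c][0] for c in cells], dtype=float)
k = np.array([cells[c][1] for c in cells], dtype=float)
N = np.array([pop[c] for c in cells], dtype=float)

# crude partial pooling: shrink each cell's raw rate toward the overall
# rate, with the prior acting like 10 pseudo-respondents; the real
# multilevel model would instead shrink each cell toward the relevant
# demographic margins (the lone young black man toward other black
# respondents, say), but the mechanics are the same
overall = k.sum() / n.sum()
p_cell = (k + 10 * overall) / (n + 10)

print("raw rate, young black men:   ", k[0] / n[0])   # 1.0
print("pooled rate, young black men:", p_cell[0])     # far below 1.0

# poststratify: population-weighted average of the cell estimates
print("MRP-style national estimate: ", np.sum(N * p_cell) / N.sum())
```

The lone respondent still pulls his cell’s estimate upward, but only as far as one observation can move it against the prior, rather than putting the whole cell at 100 percent the way a raw weighted cell mean does.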

So, yeah, we pretty much already know how to handle these problems. That said, there’s some research to be done in easing the transition from classical survey weighting to a modern MRP approach. I addressed some of these challenges in my 2007 paper, Struggles with Survey Weighting and Regression Modeling, but I think a clearer roadmap is needed. We’re working on it.

P.S. Someone forwarded me some comments on a listserv, posted by Arie Kapteyn, Director, USC Dornsife Center for Economic and Social Research:

When designing our USC/LAT poll we have strived for maximal transparency so that indeed anyone who has registered to use our data can verify every step we have taken.

The weights we use to make sure our sample is representative of the U.S. population do result in underrepresented individuals in the sample receiving a higher weight than those in overrepresented groups. In general, one has to decide whether to trim weights so that the factor for any individual will not exceed a certain value. However, trimming weights comes with a trade-off, in that it may not be possible to adequately balance the overall sample after trimming. In this poll, we made the decision not to trim the weights, to ensure that our overall sample would be representative of, for example, young people and African Americans. The result is that a few individuals from less-represented groups, who thus have higher weighting factors, can shift the subgroup graphs when they participate. However, they contribute to an unbiased (but possibly noisier) estimate of the outcomes for the overall population.

Our confidence intervals (the grey zones) take into account the effect of weights. So if someone with a big weight participates, the confidence interval tends to go up. One can see this very clearly in the graph for African Americans: essentially, whenever the line for Trump improved, the grey band widened substantially. More generally, the grey band implies a confidence interval of some 30 percentage points, so we really should not base any firm conclusion on the changes in the graphs. Admittedly, the weight given to this one individual is very large; nevertheless, excluding this individual would move the estimate of the popular vote by less than one percent. Admittedly a lot, but not something that fundamentally changes our forecast. And indeed a movement that falls well within the estimated confidence interval.

So the bottom line is: one should not over-interpret movements if confidence bands are wide.

OK, sure, don’t overinterpret movements if confidence bands are wide, but . . . (1) One concern expressed by Cohn was not just the movements but also that the estimate itself was consistently too high for the Republican candidate, and (2) with MRP, you can do better! No need to take these horrible noisy estimates and just throw up your hands. Using basic principles of statistics, you can get better estimates.

It’s not about trimming the weights or not trimming the weights, it’s about getting a better estimate of national public opinion. The weights—or, more generally, the statistical adjustment—are a means to an end. And you have to keep that end in mind. Don’t get fixated on weighting.
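
For reference, the interval inflation Kapteyn describes is the standard design-effect penalty for unequal weights. Here’s a minimal sketch using Kish’s effective-sample-size formula, with made-up numbers rather than the actual USC/LAT weights:

```python
import numpy as np

def kish_effective_n(w):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

w = np.ones(1000)
w[0] = 30.0                     # one respondent with a weight of 30
n_eff = kish_effective_n(w)
print(f"effective n: {n_eff:.0f} of {len(w)}")            # about 560

# rough 95% half-width for a proportion near 0.5
print(f"half-width: {1.96 * np.sqrt(0.25 / n_eff):.3f}")  # about 0.04
```

A single weight of 30 in a sample of 1000 cuts the effective sample size nearly in half; within a small subgroup such as black voters, the same weight does far more damage, which is the widening Kapteyn describes.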

P.P.S. Also, I guess I should clarify this one point: the classical weighting estimate is not actually unbiased. Kapteyn was incorrect in that claim of unbiasedness.
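
To see why, consider a toy simulation (the nonresponse mechanism here is invented, not estimated from any poll): weighting corrects only for the variables in the weighting model, so if nonresponse within a weighting cell is still correlated with vote preference, averaging over repeated samples never removes the bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# one weighting cell: 40% support Trump, but supporters are assumed
# (for this toy example) to be half as likely to respond
N_pop = 100_000
support = rng.binomial(1, 0.40, size=N_pop)
response_prob = np.where(support == 1, 0.05, 0.10)

estimates = []
for _ in range(500):
    responded = rng.random(N_pop) < response_prob
    # within one cell the weights are constant, so the weighted
    # estimate reduces to the respondent mean
    estimates.append(support[responded].mean())

print("true support in cell:", 0.40)
print("average of 500 weighted estimates:", round(float(np.mean(estimates)), 3))
# comes out near 0.25, not 0.40: biased no matter how many samples
```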

21 thoughts on “Mister P can solve problems with survey weighting”

  1. Andrew,

    I’ve been learning MRP off and on over the past few months, and I’ve been using Kastellec’s paper and R script to get a better sense as to how it works (all available here: http://www.princeton.edu/~jkastell/mrp_primer.html)

    My question is: all examples of MRP that I’ve seen tend to focus on fairly easy binomial one-dimensional measures (support for gay marriage in the case of Kastellec’s piece, voting intentions in your Xbox paper, etc.). But how would this work for a typical survey questionnaire where you have multiple questions in different formats (multiple choice, single choice, scales, etc.)? Do you run a different model for each data point?

  2. One issue that comes up in practice: with MRP (or, more generally, fitting a model for the outcome as a function of covariates X and then integrating over the target distribution of X), you need a model for each outcome in your analysis, rather than just having one set of weights you can use for all outcomes and all recodings of outcomes.

    And for different outcomes the implicit weights on each observation will be different. This is where some of the efficiency gains come from.

      • > many statisticians love having one model that subsumes everything…
        Yup, and often oblivious to the assumptions required and trade-offs involved – if it can all be done automatically, those (critical) assumptions and trade-offs become submerged.

        The more _information_ one can profitably process, the higher the profits. Mis-processing _information_ reduces the profits and can cause losses.

        (One’s qualifications and skills should define the optimal level of processing to operate at in order to maximize profits.)

    • Dean:

      Yes, I discuss this in my 2007 paper. In the example above this is not an issue because we’re talking about just one outcome. In general, though, the appropriate adjustments depend on what is being studied. For example, for a political outcome you’ll want to adjust for ethnicity, education, etc., whereas for a health outcome you might want to adjust very carefully for age (recall Case and Deaton!).
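
      Here’s a minimal sketch of that implicit-weights point, using plain linear regression instead of the multilevel logistic model, purely for illustration: the model-based estimate at the population’s mean covariates is a linear combination of the survey responses, and those implied weights change when the adjustment model changes. The population means below (47.0, 0.51) are invented.

      ```python
      import numpy as np

      rng = np.random.default_rng(2)
      n = 200
      age = rng.integers(18, 80, size=n).astype(float)
      female = rng.integers(0, 2, size=n).astype(float)

      # for least squares, the estimate at population-mean covariates xbar
      # is xbar @ pinv(X) @ y, i.e. a weighted sum w @ y
      def implicit_weights(X, xbar_pop):
          return xbar_pop @ np.linalg.pinv(X)

      X1 = np.column_stack([np.ones(n), age])           # adjust for age only
      X2 = np.column_stack([np.ones(n), age, female])   # adjust for age and sex

      w1 = implicit_weights(X1, np.array([1.0, 47.0]))
      w2 = implicit_weights(X2, np.array([1.0, 47.0, 0.51]))

      # same respondent, different implicit weight under each adjustment model
      print(w1[0], w2[0])
      ```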

  3. The Arie Kapteyn mentioned is (or was) a pretty well-known econometrician; I had actually expected him to be a pensioner by now. I guess he is defending other people’s work, and he probably would not have made such clumsy predictions were he working on this 20 years ago.

    The unbiased thing is an econometrician’s hobby, though: unbiased, but lousy predictions. In some settings it is better to drop the BLUE criterion and focus on predictive quality.

    • Willem:

      I have not before encountered that particular person, but, speaking generally, I’ll just say that the problem with considering the weighted estimate as unbiased is that . . . it’s not unbiased! I guess I should put a note in the above post to that effect.

  4. “Mister P is not magic; of course if your survey has too many Clinton supporters or too many Trump supporters, compared to what you’d expect based on their demographics, then you’ll get the wrong answer.”

    But isn’t this the biggest problem in this panel? They “lucked” into finding that rare African-American Trump supporter?

    • Dl:

      Sure, but that one person had too much influence on the estimate. That’s because they weren’t using the best estimation procedure. The influence of this one person will never be zero but it shouldn’t be as large as it was.

  5. Ah, the old bias-variance tradeoff.

    My general take on this, based on working with these issues for many years: if you are estimating something one time, you will tend to err on the side of minimizing bias (accepting a higher variance, because it impresses clients more and you don’t have to live with your choices long term), but if you have to live with repeated estimates of the phenomenon, you will err on the side of lower variance (i.e., more stable trends and simpler maintenance).

    I no longer work in this area directly, but I’m wondering if there are any handy examples of using Mister P for time-series estimates (e.g., repeated surveys, where the trend is at least as important as the level and the respondents are overlapping but different each wave).

  6. Prof. Gelman,

    I didn’t quite understand this part – “The classical weighting estimate is not actually unbiased”. Can you please elaborate on this? I thought that the classical weighting method was supposed to give an unbiased estimate (assuming that we can use multiple samples).
