Last word on Mister P (for now)

To recap:

Matt Buttice and Ben Highton recently published an article where they evaluated multilevel regression and poststratification (MRP) on a bunch of political examples estimating state-level attitudes.

My Columbia colleagues Jeff Lax, Justin Phillips, and Yair Ghitza added some discussion, giving a bunch of practical tips and pointing to some problems with Buttice and Highton’s evaluations.

Buttice and Highton replied, emphasizing the difficulties of comparing methods in the absence of a known ground truth.

And Jeff Lax added the following comment, which I think is a good overview of the discussion so far:

In the back and forth between us all on details, some points may get lost and disagreements overstated. Where are things at this point?

1. Buttice and Highton (BH) show, beyond previous work, that MRP performance in making state estimates can vary to an extent that is not directly observable unless one knows the true values (in which case one would not be using MRP to begin with). Caution is required. We need to be thinking about the situation of an applied researcher with one set of estimates, not someone with huge simulated sets or access to the truth. We all agree on this. But I think the variation and degree of performance failure are overstated in BH. Phillips and I explained why. Ghitza agreed with us (LP) and went further. BH only explicitly disagreed with Ghitza’s points, I think.

2. BH argue that using the disaggregated survey (CCES) values themselves is a good enough measure of truth to make strong claims about variability and the degree of pessimism one should have. LP agree to an extent. We even calculated “truth” the same way. But BH set aside all noise in their estimates of truth and counted that against MRP performance. Calculating simple disaggregated state means from CCES samples of 25K to 40K respondents still makes for noisy estimates of truth, particularly for the smaller states. Indeed, truth for too many states is still calculated using uncomfortably small samples. (For example, would one really trust an estimate of true opinion in DC based on 10 or fewer observations?) The reliability of such estimates of “truth” using CCES data ranges from around .3 to .9. Values that low are not uncommon, and the reliability varies across questions. This also means that comparisons across questions in BH are called into doubt, since the reliability adjustment ALSO varies across questions. Adjusting for reliability (see the sketch after this list), MRP performance is significantly higher and variability lower than reported in BH. BH did not refute this or even disagree with it.

3. Beyond this, one also has noisy estimates of the poststratification weights used for MRP (Ghitza’s point). Even setting aside this point (more on it later), when one adjusts for the demonstrated degree of reliability in the estimate of truth, MRP performance is higher than BH report (by a third) and variation reduced.

4. That all said, there is still variation, even after our corrections to BH. BH are right to point out that an applied researcher is still left not knowing how well he or she is doing. As BH point out, this is the very problem at which our LP paper is targeted, showing benchmarks and diagnostics for Mister P… Dr. P, if you will. We may ultimately be unsuccessful in this, as BH say they were, but we do already have some findings that are of use to practical researchers, and we will present more soon.

5. Another reason BH overstate the variability and poor performance of MRP is that they implicitly assume a researcher just blindly runs MRP without looking at either the model or the estimates to see whether they make sense. We do the same in that we run large simulations, so we’re not criticizing them, just pointing out that ‘our MRP scheme is meant only for careful researchers’ (to paraphrase the Chevalier de Borda).

6. Ghitza further argues—and this is separate from our reliability correction—that the poststratification weights are themselves only estimated, with sampling error, and that one can use MRP on the large samples to get better measures of truth, etc. We sympathize with BH for not using MRP to do better in measuring truth, since that would seem weird to those not already comfortable with MRP. BH defend their choices on the grounds that it is hard to do better, they were clear about what they did, LP did the same, and sufficient variation remains to warrant caution. We agree with all that. It is hard to do better. They were clear. We did the same. Variation remains. However, any claims about the degree of variation and degree of performance failure need to be clearly labeled as inflated estimates thereof, or perhaps upper bounds, given that the noisiness of “truth” is being set aside. Or one can do what we now do and adjust for reliability using standard methods. One can also calculate uncertainty around model estimates, as we have done in our forthcoming QJPS paper with John Kastellec and Mike Malecki.

7. We are running additional simulations to see whether using standard Census data for poststratification (rather than CCES sample weights) leads to better MRP performance, even for CCES-based estimates. That is, even if one doesn’t use MRP to determine truth as Ghitza suggests, one could use Census weights on the MRP estimates, as is usually done (see the poststratification sketch after this list). We’ll report back on that.

8. We agree that a good state-level predictor is important. BH are skeptical about finding one. We are generally not. One need not have access to true opinion to have a reasonable predictor of state-level opinion or of the state random effects in MRP’s multilevel modeling stage.
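To make the reliability point in items 2 and 3 concrete, here is a minimal sketch of the standard disattenuation correction, in Python with purely made-up numbers (the state opinions, error scales, and sample sizes below are placeholders, not CCES values): noise in the disaggregated “truth” pulls the observed MRP-vs-truth correlation down by roughly the square root of the reliability.

```python
import numpy as np

# Illustrative only: 50 simulated "states" with stand-in numbers.
# Disaggregated "truth" = true opinion + small-sample noise, so the observed
# correlation between MRP estimates and that "truth" understates the
# correlation with actual opinion.
rng = np.random.default_rng(0)
true_opinion = rng.normal(0.5, 0.10, size=50)               # hypothetical state opinions
mrp_est      = true_opinion + rng.normal(0, 0.05, size=50)  # MRP estimates with error
noisy_truth  = true_opinion + rng.normal(0, 0.08, size=50)  # small-sample state means

# Reliability of the noisy "truth" (signal variance / total variance);
# in practice this would be estimated from the data, e.g., by split halves.
reliability = 0.10**2 / (0.10**2 + 0.08**2)

r_observed  = np.corrcoef(mrp_est, noisy_truth)[0, 1]
r_corrected = r_observed / np.sqrt(reliability)  # standard disattenuation formula
print(f"observed r = {r_observed:.2f}, reliability-adjusted r = {r_corrected:.2f}")
```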
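And here is a minimal sketch of the Census-based poststratification step mentioned in item 7: cell-level estimates from the multilevel regression stage are averaged within each state, weighted by Census population counts. The data frame, cell estimates, and counts below are all hypothetical.

```python
import pandas as pd

# Hypothetical inputs: one row per state-demographic cell, with the cell's
# modeled support `theta` (from the multilevel regression stage) and its
# Census population count `N`.
cells = pd.DataFrame({
    "state": ["WY", "WY", "CA", "CA"],
    "theta": [0.42, 0.55, 0.48, 0.63],                   # modeled cell-level estimates
    "N":     [120_000, 160_000, 9_000_000, 11_000_000],  # Census cell counts
})

# Poststratified state estimate: population-weighted average of cell estimates.
cells["weighted"] = cells["theta"] * cells["N"]
totals = cells.groupby("state")[["weighted", "N"]].sum()
state_est = totals["weighted"] / totals["N"]
print(state_est)
```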

Here’s Jeff’s summary:

MRP performance is better and less variable than Buttice and Highton reported in their paper, but they are absolutely right that variation remains and that performance is imperfect. We hope that no one forgets that MRP estimates are still only estimates of true state opinion and that one must do MRP with attention and care (the newly revised MRP package and our forthcoming diagnostics should help with this).

I appreciate the efforts of Lax, Phillips, Buttice, Highton, and Ghitza in this discussion.

P.S. Buttice and Highton respond here.

1 thought on “Last word on Mister P (for now)”

  1. I find this emphasis on the need for researchers to be “careful” a bit unsatisfying. What does “careful” mean, exactly? Could one simulate careful selection of state-level predictors from imperfect information and estimate the resulting variation in Mister P estimates? Instead of choosing a predictor uniformly at random, place moderately higher probabilities on predictors you know to be good, to simulate the beneficial impact of the subject knowledge of a careful researcher. Or maybe even include a data-based criterion for predictor selection in the simulations.
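For concreteness, here is one way the commenter’s suggestion could be simulated, as a rough Python sketch; the predictor names, quality weights, and selection probabilities are entirely hypothetical.

```python
import numpy as np

# Sketch: instead of drawing a state-level predictor uniformly at random for
# each simulated MRP run, draw it with probabilities tilted toward predictors
# believed (a priori) to be good, mimicking a "careful" researcher.
rng = np.random.default_rng(1)

predictors = ["pres_vote", "religiosity", "median_income", "pct_urban"]  # hypothetical
quality    = np.array([0.9, 0.7, 0.4, 0.3])  # assumed prior beliefs about predictor quality
probs      = quality / quality.sum()         # tilted (non-uniform) selection probabilities

n_sims = 1000
careful_picks = rng.choice(predictors, size=n_sims, p=probs)  # "careful" researcher
naive_picks   = rng.choice(predictors, size=n_sims)           # uniform baseline

# Each pick would feed one simulated MRP run; comparing the resulting
# distributions of estimate error would quantify the value of being "careful".
vals, counts = np.unique(careful_picks, return_counts=True)
print(dict(zip(vals, counts)))
```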
