More on Mister P and how it does what it does

Following up on our discussion the other day, Matt Buttice and Ben Highton write:

It was nice to see our article mentioned and discussed by Andrew, Jeff Lax, Justin Phillips, and Yair Ghitza on Andrew’s blog in this post on Wednesday. As noted in the post, we recently published an article in Political Analysis on how well multilevel regression and poststratification (MRP) performs at producing estimates of state opinion with conventional national surveys where N≈1,500. Our central claims are that (i) the performance of MRP is highly variable, (ii) in the absence of knowing the true values, it is difficult to determine the quality of the MRP estimates produced on the basis of a single national sample, and, (iii) therefore, our views about the usefulness of MRP in instances where a researcher has a single sample of N≈1,500 are less optimistic than the ones expressed in previous research on the topic.

Obviously we were interested in the blog posts. We found them stimulating and have begun reflecting on them. It seems that there are some broad areas of agreement regarding the use of MRP with conventional national survey samples. And, there are areas of disagreement. In response to the posts, we have several initial thoughts that may be of interest to those contemplating the use of MRP in their own research.

1. The question that motivated our article was this: With a national sample of 1,500 respondents, can a researcher be confident that MRP will produce “good” state-level estimates of opinion? If we have read the blog posts correctly, then there is agreement among us all that a researcher should not assume that MRP will perform well and that the MRP estimates ought instead to be validated. This leads to a number of additional questions. One is: In the absence of knowing true state opinion, is it possible to validate and assess the quality of the MRP estimates? In our own work for the article, we tested a variety of possibilities and were unable to come up with a method that consistently worked for assessing the MRP estimates.

2. Another question that arises is: What conditions must be met for MRP to perform well? As Andrew writes in his blog, “Good state-level predictors are crucial if you’re using Mister P [MRP] to get estimates in all the states.” We agree – as do Lax and Phillips (LP) – that good state-level predictors are a necessary condition. We would add that there is no reason to believe that what is “good” for one model will be “good” for another. This is especially the case when the opinions in question are in different issue domains or from different time periods. (Our Monte Carlo simulations, discussed in the article, suggest that the relationship between interstate and intrastate variation in opinion is also important for MRP performance.)

3. How does the researcher know if he/she has good state-level predictors? As in #1 above, it seems that we are all in agreement that the researcher should not assume that he/she has good state-level predictors. If one knew true opinion across the states, then it would be easy to assess the quality of the state-level predictors. But, if the researcher knew true opinion across the states, then he/she would not be trying to estimate it by applying MRP to a sample of 1,500 respondents. So, with only a sample of 1,500 respondents – the situation that applied researchers will find themselves in – is there a valid and reliable way to determine if one has good state-level predictors? We attempted to find such a test, but as in #1 above, we were unsuccessful, and therefore did not report one in our article.

4. Assessing the quality of the state-level predictors would be less important if the state-level predictors were only rarely “not good” or if MRP (for whatever reason) only rarely produced low-quality estimates. Here is where we think disagreement emerges between our views and those expressed in the blog posts. When we look at the results reported in our article, we see substantial variability across all the measures assessing MRP performance. Previous analyses of MRP do not show how its performance varies across different samples drawn from the same population to estimate the same opinion. But as we show in the figures in our article, variability is evident when we look across different samples for the same opinion and across opinions. To be sure, Yair Ghitza suggests that we overestimate the amount of variability. If one agrees with Ghitza that measuring “true” preferences with MRP does not lead to overestimating the performance and consistency of MRP with small samples, then there is less actual variation in MRP performance than the amount we report in our article. But less variation does not mean minimal variation, as shown in Ghitza’s figure “MRP as ‘True’ Value, Census as Population.” Put another way, based on the results reported in Ghitza’s figure, could a researcher who has produced state opinion estimates with MRP from a sample of 1,500 respondents be confident that MRP has performed well? We do not think so, but realize others might think differently.

5. In questioning our results, Ghitza suggests that for two reasons we overestimate the amount of variability in MRP performance. This is where we think that resolving our differences is more difficult, and we will not attempt to do so here. Instead, we refer interested readers to footnote 14 in our article and here will only try to be clear about what we did. We computed “true” state means based on the full samples of 25,000+ respondents and also conducted a “census” based on the full sample to produce the poststratification weights. After drawing a sample of 1,500 and estimating the multilevel model of opinion on it, we produced state estimates based on the predictions from the multilevel model, weighted by the “census” weights. (A schematic sketch of this procedure appears after these points.) For what it’s worth, this appears to be what LP did in their MPSA paper: “The next stage is poststratification, in which our estimates for each respondent demographic-geographic type must be weighted by the percentages of each type in the state population. Again, we assume the state population to be the set of CCES survey respondents from that state” (Lax & Phillips, MPSA 2013, p. 17).

6. Last, we return to our point #1 above. Suppose our analyses and results are deeply flawed and deserve to be disregarded completely, a supposition that we recognize may reflect the views of some readers. Consider the question that motivated our article: How confident should a researcher who only has a single national survey sample of 1,500 (or even 3,000) respondents be in the MRP estimates of state opinion produced with it? Setting aside our article, the only other published studies that assess MRP performance with samples like these are Lax and Phillips (2009) and Warshaw and Rodden (2012). The former assess MRP performance for two opinions and the latter do it for six. And, Warshaw and Rodden (2012) do find what we would call nontrivial variation in average MRP performance across items (look at the MRP entries for the six opinion items when N=2,500 in their Figure 5; we highlighted the relevant entries.) On the basis of Lax and Phillips (2009) and Warshaw and Rodden (2012), then, we would not draw the inference that MRP will consistently and routinely perform well across different opinions or even for the same opinion at different points in time. And, even when MRP has worked well, we are unsure how the researcher can verify its performance. The investigations of MRP performance that preceded ours – and ours, too – all assess the quality of the estimates by comparing them to “true” values. In the absence of knowing the true values we do not see how the researcher could determine how “good” or “bad” the MRP estimates are, and we would therefore hesitate to use them. That said, developing a validation technique for MRP estimates when “true” values are unknown appears to be an issue that LP are working on, and we look forward to reading the next version of their MPSA paper.
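
To make the procedure described in point 5 concrete, here is a minimal sketch of that kind of simulation – not BH’s actual code. The idea: treat the full survey as the population, compute “true” state means and demographic-geographic cell shares (the “census”) from it, draw a national sample of roughly 1,500, model opinion in the small sample, poststratify the cell-level predictions by the census shares, and compare the result to “truth.” The column names (state, cell, y), the shrinkage constant, and the use of simple partially pooled cell means as a stand-in for a full multilevel regression are all illustrative assumptions.

    import numpy as np
    import pandas as pd

    def poststratify(cell_preds, census):
        """Weight cell-level predictions by each cell's share of its state's population."""
        merged = census.merge(cell_preds, on=["state", "cell"], how="left")
        # Cells that happen not to appear in the small sample fall back to the overall mean.
        merged["pred"] = merged["pred"].fillna(merged["pred"].mean())
        return merged.groupby("state").apply(lambda d: np.average(d["pred"], weights=d["share"]))

    def simulate_once(full_survey, n=1500, seed=0):
        rng = np.random.default_rng(seed)

        # "Truth": state means computed directly from the full 25,000+ respondent survey.
        truth = full_survey.groupby("state")["y"].mean()

        # "Census": demographic-geographic cell shares within each state, also from the full survey.
        census = (full_survey.groupby(["state", "cell"]).size()
                  .groupby(level="state").transform(lambda s: s / s.sum())
                  .rename("share").reset_index())

        # Draw a national sample of about 1,500 respondents.
        sample = full_survey.sample(n=n, random_state=int(rng.integers(10**9)))

        # Stand-in for the multilevel model: partially pooled cell means shrunk toward the
        # grand mean (a real application would fit a multilevel logistic regression with
        # demographic predictors and a state-level predictor).
        grand_mean = sample["y"].mean()
        cells = sample.groupby(["state", "cell"])["y"].agg(["mean", "size"])
        k = 5.0  # illustrative shrinkage constant
        cells["pred"] = (cells["size"] * cells["mean"] + k * grand_mean) / (cells["size"] + k)

        # Poststratify by the "census" weights and compare the estimates to "truth".
        mrp = poststratify(cells["pred"].reset_index(), census)
        return np.corrcoef(truth.loc[mrp.index], mrp)[0, 1]

Repeating simulate_once over many sample draws and many opinion items is what produces the across-sample and across-item variability that the discussion above is about.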

I love seeing all this discussion. As regular readers will know, I think Mister P is great, and I’m glad to see a range of applied researchers thinking hard about what makes it work and how to improve it.

7 thoughts on “More on Mister P and how it does what it does”

  1. In the back and forth between us all on details, some points may get lost and disagreements overstated. Where are things at this point?

    1. Buttice and Highton (BH) show, beyond previous work, that MRP performance in making state estimates can vary to an extent that is not directly observable unless one knows the true values (in which case one would not be using MRP to begin with). Caution is required. We need to be thinking about the situation of an applied researcher with one set of estimates, not huge simulated sets or access to truth. We all agree on this. But I think the variation and degree of performance failure are overstated in BH. Phillips and I explained why. Ghitza agreed with us (LP) and went further. BH only explicitly disagreed with Ghitza’s points, I think.

    2. BH argue that using the disaggregated survey (CCES) values themselves is a good enough measure of truth to make strong claims about variability and the degree of pessimism one should have. LP agree to an extent. We even calculated “truth” the same way. But BH set aside all the noise in their estimates of truth and counted that against MRP performance. Calculating simple disaggregated state means from CCES samples of 25,000 to 40,000 respondents still makes for noisy estimates of truth, particularly for the smaller states. Indeed, truth for too many states is still then calculated using uncomfortably small samples. (For example, would one really trust an estimate of true opinion in DC using 10 or fewer observations?) The reliability of such estimates of “truth” using CCES data ranges from around .3 to .9. Those reliabilities are often low, and they vary across questions. This also means that comparisons across questions in BH are called into doubt, since the reliability adjustment ALSO varies across questions. Adjusting for reliability, MRP performance is substantially higher and variability lower than reported in BH (see the short illustration at the end of this comment). BH did not refute or even disagree with this.

    3. Beyond this, one also has noisy estimates of the poststratification weights used for MRP (Ghitza’s point). Even setting aside this point (more on it later), when one adjusts for the demonstrated degree of reliability in the estimate of truth, MRP performance is higher than BH report (by a third) and variation is reduced.

    4. That all said, there is still variation: even after our corrections to BH, variation remains. BH are right to point out that an applied researcher is still left not knowing how well he or she is doing. As BH point out, this is the very problem at which our LP paper is targeted, showing benchmarks and diagnostics for Mister P… Dr. P, if you will. We may ultimately be unsuccessful in this, as BH say they were, but we do already have some findings that are of use to practical researchers and will present more soon.

    5. Another reason BH overstate MRP performance variability and negative performance is that they implicitly assume a researcher just blindly runs MRP without looking at either the model or the estimates to see if they make sense, etc. We do the same in that we do run large simulations, so we’re not criticizing them, just pointing out that ‘our MRP scheme is meant for only careful researchers’ (to paraphrase the Chevalier de Borda).

    6. Ghitza further argues – and this is separate from our reliability correction – that the poststratification weights are themselves estimated with sampling error and that one can use MRP on the large samples to get better measures of truth, etc. We sympathize with BH for not using MRP to do better in measuring truth, since that would seem weird to those not already comfortable with MRP. BH defend their choices on the grounds that it is hard to do better, they were clear about what they did, LP do the same, and sufficient variation remains to warrant caution. We agree with all that. It is hard to do better. They were clear. We did the same. Variation remains. However, any claims about the degree of variation and degree of performance failure need to be clearly labeled as inflated estimates thereof, or perhaps as upper bounds, given that the noisiness of “truth” is being set aside. Or one can do what we now do and adjust for reliability using standard methods. One can also calculate uncertainty around model estimates, as we have done in our forthcoming QJPS paper with John Kastellec and Mike Malecki.

    7. We are running additional simulations to test whether using standard Census data for poststratification (rather than CCES sample weights) leads to better MRP performance, even for CCES-based estimates. That is, even if one doesn’t use MRP to determine truth as Ghitza suggests, one could use Census weights on the MRP estimates, as is usually done. We’ll report back on that.

    8. We agree that a good state-level predictor is important. BH are skeptical about finding one. We are generally not. One need not have access to true opinion to have a reasonable predictor of state-level opinion or of the state random effects in MRP’s multilevel modeling stage.

    To sum up, MRP performance is better and less variable than BH reported in their paper, but they are absolutely right that variation remains and that performance is imperfect. We hope that no one forgets that MRP estimates are still only estimates of true state opinion and that one must do MRP with attention and care (the newly revised MRP package and our forthcoming diagnostics should help with this).
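
    As a small numeric illustration of the reliability adjustment mentioned in points 2 and 3 – the standard correction for attenuation – here is a short sketch; the specific numbers are made up for illustration and are not taken from BH’s article or from the reanalysis described above:

        import math

        # If the disaggregated "truth" is itself a noisy measure with reliability rho,
        # the observed correlation between the MRP estimates and measured "truth"
        # understates the correlation with actual truth by a factor of sqrt(rho).
        r_observed = 0.60   # illustrative correlation of MRP estimates with measured "truth"
        rho_truth = 0.70    # illustrative reliability of the disaggregated "truth" measure

        r_corrected = r_observed / math.sqrt(rho_truth)
        print(round(r_corrected, 2))  # about 0.72, noticeably higher than the raw 0.60

    Because the reliability varies across questions, so does the size of this correction, which is why uncorrected comparisons of MRP performance across questions can mislead.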

  2. (Lax) “Another reason BH overstate MRP performance variability and negative performance is that they implicitly assume a researcher just blindly runs MRP without looking at either the model or the estimates to see if they make sense, etc. We do the same in that we do run large simulations, so we’re not criticizing them, just pointing out that ‘our MRP scheme is meant for only careful researchers’”

    Of course, researchers get less careful as a technique gets into broader use.

    Thanks for making this blog a central place for this important dialogue.

  3. Pingback: Being Careful with Multilevel Regression with Poststratification | The Political Methodologist

  4. Minor point, but since it was mentioned above: Our 2009 AJPS paper assessed MRP for 6 survey items, not just 2 (see p119).
