Following up on our discussion the other day, Matt Buttice and Ben Highton write:
It was nice to see our article mentioned and discussed by Andrew, Jeff Lax, Justin Phillips, and Yair Ghitza on Andrew’s blog in this post on Wednesday. As noted in the post, we recently published an article in Political Analysis on how well multilevel regression and poststratification (MRP) performs at producing estimates of state opinion with conventional national surveys where N≈1,500. Our central claims are that (i) the performance of MRP is highly variable, (ii) in the absence of knowing the true values, it is difficult to determine the quality of the MRP estimates produced on the basis of a single national sample, and, (iii) therefore, our views about the usefulness of MRP in instances where a researcher has a single sample of N≈1,500 are less optimistic than the ones expressed in previous research on the topic.
Obviously we were interested in the blog posts. We found them stimulating and have begun reflecting on them. It seems that there are some broad areas of agreement regarding the use of MRP with conventional national survey samples, and some areas of disagreement. In response to the posts, we have several initial thoughts that may be of interest to those contemplating the use of MRP in their own research.
1. The question that motivated our article was this: With a national sample of 1,500 respondents, can a researcher be confident that MRP will produce “good” state-level estimates of opinion? If we have read the blog posts correctly, then there is agreement among us all that a researcher should not assume that MRP will perform well and that the MRP estimates instead ought to be validated. This leads to a number of additional questions. One is: In the absence of knowing true state opinion, is it possible to validate and assess the quality of the MRP estimates? In our own work as part of the research for the article, we tested a variety of possibilities and were unsuccessful at coming up with a method that consistently worked to assess the MRP estimates.
2. Another question that arises is: What conditions must be met for MRP to perform well? As Andrew writes in his blog, “Good state-level predictors are crucial if you’re using Mister P [MRP] to get estimates in all the states.” We agree – as do Lax and Phillips (LP) – that good state-level predictors are a necessary condition. We would add that there is no reason to believe that what is “good” for one model will be “good” for another. This is especially the case when the opinions in question are in different issue domains or from different time periods. (Our Monte Carlo simulations, discussed in the article, also suggest that the relationship between interstate variation and intrastate variation in opinion is important for MRP performance.)
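For readers less familiar with the mechanics, the role of state-level predictors can be made concrete with the generic form of the multilevel model used in this literature (the particular predictor shown is illustrative, not one from any specific article):

```latex
\Pr(y_i = 1) \;=\; \operatorname{logit}^{-1}\!\left(\alpha^{\mathrm{state}}_{s[i]} + \alpha^{\mathrm{demo}}_{d[i]}\right),
\qquad
\alpha^{\mathrm{state}}_{s} \;\sim\; N\!\left(\gamma_0 + \gamma_1 z_{s},\; \sigma^2_{\mathrm{state}}\right),
```

where $z_s$ is a state-level predictor such as presidential vote share. Because the state intercepts are partially pooled toward the regression on $z_s$, a weak or irrelevant $z_s$ shrinks the intercepts for sparsely sampled states toward essentially uninformative values – which is one way of seeing why a predictor that is “good” for one opinion or time period need not be good for another.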
3. How does the researcher know if he/she has good state-level predictors? As in #1 above, it seems that we are all in agreement that the researcher should not assume that he/she has good state-level predictors. If one knew true opinion across the states, then it would be easy to assess the quality of the state-level predictors. But, if the researcher knew true opinion across the states, then the researcher would not be trying to estimate it by applying MRP to a sample of 1,500 respondents. So, with only a sample of 1,500 respondents – the situation that applied researchers will find themselves in – is there a valid and reliable way to determine if one has good state-level predictors? We attempted to find such a test, but as in #1 above, we were unsuccessful, and therefore did not report one in our article.
4. Assessing the quality of the state-level predictors would be less important if the state-level predictors were only rarely “not good” or if MRP (for whatever reason) only rarely produced low quality estimates. Here is where we think disagreement emerges between our views and those expressed in the blog posts. When we look at the results reported in our article, we see substantial variability across all the measures assessing MRP performance. Previous analyses of MRP do not show how its performance varies across different samples drawn from the same population to estimate the same opinion. But as we show in the figures in our article, variability is evident when we look across different samples for the same opinion and across opinions. To be sure, Yair Ghitza suggests that we overestimate the amount of variability. If one agrees with Ghitza that our reliance on MRP-based measures of “true” preferences leads us to overstate the variability in MRP performance with small samples, then there is less actual variation in MRP performance than the amount we report in our article. But less variation does not mean minimal variation, as shown in Ghitza’s figure “MRP as ‘True’ Value, Census as Population.” Put another way, based on the results reported in Ghitza’s figure, could a researcher who has produced state opinion estimates with MRP from a sample of 1,500 respondents be confident that MRP has performed well? We do not think so, but realize others might think differently.
5. In questioning our results, Ghitza suggests that for two reasons we overestimate the amount of variability in MRP performance. This is where we think that resolving our differences is more difficult, and we will not attempt to do so here. Instead, we refer interested readers to footnote 14 in our article and here will only try to be clear about what we did. We computed “true” state means based on the full sample of 25,000+ respondents and also conducted a “census” based on the full sample to produce the stratification weights. After drawing a sample of 1,500 and estimating the multilevel model of opinion on it, we produced state estimates based on the predictions from the multilevel model, weighted by the “census” weights. For what it’s worth, this appears to be what LP did in their MPSA paper: “The next stage is poststratification, in which our estimates for each respondent demographic-geographic type must be weighted by the percentages of each type in the state population. Again, we assume the state population to be the set of CCES survey respondents from that state” (Lax & Phillips, MPSA 2013, p. 17).
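To make the procedure in this point concrete, here is a minimal, self-contained sketch of the simulation design as we understand it: “true” state means and “census” cell weights are computed from a full synthetic population, a single national sample of 1,500 is drawn from it, cell estimates are partially pooled (a simple shrinkage estimator stands in for the full multilevel model), and the cell estimates are then poststratified by the census weights. All data-generating values and the shrinkage constant are illustrative assumptions, not the article’s actual data or model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic "population" (a stand-in for the 25,000+
# respondent pool): 50 states, one binary demographic, one binary opinion.
n_pop, n_states = 25_000, 50
state = rng.integers(0, n_states, n_pop)
demo = rng.integers(0, 2, n_pop)
state_effect = rng.normal(0.0, 0.5, n_states)  # illustrative state variation
p = 1.0 / (1.0 + np.exp(-(0.3 * demo + state_effect[state])))
opinion = rng.binomial(1, p)

# Step 1: "true" state means computed from the full population.
true_means = np.array([opinion[state == s].mean() for s in range(n_states)])

# Step 2: "census" -- within-state shares of each (state, demo) cell,
# computed from the same full population.
counts = np.zeros((n_states, 2))
for s in range(n_states):
    for d in range(2):
        counts[s, d] = np.sum((state == s) & (demo == d))
census = counts / counts.sum(axis=1, keepdims=True)

# Step 3: draw a single national sample of 1,500.
idx = rng.choice(n_pop, size=1_500, replace=False)
s_samp, d_samp, y_samp = state[idx], demo[idx], opinion[idx]

# Step 4: estimate opinion by cell. A simple shrinkage estimator (cell mean
# pulled toward the national mean for that demographic group) stands in for
# the multilevel model; k is an arbitrary illustrative prior weight.
k = 10
cell_est = np.zeros((n_states, 2))
for d in range(2):
    grand = y_samp[d_samp == d].mean()
    for s in range(n_states):
        cell = (s_samp == s) & (d_samp == d)
        cell_est[s, d] = (y_samp[cell].sum() + k * grand) / (cell.sum() + k)

# Step 5: poststratify -- weight cell estimates by the census cell shares.
mrp_est = (cell_est * census).sum(axis=1)

mae = np.mean(np.abs(mrp_est - true_means))
print(f"mean absolute error vs. 'true' state means: {mae:.3f}")
```

Re-running Steps 3–5 with many different sample draws (different seeds), for the same population, is exactly the exercise that reveals how much MRP performance varies from sample to sample.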
6. Last, we return to our point #1 above. Suppose our analyses and results are deeply flawed and deserve to be disregarded completely, a supposition that we recognize may reflect the views of some readers. Consider the question that motivated our article: How confident should a researcher who only has a single national survey sample of 1,500 (or even 3,000) respondents be in the MRP estimates of state opinion produced with it? Setting aside our article, the only other published studies that assess MRP performance with samples like these are Lax and Phillips (2009) and Warshaw and Rodden (2012). The former assess MRP performance for two opinions, the latter for six. And Warshaw and Rodden (2012) do find what we would call nontrivial variation in average MRP performance across items (look at the MRP entries for the six opinion items when N=2,500 in their Figure 5; we highlighted the relevant entries). On the basis of Lax and Phillips (2009) and Warshaw and Rodden (2012), then, we would not draw the inference that MRP will consistently and routinely perform well across different opinions or even for the same opinion at different points in time. And even when MRP has worked well, we are unsure how the researcher can verify its performance. The investigations of MRP performance that preceded ours – and ours, too – all assess the quality of the estimates by comparing them to “true” values. In the absence of knowing the true values, we do not see how the researcher could determine how “good” or “bad” the MRP estimates are, and we would therefore hesitate to use them. That said, developing a validation technique for MRP estimates when “true” values are unknown appears to be an issue that LP are working on, and we look forward to reading the next version of their MPSA paper.
I love seeing all this discussion. As regular readers will know, I think Mister P is great, and I’m glad to see a range of applied researchers thinking hard about what makes it work and how to improve it.