
“The most important aspect of a statistical analysis is not what you do with the data, it’s what data you use” (survey adjustment edition)

Dean Eckles pointed me to this recent report by Andrew Mercer, Arnold Lau, and Courtney Kennedy of the Pew Research Center, titled, “For Weighting Online Opt-In Samples, What Matters Most? The right variables make a big difference for accuracy. Complex statistical methods, not so much.”

I like most of what they write, but I think some clarification is needed to explain why it is that complex statistical methods (notably Mister P) can make a big difference for survey accuracy. Briefly: complex statistical methods can allow you to adjust for more variables. It’s not that the complex methods alone solve the problem, it’s that, with the complex methods, you can include more data in your analysis (specifically, population data to be used in poststratification). It’s the sample and population data that do the job for you; the complex model is just there to gently hold the different data sources in position so you can line them up.

In more detail:

I agree with the general message: “The right variables make a big difference for accuracy. Complex statistical methods, not so much.” This is similar to something Hal Stern told me once: the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use. (I can’t remember when I first heard this; it was decades ago, but here’s a version from 2013.) I add, though, that better statistical methods can do better to the extent that they allow us to incorporate more information. For example, multilevel modeling, with its partial pooling, allows us to control for more variables in survey adjustment.

So I was surprised they didn’t consider multilevel regression and poststratification (MRP) in their comparisons in that report. The methods they chose seem limited in that they don’t do regularization, hence they’re limited in how much poststratification information they can include without getting too noisy. Mercer et al. write, “choosing the right variables for weighting is more important than choosing the right statistical method.” Ideally, though, one would not have to “choose the right variables” but would instead include all relevant variables (or, at least, whatever can be conveniently managed), using multilevel modeling to stabilize the inference.
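To make the poststratification step concrete, here is a minimal sketch (cells and numbers are invented for illustration; in MRP the per-cell estimates would come from a fitted multilevel model rather than being supplied by hand): the population estimate is just the population-count-weighted average of the cell estimates.

```python
# Poststratification sketch. In MRP, a multilevel model produces a
# stabilized estimate for each demographic cell; poststratification then
# weights those estimates by known population cell counts.

def poststratify(cell_estimates, cell_counts):
    """Population estimate = count-weighted average of cell estimates."""
    total = sum(cell_counts[c] for c in cell_estimates)
    return sum(cell_estimates[c] * cell_counts[c] for c in cell_estimates) / total

# Hypothetical cells defined by age group x education:
estimates = {("18-34", "no degree"): 0.42, ("18-34", "degree"): 0.55,
             ("65+", "no degree"): 0.38, ("65+", "degree"): 0.50}
counts = {("18-34", "no degree"): 30_000, ("18-34", "degree"): 20_000,
          ("65+", "no degree"): 35_000, ("65+", "degree"): 15_000}

print(round(poststratify(estimates, counts), 4))
```

The more variables you cross, the more cells you get and the noisier each raw cell estimate becomes; partial pooling is what keeps the per-cell estimates usable as the cell count grows.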

They talk about raking performing well, but raking involves its own choices and tradeoffs: in particular, simple raking involves tough choices about what interactions to rake on. Again, MRP can do better here because of partial pooling. In simple raking, you’re left with the uncomfortable choice of raking only on margins and realizing you’re missing key interactions, or raking on lots of interactions and getting hopelessly noisy weights, as discussed in this 2007 paper on struggles with survey weighting.
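For readers unfamiliar with raking: it adjusts the weights iteratively so that each margin in turn matches the population. A minimal iterative-proportional-fitting sketch (toy numbers, two binary margins) shows both the mechanism and the limitation:

```python
import numpy as np

# Toy raking (iterative proportional fitting): a 2x2 sample table is
# rescaled until its margins match known population margins. Numbers
# are invented. Note that only the margins are matched; the sample's
# interaction structure is carried along unchanged, which is exactly
# why raking on margins alone can miss key interactions.

sample = np.array([[20., 30.],     # rows: male/female; cols: young/old (toy)
                   [25., 25.]])
row_targets = np.array([50., 50.])   # population row margin
col_targets = np.array([40., 60.])   # population column margin

w = sample.copy()
for _ in range(100):
    w *= (row_targets / w.sum(axis=1))[:, None]   # match row margin
    w *= (col_targets / w.sum(axis=0))[None, :]   # match column margin

print(np.round(w.sum(axis=1), 6), np.round(w.sum(axis=0), 6))
```

Raking on many interactions amounts to adding more (and finer) margins to this loop, and with sparse cells the resulting weights blow up; the multilevel model's partial pooling is what tames that noise.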

Also, I’m glad that Mercer et al. pointed out that increasing the sample size reduces variance but does nothing for bias. That’s a point that’s so obvious from a statistical perspective but is misunderstood by so many people. I’ve seen this in discussions of psychology research, where outsiders recommend increasing N as if it is some sort of panacea. Increasing N can get you statistical significance, but who cares if all you’re measuring is a big fat bias? So thanks for pointing that out.
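The point fits in a few lines of simulation (made-up numbers): draw from a sampling process that over-represents one group, and as N grows the estimate concentrates ever more tightly around the wrong value.

```python
import random

random.seed(1)

# Toy simulation: the population rate is 0.5, but differential
# nonresponse makes each respondent "yes" with probability 0.6.
# Larger N tightens the estimate around the biased value 0.6,
# not the true value 0.5.

def biased_poll(n):
    return sum(random.random() < 0.6 for _ in range(n)) / n

for n in (100, 10_000, 1_000_000):
    print(n, biased_poll(n))
```

With a million respondents the standard error is tiny, so the interval around 0.6 is narrow, precise, and wrong.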

But my main point is that I think it makes sense to be moving to methods such as MRP that allow adjustment for more variables. This is, I believe, completely consistent with their main point as indicated in the title of their report.

One more thing: The title begins, “For Weighting Online Opt-In Samples…”. I think these issues are increasingly important in all surveys, including surveys conducted face to face, by telephone, or by mail. Nonresponse rates are huge, and differential nonresponse is a huge issue (see here). The point is not just that there’s nonresponse bias, but that this bias varies over time, and it can fool people. I fear that focusing the report on “online opt-in samples” may give people a false sense of security about other modes of data collection.

I sent the above comments to Andrew Mercer, who replied:

I’d like to do more with MRP, but the problem that we run into is the need to have a model for every outcome variable. For Pew and similar organizations, a typical survey has somewhere in the neighborhood of 90 questions that all need to be analyzed together, and that makes the usual approach to MRP impractical. That said, I have recently come across this paper by Yajuan Si, yourself, and others about creating calibration weights using MRP. This seems very promising and not especially complicated to implement, and I plan to test it out.

I don’t think this is such a problem among statisticians, but among many practicing pollsters and quite a few political scientists I have noticed two strains of thought when it comes to opt-in surveys in particular. One is that you can get an enormous sample, and that makes up for whatever other problems may exist in the data and permits you to look at tiny subgroups. It’s the survey equivalent of saying the food is terrible but at least they give you large portions. As you say, this is obviously wrong but not well understood, so we wanted to address that.

The second is the idea that some statistical method will solve all your problems, particularly matching and MRP, and, to a lesser extent, various machine learning algorithms. Personally, I have spoken to quite a few researchers who read the X-box survey paper and took away the wrong lesson, which is that MRP can fix even the most unrepresentative data. But it’s not just MRP. It was MRP plus powerful covariates like party, plus a population frame that had the population distribution for those covariates. The success of surveys like the CCES that use sample matching leads to similar perceptions about sample matching. Again, it’s not the matching, it’s matching and all the other things. The issue of interactions is important, and we tried to get at that using random forests as the basis for matching and propensity weighting, and we did find that there was some extra utility there, but only marginally more than raking on the two-way interactions of the covariates, and no improvement at all when the covariates were just demographics. And the inclusion of additional covariates really didn’t add much in the way of additional variance.

You’re absolutely right that these same underlying principles apply to probability-based surveys, although I would not necessarily expect the empirical findings to match. They have very different data generating processes, especially by mode, different confounds, and different problems. In this case, our focus was on the specific issues that we’ve seen with online opt-in surveys. The fact that there’s no metaphysical difference between probability-based and nonprobability surveys is something that I’ve written about elsewhere, and we’ve got lots more research in the pipeline focused on both probability and nonprobability samples.

I think we’re in agreement. Fancy methods can work because they make use of more data. But then you need to include that “more data”; the methods don’t work on their own.

To put it another way: the existence of high-quality random-digit-dialing telephone surveys does not mean that a sloppy telephone survey will do a good job, even if it happens to use random digit dialing. Conversely, the existence of high-quality MRP adjustment in some surveys does not mean that a sloppy MRP adjustment will do a good job.

A good statistical method gives you the conditions to get a good inference—if you put in the work of data collection, modeling, inference, and checking.


  1. ghenly says:

    Is there a “not” missing in your penultimate sentence?

  2. Michael D Maltz says:

    Here’s a longish response to focusing on the data (perhaps more on going beyond the given data), pulled from an intro to a forthcoming book on doing ethnography in criminology:

    Three examples provide context to my strong belief in the need for a qualitative orientation. The first was my initial experience while consulting on police communications systems for the Boston Police Department from 1966-69. To satisfy my curiosity about the ways of the police, I requested, and was granted, permission to conduct an “experiment” on police patrol. The number of patrol cars in one police district was doubled for a few weeks to see if it had any effect on crime. And it did: compared to the “control” district, which had no arrests, the “experimental” district had six arrests. Moreover, there were no arrests at all for the same time period in either district the previous year, so I could calculate that p = 0.016, much less than .05. What a finding! Police patrol really works!

    On debriefing one of the arresting officers, one of the first lessons I learned was that police officers are not fungible. There are no extra police officers hanging around the station that can be assigned to the experimental district: they have to be drawn from somewhere else. The additional officers, who made all of the arrests, were from the BPD’s Tactical Patrol Force – the Marines of the department – who were normally assigned to deal with known trouble spots, and the two districts selected for the study were generally low-crime areas.

    In fact, the TPF officers already knew that a gang of car thieves/strippers was active in the experimental district and decided to take them out, which resulted in all of the arrests they made. They couldn’t wait to get back to working citywide, going after real crime, but took the opportunity to clean up what they considered to be a minor problem. So after that experience I realized that you have to get under the numbers to see how they are generated, or as I used to explain to students, to “smell” the data.

    Another example: Some years ago I was asked to be an expert (plaintiff’s) witness in a case in the Chicago suburbs, in which the defendant suburb’s police department was accused of targeting Latino drivers for DUI arrests to fill their arrest quotas. My job was to look at the statistical evidence prepared by another statistician (the suburb’s expert witness) and evaluate its merits. I was able to show that there were no merits to the analysis (the data set was hopelessly corrupted), and the case was settled before I had a chance to testify.

    What struck me after the settlement, however, was the geography and timing of the arrests. Most of them occurred on weekend nights on the road between the bars where most of the Latinos went to drink and the areas where they lived. None were located on the roads near the Elks or Lions clubs, where the “good people” bent their elbows.

    I blame myself for not seeing this immediately, but it helped me to see the necessity of going beyond the given data and looking for other clues and cues that motivate those actions that are officially recorded. While it may not be as necessary in some fields of study, in criminology it certainly is.

    A third example was actually experienced by my wife, who carried out a long-term ethnographic study of Mexican families in Chicago (Farr, 2006) all of whom came from a small village in Michoacán, Mexico. Numerous studies, primarily based on surveys, had concluded that these people were by and large not literate. One Saturday morning in the early 1990s, she was in one of their homes when various children began to arrive, along with two high school students. One of the students then announced (in Spanish, of course), “OK, let’s get to work on the doctrina (catechism),” and slid open the doors on the side of the coffee table, revealing workbooks and pencils, which she distributed to the kids.

    On another occasion, my wife was drinking coffee in the kitchen when all of the women (mothers and daughters) suddenly gathered at the entrance to the kitchen as someone arrived with a plastic supermarket bag full of something, which turned out to be religious books (in Spanish) on topics such as Getting Engaged and After the Children Come Along. Each woman eagerly picked out a book, and one of them said, “I am going to read this with my daughter.”

    Clearly these instances indicate that children in the catechism class and the women in the kitchen were literate. The then-current questionnaires that evaluated literacy practices, however, asked questions such as, “Do you subscribe to a newspaper? Do you have a library card? Do you have to read material at work?” In other words, the questionnaires (rightly so) didn’t just ask people outright, “Can you read?” but rather focused on the domains they thought required reading. Yet no questions dealt with religious literacy, since literacy researchers at the time did not include a focus on religion. The result? The literacy practices of these families were “invisible” to research.

    These anecdotes are but three among many that turned me off the then-current methods of learning about social activity, in these cases via (unexamined) data and (impersonal) questionnaires. Perhaps this has to do with my engineering (rather than scientific) background, since engineers deal with reality and scientists propound theories. To translate to the current topic, it conditioned me to take into consideration the social context, a recognition that context matters and that not all attributes of a situation or person can be seen as quantifiable “variables.” This means, for example, that a crime should be characterized by more than just victim characteristics, offender characteristics, time of day, etc. and that an individual should be characterized by more than just age, race, ethnicity, education, etc. or “so-so” (same-old, same-old) statistics. These require a deeper understanding of the situation, which ethnography is best suited, albeit imperfectly, to do—to put oneself in the position, the mindset, of the persons whose actions are under study.

    • Great examples, thanks for sharing.

      I do think though that “standard” statistical measures can be very useful. Sure, the questions may not address all the issues or may be blind to certain aspects (such as religious literacy in your examples), but this doesn’t mean we should stop collecting quantitative data; it means we should be careful about interpreting it and consider explanations other than what might come from “raw” number crunching. Last year I worked on a similar issue: alcohol usage/abuse. I found all sorts of measurement issues, clearly explainable by misunderstanding of the wording of the questions. For example, people claiming to drink 30 standard drinks per day on days they drank last month, who clearly meant that they drank 30 total drinks (1/day) in the last month. Anyone who drank daily and had 30 drinks a day would die within the first couple of days, even with the tolerance built up from being a hard-drinking alcoholic (those people drink closer to 12 to 15 per day; no one can physiologically eliminate 30 drinks of alcohol daily).

      anyway, paying attention to the reality is very important. Your comments about having engineering training resonated with me as well. Messy real-world stuff.

      • Michael D Maltz says:

        I don’t mean to imply that we should stop collecting quantitative data. Rather, we should understand how it is collected — and also consider what important stuff might have been left out of the collection process.

        • Agreed 100%, and the kinds of questions we should ask include things like “what are we actually measuring here?” or “how could the results we have differ from the thing we really wanted?” or “do these two sources of data have the same measurement issues and if not, how do they differ?”

          just never ever ever take for granted that the thing you have in your dataset measures the thing you want to know about directly.

        • I think that researchers should be out and about a lot more than they are. Some large percent seem to hang with their colleagues and specialties. Please please tell me I’m mistaken.

          I guess, more fundamentally, I would welcome experts being ‘personable’ or ‘friendly’. We are honed in a culture of complaint & emotional blackmail. Just listen to the news for a 1/2 hour or so.

          As for this thread. It brings up views raised in the following article.

          Inferential statistics are descriptive statistics

          Amrhein V, Trafimow D, Greenland S. (2018) Inferential statistics are descriptive statistics. PeerJ Preprints 6:e26857v2

    • Martha (Smith) says:

      Thanks to both Michael and Daniel for good examples and discussion.

  3. David Bailey says:

    Isn’t it a bit of an oversimplification to say that “increasing the sample size reduces variance but does nothing for bias”? Shouldn’t “mindlessly” be inserted in front of “increasing”?

    A powerful way to detect biases in multivariate data is to check if different subsamples give inconsistent results, and increasing the sample size allows more sensitive subsample testing and perhaps partial correction of some bias by more complex modelling. Yes, simply increasing the sample size may be wasted effort, but as you note “Fancy methods can work because they make use of more data.”

    • Martha (Smith) says:

      How does checking if different subsamples give inconsistent results detect bias?

      • David Bailey says:

        If bias is due to the characteristics of the sample not matching the population, then if different subsamples have different characteristics, the subsample results may disagree because they are biased differently. When there are such unexplained or unmodelled subsample disagreements, it means that you cannot trust the results from the complete sample.

        A simple physics example would be if you were determining the mass of the electron by making many measurements. The mass of the electron is not expected to change with time, so a classic simple check is to compare measurements taken during the day with those taken at night. (Day/night effects are common due to changes in temperature, humidity, electronic noise, mechanical vibration, ….) If the measurements disagree, it indicates that your measurement model is inadequate and your mass measurement may be biased. The difference between the day and night measurements is not sufficient to identify the source and size of the bias in the total sample, but it tells you there is some effect (i.e., possible bias) you don’t understand.

        Similarly, results of psychological studies were often biased because they were made on non-representative samples (e.g. college students). Even college students are not completely homogeneous, however, so if the samples were divided into male/female, Frosh/Seniors, urban/rural, Science/Humanities, …, these subsample results might disagree and indicate the overall result may be biased. (Of course, what might happen instead are statistically dubious published claims about the subsamples, sigh ….)

        In some sense, meta-analysis is all about looking for subsample differences, where the “subsamples” are each individual study. If all the studies are consistent, one believes them; if they are inconsistent, there is unexplained bias in some or all of the studies.

        • Martha (Smith) says:

          I think I get (more-or-less) what you’re trying to say. But I think the situation is more complicated than you seem to present. What I mean by this is that sample variability is always present, even if we take truly random samples from the population of interest. So detecting differences in estimates from different subsamples won’t necessarily imply bias. (We may be using the word “bias” in different ways — I’m not sure how you’re using it.) Another way of looking at this is what Andrew often says: That effects are variable, not fixed. This is expressed in statistical modeling by equations such as “value = parameter + error” where “error” (somewhat confusingly) refers to the natural variability from case to case.

        • AllanC says:

          What this sounds like to me is that you’re suggesting an increased amount of data helps you identify recognizable subsets. These subsets can then be removed/eliminated as required from the reference set allowing for more accurate inference.

          If I remember my Statistics for Experimenters correctly, Box believed that it was only the combination of data illuminating potential recognizable subsets and/or outliers with domain-specific knowledge that allowed a researcher to distinguish them with enough confidence to exclude them (as required) from the reference set. Barring domain-specific knowledge, removing outliers from the reference set can only be done at the researcher’s peril.

          If that is a correct interpretation of what you’re getting at then I am with Martha in the sense that this is not the usual use of the term of bias elimination; at least not in the way I would use it.

          • David Bailey says:

            My apologies for not being clear. One reason I read Andrew’s blog is that I like hearing about statistical issues from a perspective outside my field (physics), but I often struggle because of differences in terminology and background.

            I emphatically agree with you (and Box) that simply removing outlier subsets from your reference set is a very bad idea.

            My point is that if there are unexplained differences in subsamples, then it is risky to trust conclusions based on the whole sample. By “unexplained”, I mean that the differences are too large to be easily explained by random variations (including multiple comparison / look-elsewhere effects) and that they are inconsistent with the model being used to analyse the data. I guess I am saying that cross-validation or multilevel modelling methods usually work better with more data.

            A classic historical example of what I am talking about, often mentioned in statistics texts, is observer bias in astronomy; see, e.g., Hampel or Jeffreys. Different observers have systematic biases (the “personal equation”), and as Pearson and Jeffreys emphasized, those biases are not constant in time. It is easier to understand such biases if you have more data.

            For a simple example, imagine you have many measurements of a star. You might want to just average them and quote the uncertainty as the standard deviation over root N, but being cautious, you decide to compare measurements made by your different graduate student observers: Alice, Bob, and Charlie.

            Alice and Bob’s measurement averages agree, but Charlie’s differs by an amount much larger than expected from the observed random variations. This strongly suggests a systematic effect, and you might want to report the difference between the observers as a “systematic uncertainty” (i.e. plausible range of bias). But you worry that you may be seriously underestimating this uncertainty since it is based on a sample size of 3, so under plausible assumptions the associated probability distribution is a heavy-tailed Student’s t distribution with 2 degrees of freedom.

            In order to try to better understand the effect you decide to look at each observer’s measurements as a function of time, and you notice that they all agree except on Thursdays. You might be tempted at this point to just throw out the Thursday data, but without understanding the problem, you (correctly) feel this would be wrong. You then talk to the graduate students and discover that Thursday is “Half-Price Beer!” night at the Graduate Student Union Pub that Charlie regularly attends. This provides you with the “specific knowledge” that allows you to exclude the Thursday data and still sleep at night. All these checks would not have been possible if you did not have enough data to see the difference between Alice, Bob, and Charlie, or between Thursday and other days.

            Of course, my perspective is heavily influenced by the fact that physics is the easiest science, e.g. unlike people, electrons are cheap, identical, and constant. This means that physicists and astronomers often have well-justified belief that they are measuring a constant effect, so if we see an unexpected subsample difference our first assumption should always be that something is wrong with our measurement model. In contrast, social and life sciences have great challenges from real sample heterogeneity and complex poorly understood multiple interacting effects. It may be much easier to understand a star than it is to understand astronomers, but my main (and I thought uncontroversial) point is that more data – when sufficiently thoughtfully taken – doesn’t just reduce statistical uncertainty but allows us to make stronger tests for problems with the data that may bias the final result.

            • Martha (Smith) says:


              I agree with your statement, “I guess I am saying that cross-validation or multilevel modelling methods usually work better with more data” — especially the “multilevel modeling” part.

              This (plus your statement “I often struggle because of differences in terminology and background”) helps clarify (or at least suggest, to me at least) that when you have been talking about “subsamples” you have been thinking of “subsamples” defined by restricting to particular values of some variable (e.g., observer, in the example you gave). Yes, this kind of situation is one reason why multilevel models are important. And before multilevel models were feasible, the idea of “mixed models” including both “fixed” and “random” factors was developed to address this type of problem. In fact, if I remember correctly, mixed model methods first arose in analyzing astronomical observations, where “random” factors such as “day” could affect observational data (e.g., observing conditions on different days influence the data recorded).

              PS The reason I include “especially the multilevel modeling part” is that multilevel modeling isn’t just a method of “detecting” possible differences in “different subsamples” (which is what cross-validation does), but is a method for analyzing data where there is a (typically categorical) variable (such as “day” or “observer” or “apparatus used for observing”) that plausibly may influence observations — and does the analysis in a way that is better than analyzing, e.g., each day’s data individually, by what is sometimes called “borrowing strength” from observations across all values of the categorical variable.

              (I hope this clarifies some differences in use of terminology.)

  4. Anon says:

    “Personally, I have spoken to quite a few researchers who read the X-box survey paper and took away the wrong lesson, which is that MRP can fix even the most unrepresentative data.”

    I found the report by Mercer and colleagues helpful because I was certainly one of the readers of the X-box survey paper that drew this conclusion!
