She wants to know what are best practices on flagging bad responses and cleaning survey data and detecting bad responses. Any suggestions from the tidyverse or crunch.io?

A colleague who works in a field that uses a lot of survey research asks:

Can you recommend papers about detecting bad survey responses? We have some such methods where I work, but I’m curious what the Census Bureau and other big survey establishments do to flag bad responses. The Groves book doesn’t seem to have much.

My colleague continues:

I’ve looked through documentation and mostly see things they do during data collection to get good responses, which is of course great. I’m curious what’s their process during data cleaning. Are they looking for outliers relative to what someone’s other responses would predict?

I replied that this sounds like something that would be done in the tidyverse, so maybe someone from that world can offer some suggestions. Also, I know that the people at the YouGov spinoff crunch.io do lots of data cleaning, so maybe they have some document they can point to.

In my own work, I’ve had to clean data—there’s an example in Appendix A of Regression and Other Stories—but I don’t have a systematic workflow for the process.

I remember when we were analyzing a “How many X’s do you know” survey, we somehow stumbled across one respondent who answered 7 to every question, so we threw that person out of our data. But that’s just something we happened to notice, and there could well be other bad responses there that we hadn’t noticed.
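
A check for this kind of pattern is simple to automate. The following is only a minimal sketch, not the procedure used in that analysis: it assumes each respondent's answers are stored as a list of numbers, and the function name `flag_straightliners` and the `min_items` cutoff are hypothetical choices made for illustration.

```python
def flag_straightliners(responses, min_items=5):
    """Return ids of respondents who gave the same answer to every item.

    responses: dict mapping respondent id -> list of answers (None = skipped).
    min_items: assumed cutoff; identical answers on very few items are not flagged.
    """
    flagged = []
    for rid, answers in responses.items():
        given = [a for a in answers if a is not None]
        if len(given) >= min_items and len(set(given)) == 1:
            flagged.append(rid)
    return flagged


# Made-up example data, for illustration only.
survey = {
    "r1": [7, 7, 7, 7, 7, 7],
    "r2": [2, 10, 0, 35, 4, 1],
    "r3": [3, 3, None, None, None, None],
}
print(flag_straightliners(survey))  # ['r1']
```

A flag like this is a prompt to look at the record, not a rule for dropping it; whether a constant response is plausible depends on the questions.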

At the other extreme, what if you find yourself with possibly fabricated data, perhaps the work of some researcher such as Brian Wansink or Mary Rosh who can offer no convincing documentation that the study or survey in question ever took place, or perhaps a legitimate-sounding survey that you suspect was constructed using “curbstoning”—that’s what they call it when the lazy survey interviewer doesn’t bother knocking on doors to talk to people and instead sits on the curbstone outside the house and makes up plausible responses? Some researchers used statistical techniques to search for duplicate or near-duplicate records and have claimed that fabricated data is a big problem with international surveys, including respected organizations such as Afrobarometer, Arab Barometer, Americas Barometer, International Social Survey, Pew Global Attitudes, Pew Religion Project, Sadat Chair, and World Values Survey.
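
To give a sense of what a search for duplicate or near-duplicate records can look like, here is a minimal sketch that computes, for every pair of respondents, the share of items answered identically. It is not a reconstruction of any particular published method; the function names and the 0.85 threshold are assumptions chosen for illustration.

```python
from itertools import combinations


def percent_match(a, b):
    """Share of items on which two respondents gave the same answer."""
    if len(a) != len(b):
        raise ValueError("records must cover the same items")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def near_duplicates(records, threshold=0.85):
    """Return (id1, id2, match) for every pair at or above the threshold.

    threshold is an assumed illustrative cutoff, not an established standard.
    All pairs are compared, so this is only practical for modest sample sizes.
    """
    pairs = []
    for (id1, a), (id2, b) in combinations(records.items(), 2):
        match = percent_match(a, b)
        if match >= threshold:
            pairs.append((id1, id2, match))
    return pairs


# Made-up example data, for illustration only.
records = {
    "a": [1, 4, 2, 5, 3, 3, 1, 2, 4, 5],
    "b": [1, 4, 2, 5, 3, 3, 1, 2, 4, 1],  # differs from "a" on one item
    "c": [5, 1, 3, 2, 4, 1, 5, 3, 2, 4],
}
print(near_duplicates(records))  # [('a', 'b', 0.9)]
```

How many high-match pairs to expect by chance depends on the number of items and how concentrated the answers are, so any threshold needs to be judged against the specific questionnaire.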

29 thoughts on “She wants to know what are best practices on flagging bad responses and cleaning survey data and detecting bad responses. Any suggestions from the tidyverse or crunch.io?”

  1. I think these are important issues for anybody who relies on surveys (which I tend to avoid if at all possible). I also think there is a vague and dangerous line between ‘cleaning’ data and polluting the results. I’ll offer a few examples. I’ve always been struck by students who fill out course evaluations and check off all 1s or all 4s – are they actually reading the questions? Of course, good survey designs will scramble the questions so that 1 doesn’t always mean the same thing. But what do you then do with someone who answers all 1s? Throw out their responses? It is a bit like the hanging chad issues in the 2000 presidential election in Palm Beach County (the butterfly ballot). Second guessing people’s responses is a slippery slope – we know some responses don’t make sense, but where do you draw the line between a response you don’t trust and people that just have strange beliefs?

    In environmental economics surveys, people are often asked about the willingness to pay for improved environmental quality. It is common for some responses to be thrown out if the amounts ‘seem’ unrealistic (extreme case: willingness to pay seems to exceed income). I’ve seen these referred to as “protest bids” at times. But surely a “protest” is important information, and discarding that bid is biasing the results.

    Another area would be probability assessments. Experiments have shown that people often assign a higher probability to a compound event (where additional specifics are included – I believe the Kahneman example was about the probability that Serena would win or the probability that she would win against Sharapova in 3 sets) than to the simpler, less constrained event. This is illogical, but exhibits real beliefs. Usually such assessments would not be discarded, but I can imagine circumstances where they might be ‘cleaned.’ It is pretty common to see experiments that have some questions to determine whether the participants understand the instructions – and their results are discarded if they demonstrate a lack of understanding. But where is the line between a lack of understanding and illogical beliefs?

    There is much room for intentional and unintentional steering of results while cleaning survey data. Question wording is almost always somewhat vague, so interpreting the responses is difficult. I think that the more sophisticated the question, the worse the problem actually becomes. If you very carefully word a question so as to avoid misunderstandings, there is then a mismatch between the effort that goes into designing the question and the effort a respondent puts into their response. That is, I think researchers often mistake their carefully designed surveys for carefully thought out responses. Most people are not very thoughtful when answering surveys (conjecture on my part), so such subtlety may be lost.

    I’m probably more dismissive of surveys than most people. I realize there may not be any other way to gauge public opinion about many things. But there are so many issues involved with design and interpretation that I try to avoid them as much as possible. Along the lines of wanting open data, I think all surveys should provide the raw survey data before and after cleaning with clear explanation of what cleaning took place.

    • Dale, those are all excellent points.

      Mercifully, I’ve only been involved with survey analysis once, in a survey about ventilation behavior in new houses. We wanted to try to quantify the extent to which people try to ensure that they get adequate ventilation, e.g. do they open a window or turn on their vent hood when cooking, do they go for days without opening any windows at all, etc. A lot of old houses are leaky enough that the air quality probably isn’t too terrible even if there’s no attempt to provide ventilation, but new houses are very airtight so the indoor air can get pretty unhealthy or even very unhealthy, especially if people have furniture that offgases a lot of formaldehyde or other toxins.

      We wanted to find out about behavior in different seasons and weather conditions, and ask about not just typical behavior but also basically how bad can it get for a full month. Everyone involved was pretty skeptical about the quality of the responses — including the UC Berkeley Survey Research Center, who helped with the survey design and conducted the survey. But what can ya do, this sort of information was seen as necessary for supporting decisions around the building code.

      Respondents were paid for participating; I think they got $30 or so if they completed the survey.

      Fortunately, the vast majority of respondents did seem to take it seriously; they didn’t just answer all the questions the same way, for example. But there were still lots of inconsistencies, or at least apparent inconsistencies, like people who said they often had good kitchen ventilation but reported rarely opening a window or turning on the exhaust hood. Of course, as part of preparation I took the survey myself (more than once, as we revised it) and even I had trouble giving fully consistent answers.

      Part of the problem was that we needed or at least wanted information about all seasons, but the survey had to be done over the course of a few weeks due to budget reasons. It would have been great to do a big survey where we asked twenty people a day a small set of questions: did you open any windows today? Did you close any? Are there any that were already open and you left them open? No need to try to make people recall their behavior eight months ago or try to estimate how many hours per week their bathroom window is open.

      In the end the experience made me glad I didn’t have to analyze survey data more often.

      • I looked at a govt survey on risky behavior about 7 years ago, and the questions about alcohol consumption. There was a question where they asked something like “in the last thirty days, how many alcoholic drinks did you have on average per day?”

        There were a nontrivial number of people who answered 30 or 60.

        Now, 30 drinks in 24 hours might not kill an alcoholic if they did it once. But an average of 30 drinks each day for 30 days will kill pretty much anyone. 60 drinks in 24 hours is gonna kill even highly adapted alcoholics, if not in the first 24 hours then in the second.

        To put this quantity in perspective: 44 ml of 80-proof vodka is one standard drink, so a 750 ml bottle of vodka contains only about 17 drinks. Thirty drinks a day is therefore close to two full bottles of vodka a day, and 60 drinks a day would be about 3.5 bottles of vodka per day, every day for 30 days.

        And yet it was clear people had analyzed these surveys with those data points included and there was basically a whole book that seemed to rely on this data for its conclusions.

        The only reasonable conclusion for these particular answers was that people who routinely drink 1 or 2 drinks per day answered as if the question were asking about the total over 30 days, not the average. It was around 2% of answers in this category, and they corresponded to something like half of all alcohol sales or something dumb like that.

        The true alcoholics gave much more reasonable answers that were more specific, like 17 or 13 or something. Those were the unfortunate souls who titrate their intake very precisely.

        Reanalyzing this data basically completely changed the understanding of how many true addicts there are and how much they drink, because the data errors were about the same size as the population needing real help.
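
As an illustration of the vodka arithmetic in the comment above, here is a small sketch that redoes the calculation. It uses only the two figures given there (44 ml per standard drink, 750 ml per bottle); the function name `bottles_per_day` is made up for this example.

```python
ML_PER_DRINK = 44    # one standard drink of 80-proof vodka, as stated above
ML_PER_BOTTLE = 750  # bottle size, as stated above

drinks_per_bottle = ML_PER_BOTTLE / ML_PER_DRINK


def bottles_per_day(drinks):
    """Bottles of vodka implied by a given number of standard drinks per day."""
    return drinks * ML_PER_DRINK / ML_PER_BOTTLE


print(round(drinks_per_bottle, 1))    # 17.0
print(round(bottles_per_day(30), 2))  # 1.76
print(round(bottles_per_day(60), 2))  # 3.52
```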

        • I agree that if someone says they had 60 drinks (per day) over the past 30 days, they really mean they had about two drinks per day.

          But what about someone who gave the answer 6? If there are many of these then I’d guess most of them had the same misinterpretation: they had six drinks in the past thirty days, not six drinks per day. But six drinks per day is far from impossible; there are (amazingly) people who drink a six-pack a day. Not as many as there are who have roughly one drink per five days, I assume (and hope!), but not so few that you can change all those 6 answers to 6/30.

          Tough problem to have if you are analyzing the data.

        • Some surveys, when asking about behavioral frequency, allow people to pick the unit – per day, week, month, or year. We usually convert it on the backend so all responses are on the same unit.
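
The back-end conversion described in the comment above might look something like the following. This is a sketch under stated assumptions, not any survey organization's actual code: the conversion factors (365 days, 52 weeks, 12 months per year) and the names `PERIODS_PER_YEAR` and `to_per_week` are illustrative.

```python
# Assumed conversion factors: how many of each unit fit in one year.
PERIODS_PER_YEAR = {"day": 365.0, "week": 52.0, "month": 12.0, "year": 1.0}


def to_per_week(count, unit):
    """Convert 'count times per unit' into times per week."""
    if unit not in PERIODS_PER_YEAR:
        raise ValueError(f"unknown unit: {unit!r}")
    per_year = count * PERIODS_PER_YEAR[unit]
    return per_year / PERIODS_PER_YEAR["week"]


# Made-up answers, each given in the respondent's chosen unit.
answers = [(2, "day"), (3, "week"), (6, "month"), (52, "year")]
for count, unit in answers:
    print(count, "per", unit, "->", round(to_per_week(count, unit), 2), "per week")
```

Recording the unit alongside the number also keeps the raw response available, which helps when a converted value looks implausible.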

    • I’m sympathetic to Dale’s comment because I do agree that the line between cleaning the data and justifying said cleaning of the data is blurry. The problem is that any interpretation of survey data must contend with at least two intentions: the intention of the researcher and the intention of the respondent. The survey is designed as a stand-alone medium: respondents don’t have a chance to clarify or take issue with a statement if they don’t understand it or don’t know how to interpret it. Uher (2023) provides a good example in their paper, where raters provide an open-ended response as to what the item “gets nervous easily” means to them. Some interpret that item to mean not sociable, or something about body language, or something related to one’s performance. Respondents might have their own interpretations of words or items that differ from the researcher’s.

      Even then, you have the problem of the intention of the respondent. What did they mean when they checked all 5’s for an item? How about when they checked all 3’s (i.e., some neutral response)? What about when they check all 1s? How about when they check all 1s except for maybe one or two items? I mean, there is a whole literature now that tries to address careless responding, using instructed attention checks, bogus items, IP addresses, response times, etc. I know that researchers tend to use attention checks as a criterion for excluding participants/responses, as a way to assess if someone is paying attention to your study. But it’s not like if someone passes the attention check that they necessarily pay attention to all aspects of your study. Even if someone passes your attention check, they can still misread instructions or still give ratings of all 1’s to your study or give responses that are inconsistent. Like Dale said, it’s not clear where you draw the line once you start to second-guess people’s responses.

      I’m not sure what a good solution is, because it depends on what we mean by “bad responses.” To me, the issue is the problem of the measurement itself (i.e., that people are limited in the way that they can respond in the survey) or whatever construct you are trying to measure. Perhaps people are not thoughtful when responding to surveys, but I don’t think people are always consistent in their responses unless they’re motivated to be consistent. As Phil points out, depending on what the survey asks, it may be difficult to give consistent responses because we are not consistently thinking about how our beliefs and/or actions are consistent with each other.

      If you are working with surveys, the only thing I can think of is to have some open-ended response to at least verify how people are interpreting your survey items of interest. If you are measuring something like “depression”, then ask a set of respondents for their examples/definitions of “depression”. At least have some way of corroborating how people interpret your survey items.

      • When designing a new survey, or new questions for a survey, if you have the time/funding/etc., it can really help to conduct some cognitive interviews with respondents. We recently did this with some questions regarding nutrition security and learned that “healthy food” can be a pretty vague term for some folks.

    • Interesting work. I’ll look more closely, but a cursory view provides a good example of my concerns. Here is an excerpt from the first source you linked to:
      “For example, a mischievous responder might find it funny to report that he eats carrots, fruit, potatoes, and salads each “four or more times a day”; is extremely tall; is unsure whether he has asthma; has never been to the dentist; and that he identifies as “gay,” even if none of these is true for this individual in reality. Thus, the presence of mischievous responders creates spurious relationships between these predictor (screener) items and sexuality that would not otherwise exist if all individuals were responding truthfully.”
      I agree such responses could indicate someone systematically lying and thus might be thrown out. But it is also possible that the person responded truthfully about their sexual orientation but felt like providing joke responses to questions about carrots. Even if I agree with the researchers about such responses, I think it underscores the ambiguous nature of such “cleaning.” It seems like the extreme of forking paths. It is good that they report results as they cull the responses, so we can get an idea of how much the cleaning affects the results, but it leaves me uncertain what, if anything, the survey tells us about the prevalence of various sexual identities among American youth. Such uncertainty can’t be escaped, however, so I can’t really complain.

      • Imho, as with the alcohol responses I mention elsewhere, the data should be reported with no cleaning at all, and the data should be analyzed with a MODEL that includes all the causes we can imagine of the given responses, including data recording errors, misunderstanding, and mischief.

        You inherently need to assign Bayesian probabilities here since we don’t see ANY true data on these questions. The use of this kind of model in reality seems infrequent but needs to be much more common.
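
To make the modeling suggestion above concrete for the alcohol question, here is a toy two-component sketch: a respondent either reports a daily average as asked, or misreads the question and reports a 30-day total. Everything numerical here is a made-up assumption for illustration (an exponential prior on average drinks per day with mean 1, and a 10% misreading rate), and the function names are hypothetical; a real analysis would estimate these quantities and use a more realistic distribution.

```python
import math


def exp_density(d, mean):
    """Exponential density for average drinks per day (assumed prior)."""
    return math.exp(-d / mean) / mean


def prob_misread(x, p_misread=0.1, mean_daily=1.0, days=30):
    """Posterior probability that a reported x is a 30-day total, not a daily average.

    p_misread and mean_daily are made-up illustrative values, not estimates.
    """
    # If the question was misread, the true daily average is x / days; dividing
    # by days is the change-of-variables factor for the density of the total.
    like_misread = exp_density(x / days, mean_daily) / days
    like_correct = exp_density(x, mean_daily)
    numerator = p_misread * like_misread
    return numerator / (numerator + (1 - p_misread) * like_correct)


for x in (1, 6, 60):
    print(x, round(prob_misread(x), 3))
```

Under these particular assumptions, an answer of 60 is almost certainly a monthly total, an answer of 1 is almost certainly a daily average, and an answer of 6 is close to a coin flip, which matches the difficulty raised earlier in the thread. The conclusions are only as good as the assumed prior and misreading rate.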

  2. The survey data request I most frequently confront is after a visit to a medical care provider. I make it a point never to provide a response for fear that my remarks will be used against me in case there is some subsequent problem/issue/dispute. I am, it should be noted, no more paranoid than the next man, provided the next man is Richard M. Nixon. Nevertheless, is it unreasonable of me to suspect that anonymity no longer exists? We live in a world where everything is recorded, including this innocuous confession on this blog, so I deny I ever wrote this.

  3. Andrew mentioned the term, “curbstoning” and gave a definition related to nonexistent sampling. Here is a different definition of curbstoning:
    https://www.autolist.com/guides/curbstoning

    Curbstoning is an illegal scheme in which people draw car shoppers to places such as the side of the road (curbside) or a vacant lot and sell them unfit used cars. A curbstoner poses as a vehicle’s owner to avoid both city and state permits or licensing requirements.

  4. Not helpful after the fact, but the Monitoring the Future survey had fake drug questions to catch teenagers who just said they take all the drugs all the time.

    I have an example of finding near-duplicate responses, https://github.com/apwheele/Blog_Code/blob/master/Python/SurveyMatch/DupsSurvey.ipynb, so that is for if you are concerned people retake the survey (or the survey taker themselves is fudging data). The example I use does happen to pull up clusters of people who straight-line their answers, though.
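
The fake-question idea from the start of this comment can be sketched in a few lines. This is an illustration only, not how any particular survey implements it: the data layout, the function name `flag_bogus_endorsers`, and the fictitious item name "zanthrol" are all made up for the example.

```python
def flag_bogus_endorsers(responses, bogus_items):
    """Return ids of respondents who report using any fictitious item.

    responses: dict mapping respondent id -> dict of item name -> bool.
    bogus_items: names of items that do not exist, so a truthful answer is False.
    """
    return [
        rid
        for rid, answers in responses.items()
        if any(answers.get(item, False) for item in bogus_items)
    ]


# Made-up example data; "zanthrol" is a hypothetical nonexistent drug.
responses = {
    "s1": {"alcohol": True, "cannabis": False, "zanthrol": False},
    "s2": {"alcohol": True, "cannabis": True, "zanthrol": True},
}
print(flag_bogus_endorsers(responses, ["zanthrol"]))  # ['s2']
```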

  5. In the past I’ve visualized the rarity of an observation using “data depth”, which I often think of as the multivariate version of a quantile (see here: https://link.springer.com/chapter/10.1007/978-1-4613-0045-8_4). From the document I’m linking to:

    “A data depth is a function that indicates, in some sense, how deep a point is located with respect to a given data cloud (or to a given probability distribution) in d-space. The depth defines a center of the cloud, that is the set of deepest points, and measures how far away a point is located from the center. Various notions of data depth can be employed in procedures of multivariate data analysis, such as cluster analysis and the detection of outlying data”.

    I don’t know how well the notion of data depth works when applied to discrete or categorical responses. Still, I think it would be helpful to start looking for participants whose responses are consistently sloppy.
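
For readers who have not seen a depth function, here is a minimal sketch of one common choice, the Mahalanobis depth 1 / (1 + d²), where d² is the squared Mahalanobis distance from the sample mean. It is restricted to two numeric items so that the covariance matrix can be inverted by hand, and it does not address the caveat above about discrete or categorical responses. The function name and example data are made up for illustration.

```python
def mahalanobis_depth_2d(points):
    """Mahalanobis depth 1 / (1 + d^2) of each 2-D point relative to the sample."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    depths = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Squared Mahalanobis distance via the closed-form 2x2 inverse.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        depths.append(1.0 / (1.0 + d2))
    return depths


# Made-up example: five clustered respondents and one far from the rest.
points = [(1, 2), (2, 1), (2, 3), (3, 2), (2, 2), (9, 9)]
depths = mahalanobis_depth_2d(points)
shallowest = depths.index(min(depths))
print(points[shallowest])  # (9, 9)
```

Because the mean and covariance are themselves pulled toward extreme points, this version can understate how unusual an outlier is; more robust depth notions exist but are harder to compute.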

  6. As for tidyverse-style packages, there’s `srvyr`, which extends `survey` by implementing some of the dplyr verbs. But neither package seems to include functions for detecting or excluding suspicious responses. It may be that there are no such procedures which are standard across disciplines.

  7. I would be surprised if effective bad response flagging practices don’t depend strongly on the type of survey, questions, population, etc.; after all, sloppiness, dishonesty, etc, are psychological/sociological/political phenomena, so statistical practices that explicitly encode domain-specific experience and intuition from those fields as applicable to the particular survey — in a way, treating this as part of the data generation process rather than a technical artifact — might be more effective than more universal approaches.

  8. The search for outliers in free-form text is made more efficient with large language models (LLMs) via their remarkable ability to quantify uncertainty over high-dimensional inputs with minimal distributional assumptions.

    My general, first-pass recommendation for the problem at hand would be to model it as binary classification after labeling some examples: E.g., class 0 (“bad” responses) and class 1 (“good” responses).

    Next, use Reexpress (either the currently available visual, no-code macOS application, or alternatively, the Python code that will be released this fall) to search for outliers: For class 1, look at the documents with high epistemic uncertainty (i.e., those that have a looser connection to the class 1 labeled data) AND for class 0, look at the documents with low epistemic uncertainty (i.e., those that have a stronger connection to the class 0 labeled data). As you find outliers, label them, and repeat the process.

    What if you need to bootstrap starting from no observed examples of class 0 — or more realistically, too few examples of class 0? One possibility is to generate synthetic examples from an LLM. Alternatively, use some unrelated corpora to augment the examples of class 0, such as scientific articles or Shakespeare plays. The key is to create a contrast to the in-distribution data of class 1.
