She wants to know what are best practices on flagging bad responses and cleaning survey data and detecting bad responses. Any suggestions from the tidyverse or crunch.io?

Posted on October 13, 2024 9:36 AM by Andrew

A colleague who works in a field that uses a lot of survey research asks:

Can you recommend papers about detecting bad survey responses? We have some such methods where I work, but I’m curious what the Census Bureau and other big survey establishments do to flag bad responses. The Groves book doesn’t seem to have much:

My colleague continues:

I’ve looked through documentation and mostly see things they do during data collection to get good responses, which is of course great. I’m curious what’s their process during data cleaning. Are they looking for outliers relative to what someone’s other responses would predict?

I replied that this sounds like something that would be done in the tidyverse, so maybe someone from that world can offer some suggestions? Also I know that the people at the YouGov spinoff crunch.io do lots of data cleaning, so maybe they have some document they can point to.

In my own work, I’ve had to clean data—there’s an example in Appendix A of Regression and Other Stories—but I don’t have a systematic workflow for the process.

I remember when we were analyzing a “How many X’s do you know?” survey, we somehow stumbled across one respondent who answered 7 to every question, so we threw that person out of our data, but that’s just something we happened to notice, and there could well be other bad responses there that we hadn’t noticed.
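
To make that less dependent on luck, one could screen every respondent for this kind of straightlining. Here is a minimal Python sketch, not the procedure we used: it computes, for each respondent, the fraction of the battery covered by the longest run of identical consecutive answers and flags anyone at or above a threshold. The data, the names `longest_run_fraction` and `flag_straightliners`, and the thresholds are all made up for illustration.

```python
def longest_run_fraction(answers):
    """Share of items covered by the longest run of identical consecutive answers."""
    if not answers:
        return 0.0
    longest = current = 1
    for prev, cur in zip(answers, answers[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest / len(answers)


def flag_straightliners(responses, threshold=1.0):
    """Ids of respondents whose longest identical run covers >= threshold of items."""
    return [rid for rid, answers in responses.items()
            if longest_run_fraction(answers) >= threshold]


# Hypothetical toy data: respondent id -> answers to a battery of count questions.
responses = {
    "r1": [3, 0, 12, 1, 5, 2],
    "r2": [7, 7, 7, 7, 7, 7],  # the "7 to every question" pattern
    "r3": [1, 1, 1, 4, 0, 2],
}

print(flag_straightliners(responses))       # ['r2']
print(flag_straightliners(responses, 0.5))  # ['r2', 'r3']
```

A flag like this is only a prompt to look at the record; a run of identical answers can also be a sincere response.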

At the other extreme, what if you find yourself with possibly fabricated data, perhaps the work of some researcher such as Brian Wansink or Mary Rosh who can offer no convincing documentation that the study or survey in question ever took place, or perhaps a legitimate-sounding survey that you suspect was constructed using “curbstoning”—that’s what they call it when the lazy survey interviewer doesn’t bother knocking on doors to talk to people and instead sits on the curbstone outside the house and makes up plausible responses? Some researchers used statistical techniques to search for duplicate or near-duplicate records and have claimed that fabricated data is a big problem with international surveys, including respected organizations such as Afrobarometer, Arab Barometer, Americas Barometer, International Social Survey, Pew Global Attitudes, Pew Religion Project, Sadat Chair, and World Values Survey.
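
The duplicate search can be sketched in a few lines of Python. This is a toy illustration of the general idea (compare every pair of respondents and report pairs that agree on a very high share of items), not the procedure those researchers used; the records, the function names, and the 0.85 cutoff are assumptions chosen for the example.

```python
from itertools import combinations


def percent_match(a, b):
    """Share of items on which two equal-length answer vectors agree."""
    if len(a) != len(b):
        raise ValueError("answer vectors must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def near_duplicates(records, cutoff=0.85):
    """Return (id1, id2, match) for every pair agreeing on >= cutoff of items."""
    pairs = []
    for (id1, a), (id2, b) in combinations(records.items(), 2):
        match = percent_match(a, b)
        if match >= cutoff:
            pairs.append((id1, id2, match))
    return pairs


# Hypothetical records: respondent id -> answers to ten 5-point items.
records = {
    "a": [1, 4, 2, 5, 3, 3, 2, 1, 4, 5],
    "b": [1, 4, 2, 5, 3, 3, 2, 1, 4, 2],  # differs from "a" on the last item only
    "c": [5, 1, 3, 2, 4, 1, 5, 2, 3, 1],
}

print(near_duplicates(records))  # [('a', 'b', 0.9)]
```

Comparing all pairs is quadratic in the number of respondents, and on a short questionnaire with few response options high agreement can arise by chance, so any cutoff has to be judged against what honest data would produce.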

29 thoughts on “She wants to know what are best practices on flagging bad responses and cleaning survey data and detecting bad responses. Any suggestions from the tidyverse or crunch.io?”

Dale Lehman on October 13, 2024 at 10:04 am said:

I think these are important issues for anybody who relies on surveys (which I tend to avoid if at all possible). I also think there is a vague and dangerous line between ‘cleaning’ data and polluting the results. I’ll offer a few examples. I’ve always been struck by students who fill out course evaluations and check off all 1s or all 4s – are they actually reading the questions? Of course, good survey designs will scramble the questions so that 1 doesn’t always mean the same thing. But what do you then do with someone who answers all 1s? Throw out their responses? It is a bit like the hanging chad issues in the 2000 presidential election in Palm Beach County (the butterfly ballot). Second guessing people’s responses is a slippery slope – we know some responses don’t make sense, but where do you draw the line between a response you don’t trust and people that just have strange beliefs?

In environmental economics surveys, people are often asked about the willingness to pay for improved environmental quality. It is common for some responses to be thrown out if the amounts ‘seem’ unrealistic (extreme case: willingness to pay seems to exceed income). I’ve seen these referred to as “protest bids” at times. But surely a “protest” is important information, and discarding that bid is biasing the results.

Another area would be probability assessments. Experiments have shown that people often assign a higher probability to a compound event (where additional specifics are included – I believe the Kahneman example was about the probability that Serena would win or the probability that she would win against Sharapova in 3 sets) than to the simpler, less constrained event. This is illogical, but exhibits real beliefs. Usually such assessments would not be discarded, but I can imagine circumstances where they might be ‘cleaned.’ It is pretty common to see experiments that have some questions to determine whether the participants understand the instructions – and their results are discarded if they demonstrate a lack of understanding. But where is the line between a lack of understanding and illogical beliefs?

There is much room for intentional and unintentional steering of results while cleaning survey data. Question wording is almost always somewhat vague, so interpreting the responses is difficult. I think that the more sophisticated the question, the worse the problem actually becomes. If you very carefully word a question so as to avoid misunderstandings, there is then a mismatch between the effort that goes into designing the question and the effort a respondent puts into their response. That is, I think researchers often mistake their carefully designed surveys for carefully thought out responses. Most people are not very thoughtful when answering surveys (conjecture on my part), so such subtlety may be lost.

I’m probably more dismissive of surveys than most people. I realize there may not be any other way to gauge public opinion about many things. But there are so many issues involved with design and interpretation that I try to avoid them as much as possible. Along the lines of wanting open data, I think all surveys should provide the raw survey data before and after cleaning with clear explanation of what cleaning took place.

- Joshua on October 13, 2024 at 12:40 pm said:
  
  Dale –
  
  Your comment reminds me of when surveys show a large % of people want to pay less in taxes, but also show that people don’t want to give up services.
  
  For example: We argue that the ostensibly strong support for lower taxes is the result of survey measures that fail to account for fiscal trade-offs.
  
  https://www.tandfonline.com/doi/full/10.1080/13501763.2024.2333856#abstract
  
  - John N-G on October 13, 2024 at 4:09 pm said:
    
    I wonder why they didn’t argue that the ostensibly strong support for services is the result of survey measures that fail to account for fiscal trade-offs…
    
    - Joshua on October 13, 2024 at 7:24 pm said:
      
      Fair enough. To answer your rhetorical question – presumably because they are advocates of taxes to pay for services.
  - Joshua on October 13, 2024 at 7:30 pm said:
    
    I suppose I should note that they also want to pay fewer taxes.
    
- Phil on October 13, 2024 at 1:24 pm said:
  
  Dale, those are all excellent points.
  
  Mercifully, I’ve only been involved with survey analysis once, in a survey about ventilation behavior in new houses. We wanted to try to quantify the extent to which people try to ensure that they get adequate ventilation, e.g. do they open a window or turn on their vent hood when cooking, do they go for days without opening any windows at all, etc. A lot of old houses are leaky enough that the air quality probably isn’t too terrible even if there’s no attempt to provide ventilation, but new houses are very airtight so the indoor air can get pretty unhealthy or even very unhealthy, especially if people have furniture that offgases a lot of formaldehyde or other toxins.
  
  We wanted to find out about behavior in different seasons and weather conditions, and ask about not just typical behavior but also basically how bad can it get for a full month. Everyone involved was pretty skeptical about the quality of the responses — including the UC Berkeley Survey Research Center, who helped with the survey design and conducted the survey. But what can ya do, this sort of information was seen as necessary for supporting decisions around the building code.
  
  Respondents were paid for participating; I think they got $30 or so if they completed the survey.
  
  Fortunately, the vast majority of respondents did seem to take it seriously; they didn’t just answer all the questions the same way, for example. But there were still lots of inconsistencies, or at least apparent inconsistencies, like people who said they often had good kitchen ventilation but reported rarely opening a window or turning on the exhaust hood. Of course as part of preparation I took the survey myself (more than once, as we revised it) and even I had trouble giving fully consistent answers.
  
  Part of the problem was that we needed or at least wanted information about all seasons, but the survey had to be done over the course of a few weeks due to budget reasons. It would have been great to do a big survey where we asked twenty people a day a small set of questions: did you open any windows today? Did you close any? Are there any that were already open and you left them open? No need to try to make people recall their behavior eight months ago or try to estimate how many hours per week their bathroom window is open.
  
  In the end the experience made me glad I didn’t have to analyze survey data more often.
  
  - Daniel Lakeland on October 13, 2024 at 2:15 pm said:
    
    I looked at a govt survey on risky behavior about 7 years ago, specifically at the questions about alcohol consumption. There was a question where they asked something like “in the last thirty days how many alcoholic drinks did you have on average per day?”
    
    There were a nontrivial number of people who answered 30 or 60.
    
    Now, 30 drinks in 24 hours might not kill an alcoholic if they did it once. But 30 drinks average each day for 30 days will kill pretty much anyone. 60 drinks in 24 hours is gonna kill even highly adapted alcoholics if not in the first 24 hours then in the second.
    
    To understand this quantity: 44 ml of 80-proof vodka is one standard drink, so a 750 ml bottle of vodka contains only about 17 drinks. So 30 drinks a day is close to two full bottles of vodka a day, and 60 drinks a day would be ~4 bottles of vodka per day every day for 30 days.
    
    And yet it was clear people had analyzed these surveys with those data points included and there was basically a whole book that seemed to rely on this data for its conclusions.
    
    The only reasonable conclusion for these particular answers was that people who routinely drink 1 or 2 drinks per day answered as if the question were asking about the total over 30 days, not the average. It was around 2% of answers in this category and they corresponded to something like half of all alcohol sales or something dumb like that.
    
    The true alcoholics gave much more reasonable answers that were more specific, like 17 or 13 or something. Those were the unfortunate souls who titrate their intake very precisely.
    
    Reanalyzing this data basically completely changed the understanding of how many true addicts there are and how much they drink. Because data errors were about the same size as the population needing real help.
    
    - Phil on October 14, 2024 at 1:02 am said:
      
      I agree that if someone says they had 60 drinks (per day) over the past 30 days, they really mean they had about two drinks per day.
      
      But what about someone who gave the answer 6? If there are many of these then I’d guess most of them had the same misinterpretation: they had six drinks in the past thirty days, not six drinks per day. But six drinks per day is far from impossible, there are (amazingly) people who drink a six-pack a day. Not as many as there are who gave roughly one drink per five days, I assume (and hope!), but not so few that you can change all those 6 answers to 6/30.
      
      Tough problem to have if you are analyzing the data.
    - SurveyResearcher on October 15, 2024 at 6:07 pm said:
      
      Some surveys, when asking about behavioral frequency, allow people to pick the unit – per day, week, month, or year. We usually convert it on the backend so all responses are on the same unit.
- Jess on October 13, 2024 at 10:57 pm said:
  
  I’m sympathetic to Dale’s comment because I do agree that the line between cleaning the data and justifying said cleaning of the data is blurry. The problem is that any interpretation of survey data must contend with at least two intentions: the intention of the researcher and the intention of the respondent. The survey is designed as a stand-alone medium: respondents don’t have a chance to clarify or take issue with a statement if they don’t understand or don’t know how to interpret it. Uher (2023) provides a good example in their paper where raters provide an open-ended response as to what the item “gets nervous easily” means to them. Some interpret that item to mean not sociable or something about body language or something related to one’s performance. Respondents might have their own interpretations of words or items that differ from the researcher’s.
  
  Even then, now you have the problem of the intention of the respondent. What did they mean when they checked all 5’s for an item? How about when they checked all 3’s (i.e., some neutral response)? What about when they check all 1s? How about when they check all 1s except for maybe one or two items? I mean there is a whole literature now that tries to address careless responses using instructed attention checks, bogus items, IP addresses, response times, etc. I know that researchers tend to use attention checks as a criterion for excluding participants/responses as a way to assess if someone is paying attention to your study. But it’s not like if someone passes the attention check that they necessarily pay attention to all aspects of your study. Even if someone passes your attention check, they can still misread instructions or still give ratings of all 1’s to your study or give responses that are inconsistent. Like Dale said, it’s not clear where you draw the line once you start to second guess people’s responses.
  
  I’m not sure what is a good solution because it depends what we mean by “bad responses”. To me, the issue is the problem of the measurement itself (i.e., that people are limited in the way that they can respond in the survey) or whatever construct you are trying to measure. Perhaps people are not thoughtful when responding to surveys, but I don’t think people are always consistent in their responses unless they’re motivated to be consistent. As Phil points out, depending on what the survey asks, it may be difficult to give consistent responses because we are not consistently thinking how our beliefs and/or actions are consistent with each other.
  
  If you are working with surveys, the only thing I can think of is to have some open-ended response to at least verify how people are interpreting your survey items of interest. If you are measuring something like “depression”, then ask a set of respondents their examples/definitions of “depression”. At least have some way of corroborating how people interpret your survey items.
  
  - SurveyResearcher on October 15, 2024 at 6:10 pm said:
    
    When designing a new survey, or new questions for a survey, if you have the time/funding/etc., it can really help to conduct some cognitive interviews with respondents. We recently did this with some questions regarding nutrition security and learned that “healthy food” can be a pretty vague term for some folks.
    
Ravi Shroff on October 13, 2024 at 11:04 am said:

My colleague Joe Cimpian has done work on detecting “mischievous responders”, particularly in the context of measuring health disparities among LGBTQ youth that could be relevant:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6215371/
https://journals.sagepub.com/doi/full/10.1177/2332858419888892
https://link.springer.com/article/10.1007/s10508-020-01661-7

- Andrew on October 13, 2024 at 11:26 am said:
  
  Ravi:
  
  Interesting. I think my colleague is more concerned with sloppy responses than intentionally wrong answers, but perhaps similar techniques can be used to find them.
  
- Dale Lehman on October 13, 2024 at 11:40 am said:
  
  Interesting work. I’ll look more closely, but a cursory view provides a good example of my concerns. Here is an excerpt from the first source you linked to:
  “For example, a mischievous responder might find it funny to report that he eats carrots, fruit, potatoes, and salads each “four or more times a day”; is extremely tall; is unsure whether he has asthma; has never been to the dentist; and that he identifies as “gay,” even if none of these is true for this individual in reality. Thus, the presence of mischievous responders creates spurious relationships between these predictor (screener) items and sexuality that would not otherwise exist if all individuals were responding truthfully.”
  I agree such responses could indicate someone systematically lying and thus might be thrown out. But it is also possible that the person responded truthfully about their sexual orientation but felt like providing joke responses to questions about carrots. Even if I agree with the researchers about such responses, I think it underscores the ambiguous nature of such “cleaning.” It seems like the extreme of forking paths. It is good that they report results as they cull the responses, so we can get an idea of how much the cleaning affects the results, but it leaves me uncertain what, if anything, the survey tells us about the prevalence of various sexual identities among American youth. Such uncertainty can’t be escaped, however, so I can’t really complain.
  
  - Daniel Lakeland on October 13, 2024 at 2:24 pm said:
    
    Imho, as with the alcohol responses I mention elsewhere, the data should be reported with no cleaning at all, and the data should be analyzed with a MODEL that includes all the causes we can imagine of the given responses, including data recording errors, misunderstanding, and mischief.
    
    You inherently need to assign Bayesian probabilities here since we don’t see ANY true data on these questions. The use of this kind of model in reality seems infrequent but needs to be much more common.
    
- SurveyResearcher on October 15, 2024 at 6:16 pm said:
  
  There is a long history of developing scales for detecting this type of stuff in psychology…the K and L scales are the most well known to me. Of course, these are not used on surveys per se, but more so when administering diagnostic tests. See the following for more:
  
  https://www.mmpi-info.com/validity-scales
  
paul alper on October 13, 2024 at 11:11 am said:

The survey data request I most frequently confront is after a visit to a medical care provider. I make it a point never to provide a response for fear that my remarks will be used against me in case there is some subsequent problem/issue/dispute. I am, it should be noted, nowhere more paranoid than the next man, provided the next man is Richard M. Nixon. Nevertheless, is it unreasonable of me to suspect that anonymity no longer exists? We live in a world where everything is recorded, including this innocuous confession on this blog so I deny I ever wrote this.

Raphael Nishimura on October 13, 2024 at 11:31 am said:

There is a PhD dissertation from the Survey and Data Science program at UMich that proposes some statistical methods for this type of identification: https://deepblue.lib.umich.edu/handle/2027.42/153403

paul alper on October 13, 2024 at 11:47 am said:

Andrew mentioned the term, “curbstoning” and gave a definition related to nonexistent sampling. Here is a different definition of curbstoning:
https://www.autolist.com/guides/curbstoning

Curbstoning is an illegal scheme in which people draw car shoppers to places such as the side of the road (curbside) or a vacant lot and sell them unfit used cars. A curbstoner poses as a vehicle’s owner to avoid both city and state permits or licensing requirements.

- Jamie on October 14, 2024 at 1:46 am said:
  
  That is where the survey term originates from
  
Mathias Berggren on October 13, 2024 at 11:57 am said:

Here’s a couple of articles on careless responses in surveys that I have referenced. They discuss how such responses may manifest, and include some suggestions for how to identify (certain types) of such responses:

https://doi.org/10.1016/j.jesp.2015.07.006
https://doi.org/10.1037/a0028085

Andy W on October 13, 2024 at 12:00 pm said:

Not helpful after the fact, but the Monitoring the Future survey had fake drug questions to catch teenagers that just said they take all the drugs all the time.

I have an example of finding near-duplicate responses, https://github.com/apwheele/Blog_Code/blob/master/Python/SurveyMatch/DupsSurvey.ipynb, which is for when you are concerned that people retake the survey (or that the survey takers themselves are fudging data). The example I use does happen to pull up clusters of people who straight-line their answers, though.

Abner on October 13, 2024 at 12:45 pm said:

In the past I’ve visualized the rarity of an observation using “data depth”, which I often think of as the multivariate version of a quantile (see here: https://link.springer.com/chapter/10.1007/978-1-4613-0045-8_4). From the document I’m linking to:

“A data depth is a function that indicates, in some sense, how deep a point is located with respect to a given data cloud (or to a given probability distribution) in d-space. The depth defines a center of the cloud, that is the set of deepest points, and measures how far away a point is located from the center. Various notions of data depth can be employed in procedures of multivariate data analysis, such as cluster analysis and the detection of outlying data”.

I don’t know how well the notion of data depth works when applied to discrete or categorical responses. Still, I think it would be helpful to start looking for participants whose responses are consistently sloppy.
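
As a rough illustration of the idea, here is a Python sketch of one particular depth function, the Mahalanobis depth 1/(1 + d²), where d is the Mahalanobis distance from the sample mean, computed for two numeric items. The data and the function name `mahalanobis_depths` are hypothetical, other notions of depth (halfspace depth, for example) behave differently, and the sketch sidesteps the question above about discrete or categorical responses by treating the answers as numbers.

```python
from statistics import mean


def mahalanobis_depths(points):
    """Mahalanobis depth 1 / (1 + d^2) of each 2-D point relative to the sample."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    depths = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        depths.append(1.0 / (1.0 + d2))
    return depths


# Hypothetical answers to two numeric items, one pair per respondent.
points = [(7, 8), (6, 9), (8, 7), (7, 9), (6, 8), (8, 8), (2, 16)]

depths = mahalanobis_depths(points)
least_deep = min(range(len(points)), key=lambda i: depths[i])
print(points[least_deep])  # (2, 16)
```

The mean and covariance here are estimated from all the points, including the suspicious one, so a cluster of bad responses could pull the center toward itself; a robust estimate of location and scatter would be the natural refinement.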

Jesse O. on October 13, 2024 at 2:04 pm said:

As for tidyverse-style packages, there’s `srvyr`, which extends `survey` by implementing some of the dplyr verbs. But neither package seems to include functions for detecting or excluding suspicious responses. It may be that there are no such procedures which are standard across disciplines.

Marky Mark on October 13, 2024 at 7:02 pm said:

The late Bill Winkler ran the Census Bureau’s record linkage group for years. His papers on model based, fast data cleaning are among the best resources on this topic.

https://en.wikipedia.org/wiki/William_E._Winkler

https://scholar.google.com/citations?hl=en&user=KpIWfmcAAAAJ&view_op=list_works&sortby=pubdate

- shira on October 14, 2024 at 2:08 pm said:
  
  thanks ! which of his papers focuses on detecting bad responses ?
  
Athanassios Protopapas on October 14, 2024 at 2:55 am said:

My colleague Esther Ulitzsch develops sophisticated models of aberrant/careless responding to prune those out prior to further analysis.
https://www.uv.uio.no/cemo/english/people/aca/estheru/
https://www.researchgate.net/profile/Esther-Ulitzsch/research
https://www.psychometricsociety.org/invited-speaker/esther-ulitzsch-university-oslo

Marcelo Rinesi on October 14, 2024 at 3:24 am said:

I would be surprised if effective bad response flagging practices don’t depend strongly on the type of survey, questions, population, etc.; after all, sloppiness, dishonesty, etc, are psychological/sociological/political phenomena, so statistical practices that explicitly encode domain-specific experience and intuition from those fields as applicable to the particular survey — in a way, treating this as part of the data generation process rather than a technical artifact — might be more effective than more universal approaches.

Allen Schmaltz on October 14, 2024 at 5:10 am said:

The search for outliers in free-form text is made more efficient with large language models (LLMs) via their remarkable ability to quantify uncertainty over high-dimensional inputs with minimal distributional assumptions.

My general, first-pass recommendation for the problem at hand would be to model it as binary classification after labeling some examples: E.g., class 0 (“bad” responses) and class 1 (“good” responses).

Next, use Reexpress (either the currently available visual, no-code macOS application, or alternatively, the Python code that will be released this fall) to search for outliers: For class 1, look at the documents with high epistemic uncertainty (i.e., those that have a looser connection to the class 1 labeled data) AND for class 0, look at the documents with low epistemic uncertainty (i.e., those that have a stronger connection to the class 0 labeled data). As you find outliers, label them, and repeat the process.

What if you need to bootstrap starting from no observed examples of class 0 — or more realistically, too few examples of class 0? One possibility is to generate synthetic examples from an LLM. Alternatively, use some unrelated corpora to augment the examples of class 0, such as scientific articles or Shakespeare plays. The key is to create a contrast to the in-distribution data of class 1.


Statistical Modeling, Causal Inference, and Social Science
