Hey! Participants in survey experiments aren’t paying attention.

Gaurav Sood writes:

Do survey respondents account for the hypothesis that they think people fielding the survey have when they respond? The answer, according to Mummolo and Peterson, is not much.

Their paper also very likely provides the reason why—people don’t pay much attention. Figure 3 provides data on manipulation checks—the proportion correctly guessing the hypothesis being tested. The change in proportion between control and treatment ranges from -.05 to .25, with the bulk of the differences in the Qualtrics sample between 0 and .1. (In one condition, the authors even offer an additional 25 cents to give an answer consistent with the hypothesis. And presumably, people need to know the hypothesis before they can answer in line with it.) The faint increase is especially noteworthy given that, on average, the proportion of people in the control group who guess the hypothesis correctly—without the guessing correction—is between .25 and .35 (see Appendix B).

So, the big thing we may have learned from the data is how little attention survey respondents pay. The numbers obtained here are in a similar vein to those in Appendix D of Jonathan Woon’s paper. The point is humbling and suggests that we need to: (a) invest more in measurement, and (b) have yet larger samples, which is an expensive way to overcome measurement error.

(The two fixes are points you have made before. I claim no credit. I wrote this because I don’t think I had fully grasped how much noise there is in online surveys. And I think it is likely useful to explore the consequences carefully.)

P.S. I think my MTurk paper gives one potential explanation for why things are so noisy—it’s not easy to judge quality on surveys:

Here’s one relevant datum from the Turk paper: we got our estimates after recruiting workers “with a HIT completion rate of at least 95%.” This latter point also relates to a recent paper on “reputation inflation” online.

So, if I’m understanding this correctly, Mummolo and Peterson are saying we don’t have to worry about demand effects in psychology experiments, but Sood is saying that this is just because the participants in the experiments aren’t really paying attention!

I wonder what this implies about my own research. Nothing good, I suppose.

P.S. Sood adds three things:

1. The numbers on compliance in M/P aren’t adjusted for guessing—some people doubtless just guessed the right answer. (We can back it out from the proportion incorrect after taking out people who mark “don’t know”; a back-of-the-envelope version of that correction is sketched below.)

2. This is how I [Sood] understand things: Experiments tell us the average treatment effect of what we manipulate. And the role of manipulation checks is to shed light on compliance.

If conveying experimenter demand clearly and loudly is a goal, then the experiments included probably failed. If the purpose was to know whether clear but not very loud cues about “demand” matter—and for what it’s worth, I think it is a very reasonable goal; pushing further, in my mind, would have reduced the experiment to a tautology—the paper provides the answer. (But your reading is correct.)

3. The key point that I took from the experiment, Woon, etc. was still just about how little attention people pay on online surveys. And compliance estimates in M/P tell us something about the amount of attention people pay because compliance in their case = reading something—simple and brief—that they quiz you on later.
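For point 1, here is a minimal sketch of that back-of-the-envelope correction. The proportions and the number of answer options are hypothetical, not taken from Mummolo and Peterson; the only assumption is that non-knowers who answer at all guess uniformly across the options.

```python
# Guessing correction under the assumption above (illustrative numbers only).
# A random guesser is correct with probability 1/k, so each observed
# incorrect answer implies 1/(k-1) correct answers that were also just
# lucky guesses.

def corrected_knowers(p_correct, p_incorrect, k):
    """Estimate the share of respondents who actually knew the hypothesis.

    p_correct, p_incorrect: proportions of all respondents choosing a
        correct / incorrect option (the remainder marked "don't know").
    k: number of substantive answer options.
    """
    return p_correct - p_incorrect / (k - 1)

# Hypothetical example: 30% correct, 20% incorrect, 4 answer options.
print(corrected_knowers(0.30, 0.20, 4))  # ~0.23, i.e., fewer "true" knowers
```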
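Point 3 also suggests a quick way to see why inattention is so expensive, per the larger-samples remedy above. This is back-of-the-envelope arithmetic of my own, not a calculation from the paper: if only a fraction of respondents actually read the treatment text, the intent-to-treat effect is roughly the effect on attentive respondents scaled by that fraction, and the sample needed to detect it grows with the square of its inverse.

```python
# Back-of-the-envelope arithmetic (mine, not from the paper): with one-sided
# noncompliance, ITT ~= attention_rate * effect_on_attentive_respondents,
# and statistical power depends on the squared effect, so the required
# sample size scales like 1 / attention_rate**2.

def sample_inflation(attention_rate):
    """Factor by which the required n grows, relative to a study in which
    every respondent actually reads the treatment."""
    return 1.0 / attention_rate ** 2

for rate in (1.0, 0.5, 0.3, 0.1):
    print(f"attention rate {rate:.0%}: need ~{sample_inflation(rate):.0f}x the sample")
```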

Tomorrow’s post: To do: Construct a build-your-own-relevant-statistics-class kit.

19 thoughts on “Hey! Participants in survey experiments aren’t paying attention.”

  1. OTOH, if the participant in a survey doesn’t understand the question, you probably can’t put much value on the response. So you want them to 1) understand the question, but 2) not pick up on the experimenter’s underlying hypothesis. That sounds like a tall order, unless the survey question is extremely well worded.

  2. This study looks interesting in its own right, but I think there is a more important issue than detecting (or eliminating) biases due to people guessing the researcher’s intent. I’ve always been skeptical of surveys in that the designers of surveys take much more care in their creation than I think respondents take in their answers. After all, there is little or no incentive to participate – and, when there is an incentive (like paying participants), there is little or no incentive to tell the truth. Sometimes there are concerns that people will intentionally deceive (such as a marketing survey for a new product that attempts to determine someone’s willingness to pay). But I think the bigger problem is that people often (usually?) don’t know their answers and have little reason to think carefully enough to give meaningful or reliable responses.

    To return to the study cited here, I would think that an experiment that asks people to guess the researcher’s intentions might elicit more careful responses than the normal survey – it’s an intellectual game/challenge that might appeal to some people. However, it appears that people don’t do very well at that – and, if the reason is that they don’t pay much attention, then that just underscores my concern that surveys don’t produce reliable responses – primarily because there is little reason for respondents to think about them carefully.

    I’ve always had this reaction when I see “carefully” designed survey instruments – the implicit assumption is that finely tuned questions (e.g., which of the following (A, B, C, …) is your primary reason for doing X, Y, Z, …?) are actually picking up meaningful granularity in the responses. I suspect the only thing that can reliably be picked up is general emotive responses (System 1 responses). Hence, there is little reason to design carefully nuanced questions unless the desire is to influence the responses.

    The one design that I think forces people to think carefully is when the order of responses (such as strongly agree, agree, etc.) switches in ways that force people to read the questions carefully. Then, if they consistently check one or another extreme, we can surmise that they are not thinking carefully. But I suspect that the response rate for such complex questions will be even lower than the usual low level obtained.

  3. Another way of forcing/validating survey responses is to ask the same question in subtly different ways, or perhaps the negated version of the question, with the order of responses shuffled (ideally randomised). Responders who are providing incoherent responses are, quite possibly, not paying their full attention (to put it charitably).

    • Yes, I thought about those issues – changing the order of choices is an often-used and sensible way to discover some obvious biases. And, there is the clear application to ordering of candidates on ballots, which I know has been studied. What I’ve not seen is evidence regarding how extensive various biases are in survey responses – ordering, reversing scales, incompatible responses, etc. It would be interesting to know, not just the overall response rate, but what percent of responses are deemed “reliable.” Better yet, a probabilistic assessment of the reliability of responses, which could then be used as weights in the analysis of the survey results. Does anyone know of good research along these lines?
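    A rough sketch of the reliability-weighting idea above, with made-up data and a deliberately crude scoring rule (the consistency_weights function and its scale are an illustration of mine, not a method from any cited paper):

    ```python
    # Illustration only: per-respondent consistency weights from one reversed
    # item pair. A coherent respondent's answers to an item and its negation
    # should roughly mirror each other on a 1..scale_max Likert scale.

    import numpy as np

    def consistency_weights(item, reversed_item, scale_max=5):
        item = np.asarray(item, dtype=float)
        reversed_item = np.asarray(reversed_item, dtype=float)
        # Perfect mirroring means item + reversed_item == scale_max + 1.
        discrepancy = np.abs(item + reversed_item - (scale_max + 1))
        # Map discrepancy 0 .. scale_max-1 onto weights in (0, 1].
        return 1.0 - discrepancy / scale_max

    # Hypothetical responses: the third respondent "strongly agrees" with both
    # a statement and its negation, so they get the smallest weight.
    print(consistency_weights([5, 4, 5, 2], [1, 2, 5, 4]))  # ~[1.0, 1.0, 0.2, 1.0]
    ```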

    • Just an anecdote here. When I lived in Northern Virginia we were often called by pollsters asking about “the President’s handling of the economy.” I know the President doesn’t handle the economy. When I liked the President, I would say he was doing great. If they kept asking the same question in different ways, I would get cynical about the premises and start changing my answers to No Opinion.

  4. Here’s a direct study of the magnitude of experimenter demand effects (which relies on some assumptions regarding defiers): https://www.aeaweb.org/articles?id=10.1257/aer.20171330. Here’s the abstract:

    We propose a technique for assessing robustness to demand effects of findings from experiments and surveys. The core idea is that by deliberately inducing demand in a structured way we can bound its influence. We present a model in which participants respond to their beliefs about the researcher’s objectives. Bounds are obtained by manipulating those beliefs with “demand treatments.” We apply the method to 11 classic tasks, and estimate bounds averaging 0.13 standard deviations, suggesting that typical demand effects are probably modest. We also show how to compute demand-robust treatment effects and how to structurally estimate the model.

    Regarding incentives for truth-telling: that’s why experimental economists doggedly insist on incentive-compatible elicitation. Not without its own issues, and hard to do for some types of elicitation (such as political attitudes), but when applicable it’s definitely worth a thought!
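    Returning to the bounding idea in the abstract, here is a stripped-down illustration of how such bounds might be computed. This is a paraphrase of the logic under a monotonicity assumption, not the authors’ code, and the data are simulated:

    ```python
    # Stripped-down illustration (not the authors' code): if explicit "we expect
    # you to choose more" / "choose less" cues move behavior monotonically, the
    # two demand-treatment means bracket the demand-free mean, and the gap
    # between them, in standard-deviation units, bounds the demand effect.

    import numpy as np

    def demand_bounds(y_plus, y_minus):
        """y_plus / y_minus: outcomes under positive / negative demand treatments.
        Returns the bracketing interval and its width in pooled-SD units."""
        y_plus, y_minus = np.asarray(y_plus, float), np.asarray(y_minus, float)
        lower, upper = sorted([y_minus.mean(), y_plus.mean()])
        pooled_sd = np.sqrt((y_plus.var(ddof=1) + y_minus.var(ddof=1)) / 2)
        return (lower, upper), (upper - lower) / pooled_sd

    # Hypothetical data: some continuous choice under the two opposing cues.
    rng = np.random.default_rng(0)
    interval, width_sd = demand_bounds(rng.normal(5.2, 1, 200), rng.normal(5.0, 1, 200))
    print(interval, f"width = {width_sd:.2f} SD")
    ```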

    • Incentive compatibility can ensure (somewhat) that people won’t lie, but that is not quite the same thing as ensuring they will tell the truth – when determining the truth takes effort on their part. Survey respondents often don’t really know their answers to complex questions, so they have little incentive to invest the effort to find out. Since there is no way to determine whether they have told the truth, it is hard to conceive of an experiment that could work in such cases.

      I don’t think this is an unusual case. Much market research is designed to elicit participants’ preferences about novel circumstances – where they may not know which color, functionality, price, etc. they prefer. You can eliminate many incentives to lie, but how can you provide an incentive for them to figure out what the truth is (and then tell it to you)?

      I suspect that political views are not as problematic. Simple surveys like which candidate I intend to vote for, whether I intend to vote, how important I think issue X is, etc. probably don’t entail much effort on my part to know how I feel. Then, eliminating incentives to lie (like the poll respondents that did not want to admit they would vote for Trump in the last election) may be feasible. But if the poll questions become more complex, then it starts looking to me like the market research questions.

  5. “In one condition, authors even offer an additional 25 cents to give an answer consistent with the hypothesis.”

    Who’s going to do anything for twenty-five cents? I’m sure if you’re in Botswana or Canada (poke!:), US$0.25 is like gold. Not in the US.

    I’ve seen a lot of research where the monetary incentives and/or values are…well…not likely to produce realistic outcomes. Has anyone bothered to figure out what level of monetary incentive is necessary to get realistic interaction from participants? That seems like a fundamental thing to establish. It also seems obvious that the amount of reward that’s necessary for considered interaction would vary with age, job, and income.

    • Yeah, imagine the question is something really complex, like which of several strategies offered by different political thinktanks do you prefer to “fix” the US healthcare system?

      So, for $50,000 or so I’ll spend the 3 to 6 months continuously evaluating each of these strategies doing a bunch of quantitative research, and give you my opinion… Anything short of that is just asking me “which of these thinktanks do I trust more” or “which of these sounds cool” or something equally pointless.

      • Actually I’m thinking back to some books I read several years back, one by Dan Ariely but some others too, where a) college students (no life experience or experience managing money) b) were given money (they didn’t sweat to earn it or get to keep it), and c) it was some puny amount (a dollar). Significant conclusions were then drawn from how they spent it. I mean really?

        If you’re making $100K a year, why would you bother with some silly survey that does nothing for you? OTOH, if you’re making $10/hr and working lots of OT, you might want a big pile of quarters but, since you’re answering questions between OT shifts you’re not likely to consider them carefully.

        • Exactly – the effect of monetary incentives will vary widely across the population. I suppose it would be interesting to do experiments randomizing the amounts and seeing how large the effect is in any given population. The results certainly wouldn’t generalize very well.
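        A back-of-the-envelope version of that experiment, with entirely made-up numbers: randomize the payment amount and see how much an attention-check pass rate moves per extra dollar.

        ```python
        # Entirely hypothetical simulation of the design above: randomize the
        # incentive and regress an attention-check pass indicator on it.

        import numpy as np

        rng = np.random.default_rng(1)
        n = 2000
        payment = rng.choice([0.25, 0.50, 1.00, 2.00], size=n)  # randomized payment ($)
        # Assumed (for illustration only): attention rises modestly with pay.
        pass_prob = 0.30 + 0.05 * payment
        passed_check = rng.binomial(1, pass_prob)

        slope, intercept = np.polyfit(payment, passed_check, 1)
        print(f"~{slope:.3f} increase in pass rate per extra dollar (simulated)")
        ```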

  6. “I’ve seen a lot of research where the monetary incentives and/or values are…well…not likely to produce realistic outcomes. Has anyone bothered to figure out what level of monetary incentive is necessary to get realistic interaction from participants? That seems like a fundamental thing to establish. It also seems obvious that the amount of reward that’s necessary for considered interaction would vary with age, job, and income.”

    Good question.

  7. The questions about the magnitude of the incentives are important. Accordingly, they have been investigated empirically for a long time — though perhaps not as much as they should have been. The largest body of research on stake size concerns choice in social preference experiments. The usual trick is to run the experiment in a low-income country in which stakes on the order of weeks or months of participants’ usual salary are still affordable for US-based researchers. Here are the three most well-known examples:

    https://www.jstor.org/stable/2998575?seq=1#metadata_info_tab_contents
    https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1465-7295.1999.tb01415.x
    https://www.aeaweb.org/articles?id=10.1257/aer.101.7.3427

    There is also work on incentives as far as belief elicitation is concerned. Here’s an example: https://onlinelibrary.wiley.com/doi/full/10.1111/ecoj.12160. Isabel Trevino and Andrew Schotter have a review paper about that in the Annual Review of Economics.

    • Sandro, thanks for the links!

      I like this:
      “In fact, our paper is the first to present evidence that as stakes increase, rejection rates approach zero. ”

      I don’t know the theory but that’s what I would expect over a range of stakes varying from worthless to highly valuable. I suspect that the limits of the variation also depend on how the person comes by the money (earned or given); and, if they earned it, a) what percentage it is of their spending budget; b) what percentage it is of their total earnings; c) what they do for a living (e.g., how much sweat and torture was expended in earning it). We might see the curve shift downward, for example, when people are using their own money; and upward if they have a high income.

      In my mind I’m thinking of the ultimatum game like a binary chemical system, where the relationships can be plotted in a plane (stakes, experience), then the next step would be to explore the other dimensions of the system (income; “hardness” to earn; etc), and see if one can build out a reliable relationship between all these variables…
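      A sketch of what building out that relationship might look like, with hypothetical variables and simulated data (the coefficient signs below are arbitrary, chosen only for illustration):

      ```python
      # Hypothetical sketch: ultimatum-game rejection as a function of stake size,
      # whether the endowment was earned, and the stake relative to income. All
      # data are simulated; the coefficient signs are arbitrary illustrations.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(2)
      n = 1000
      stake = rng.uniform(1, 500, n)                    # offer size in dollars
      earned = rng.integers(0, 2, n).astype(float)      # 1 = proposer earned the pot
      stake_share = stake / rng.uniform(2e3, 8e3, n)    # stake as share of monthly income
      logit = -0.5 - 0.01 * stake + 0.3 * earned + 2.0 * stake_share
      reject = rng.binomial(1, 1 / (1 + np.exp(-logit)))

      X = np.column_stack([stake, earned, stake_share])
      fit = LogisticRegression().fit(X, reject)
      print(dict(zip(["stake", "earned", "stake_share"], fit.coef_[0].round(3))))
      ```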

      • The social preferences literature is vast. Your suggestion about earned income vs. windfall gains has been answered in multiple experiments (and yes, people are more likely to redistribute windfalls), as have been many other dimensions. There’s also a large number of theories. Here’s a somewhat recent review, from the Handbook of Experimental Economics: https://www.asc.ohio-state.edu/kagel.4/HEE-Vol2/Other_regarding_all_11_14.pdf

        • Sandro, thanks again for the link, very interesting!

          “Your suggestion about earned income vs. windfall gains has been answered in multiple experiments as have been many other dimensions.”

          That will be interesting to assess! “has been answered” is pretty strong for anything in the social sciences. :)

          (my first reply went into hyperspace, I presume it will emerge later for a double response)
