Update on keeping Mechanical Turk responses trustworthy

This topic has come up before . . . Now there’s a new paper by Douglas Ahler, Carolyn Roush, and Gaurav Sood, who write:

Amazon’s Mechanical Turk has rejuvenated the social sciences, dramatically reducing the cost and inconvenience of collecting original data. Recently, however, researchers have raised concerns about the presence of “non-respondents” (bots) or non-serious respondents on the platform. Spurred by these concerns, we fielded an original survey on MTurk to measure response quality. While we find no evidence of a “bot epidemic,” we do find that a significant portion of survey respondents engaged in suspicious behavior. About 20% of respondents either circumvented location requirements or took the survey multiple times. In addition, at least 5-7% of participants likely engaged in “trolling” or satisficing. Altogether, we find about a quarter of data collected on MTurk is potentially untrustworthy. Expectedly, we find response quality impacts experimental treatments. On average, low quality responses attenuate treatment effects by approximately 9%. We conclude by providing recommendations for collecting data on MTurk.

And here are the promised recommendations:

• Use geolocation filters on survey platforms like Qualtrics to enforce any geographic restrictions.

• Make use of tools on survey platforms to retrieve IP addresses. Run each IP through Know Your IP to identify blacklisted IPs and multiple responses originating from the same IP (see the screening sketch after this list).

• Include questions to detect trolling and satisficing, but do not copy and paste them from a standard canon, as that makes “gaming the survey” easier.

• Increase the time between Human Intelligence Task (HIT) completion and auto-approval so that you can assess your data for untrustworthy responses before approving or rejecting the HIT.

• Rather than withhold payments, a better policy may be to incentivize workers by giving them a bonus when their responses pass quality filters.

• Be mindful of compensation rates. While unusually stingy wages will lead to slow data collection times and potentially less effort by Workers, unusually high wages may give rise to adverse selection—especially because HITs are shared on Turkopticon, etc. soon after posting. . . Social scientists who conduct research on MTurk should stay apprised of the current “fair wage” on MTurk and adhere to it.

• Use Worker qualifications on MTurk and filter your sample to include only Workers who have a high percentage of approved HITs (see the API sketch after this list).
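
For the geolocation and duplicate-IP recommendations above, the post-hoc screening can be a few lines over the exported response file. Here is a minimal sketch, assuming a Qualtrics-style CSV export; the column names and the speed threshold are illustrative, not something the paper prescribes.

```python
# Flag potentially untrustworthy responses in an exported survey file.
# Column names below are placeholders; adjust them to your platform's export.
import pandas as pd

df = pd.read_csv("survey_export.csv")

# Multiple completions originating from the same IP address.
df["duplicate_ip"] = df.duplicated(subset="IPAddress", keep=False)

# Responses whose recorded geolocation falls outside the intended country.
df["outside_us"] = df["LocationCountry"].ne("US")

# Implausibly fast completions (the threshold is a judgment call).
df["too_fast"] = df["Duration_seconds"] < 120

suspicious = df[df[["duplicate_ip", "outside_us", "too_fast"]].any(axis=1)]
print(f"{len(suspicious)} of {len(df)} responses flagged for review")
```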

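Several of the other recommendations (the longer auto-approval window, bonuses instead of rejections, and Worker qualifications) map onto parameters of the MTurk requester API. Below is a rough sketch using boto3's MTurk client, assuming an externally hosted survey; the title, reward, URL, thresholds, and Worker/assignment IDs are all placeholders, not values from the paper.

```python
# Sketch: configure a HIT so that the screening above can happen before approval.
# Assumes AWS credentials are configured for boto3; all specifics are placeholders.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Restrict the HIT with MTurk's built-in system qualifications:
# approval rate (PercentAssignmentsApproved) and Worker locale.
qualifications = [
    {
        "QualificationTypeId": "000000000000000000L0",  # PercentAssignmentsApproved
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [97],
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    },
    {
        "QualificationTypeId": "00000000000000000071",  # Worker locale
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    },
]

external_question = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/survey-placeholder</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

mturk.create_hit(
    Title="Short opinion survey (placeholder)",
    Description="A roughly 10-minute survey for academic research.",
    Keywords="survey, research",
    Reward="1.50",                              # keep an eye on the going "fair wage"
    MaxAssignments=500,
    AssignmentDurationInSeconds=30 * 60,
    LifetimeInSeconds=7 * 24 * 3600,
    AutoApprovalDelayInSeconds=5 * 24 * 3600,   # extra time to screen before auto-approval
    QualificationRequirements=qualifications,
    Question=external_question,
)

# Rather than rejecting borderline work, pay a bonus when a response passes
# the quality filters (Worker and assignment IDs here are made up).
mturk.send_bonus(
    WorkerId="A1EXAMPLEWORKERID",
    AssignmentId="3EXAMPLEASSIGNMENTID",
    BonusAmount="0.50",
    Reason="Response passed post-hoc quality checks.",
)
```

The API settings do not replace the screening itself; the longer auto-approval delay just buys time to run it, and the bonus keeps the incentive structure on the side of careful responses.
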
They also say they do not think that the problem is limited to MTurk.

I haven’t tried to evaluate all these claims myself, but I thought I’d share it all with those of you who are using this tool in your research. (Or maybe some of you are MTurk bots; who knows what will be the effect of posting this material here.)

P.S. Sood adds:

From my end, “random” error is mostly a non-issue in this context. People don’t use M-Turk to produce generalizable estimates—hardly anyone post-stratifies, for instance. Most people use it to say they did something. I suppose it is a good way to ‘fail fast.’ (The downside is that most failures probably don’t see the light of day.) And if people wanted to buy stat. sig., bulking up on n is easily and cheaply done — it is the raison d’être of MTurk in some ways.

So what is the point of the article? Twofold, perhaps. First is that it is good to parcel out measurement error where we can. And the second point is about how we build a system where the long-term prognosis is not simply noise. And what stood out for me from the data was just the sheer scale of plausibly cheeky behavior. I did not anticipate that.

6 thoughts on “Update on keeping Mechanical Turk responses trustworthy”

  1. “Use Worker qualifications on MTurk and filter to include only Workers who have a high percentage of approved HITs into your sample”

    Counterpoint:

    https://psyarxiv.com/jq589

    “Here, we show that concerns about non-naivete on MTurk are due less to the MTurk platform itself and more to the way researchers use the platform. Specifically, we find there are at least 250,000 MTurk workers worldwide and that a large majority of US workers are new to the platform each year and therefore relatively inexperienced as research participants. We describe how inexperienced workers are excluded from studies, in part, because of the worker reputation qualifications researchers commonly use. Finally, we propose and evaluate an alternative approach to sampling on MTurk that allows researchers to access inexperienced participants without sacrificing data quality.”

  2. Not saying this is actually possible, but it would be great if we could develop weights for the responses of different types of respondents–0 for bots or random responders, obviously, but not necessarily for the trolls, satisficers, multi-completers, low-reputation qualifiers, geolocation circumventers, and otherwise untrustworthy groups–instead of excluding them. After all, their intentional responses technically contain information about their respective populations that might inform the study or some other meta-study.

    • I’m not much into the details of the above paper, but could this help? “Learning From Crowds” by Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy, Journal of Machine Learning Research 11:1297–1322, 2010.

  3. People don’t use M-Turk to produce generalizable estimates—

    They do when they use it to develop training data for machine learning. Or at least that’s the goal. But the generalization is more at getting to some underlying truth, not learning about the population of respondents per se.

    Rather than all the preprocessing (some of which we did), Becky and I used a noisy measurement model to adjust for Turker inaccuracy and bias. So you learn a bit about the coders/raters/annotators along the way.

    —hardly anyone post-stratifies, for instance.

    Raykar et al.’s paper shows how to jointly train a logistic regression model of prevalence that could be post-stratified, but is typically used for item-level prediction (it’s machine learning!).
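
    To make the “noisy measurement model” idea concrete, here is a stripped-down sketch of the classic Dawid–Skene-style approach for binary labels, fit with EM. It is not the Raykar et al. joint model (which also learns a logistic regression on item features) and not the exact model referred to above; the function and variable names are made up for the illustration.

    ```python
    # Minimal Dawid-Skene-style noisy-rater model for binary labels, fit by EM.
    # Purely illustrative; not tied to any particular paper's implementation.
    import numpy as np

    def dawid_skene_binary(labels, n_iter=50):
        """labels: (n_items, n_raters) array of 0/1 ratings, with -1 for 'not rated'."""
        n_items, n_raters = labels.shape
        rated = labels >= 0

        # Initialize the posterior P(true label = 1) with each item's mean vote.
        post = np.where(rated, labels, 0).sum(1) / np.maximum(rated.sum(1), 1)

        for _ in range(n_iter):
            # M-step: prevalence, plus each rater's sensitivity and specificity.
            pi = post.mean()
            sens = np.empty(n_raters)
            spec = np.empty(n_raters)
            for j in range(n_raters):
                obs = rated[:, j]
                y = labels[obs, j]
                p = post[obs]
                sens[j] = (p * y).sum() / max(p.sum(), 1e-12)                    # P(rates 1 | true 1)
                spec[j] = ((1 - p) * (1 - y)).sum() / max((1 - p).sum(), 1e-12)  # P(rates 0 | true 0)

            # E-step: posterior over each item's true label given all its ratings.
            log_p1 = np.full(n_items, np.log(pi + 1e-12))
            log_p0 = np.full(n_items, np.log(1 - pi + 1e-12))
            for j in range(n_raters):
                obs = rated[:, j]
                y = labels[obs, j]
                log_p1[obs] += np.where(y == 1, np.log(sens[j] + 1e-12), np.log(1 - sens[j] + 1e-12))
                log_p0[obs] += np.where(y == 0, np.log(spec[j] + 1e-12), np.log(1 - spec[j] + 1e-12))
            post = 1.0 / (1.0 + np.exp(np.clip(log_p0 - log_p1, -500, 500)))

        return post, sens, spec
    ```

    The per-rater sensitivity and specificity estimates are, in effect, the data-driven “weights” floated in the earlier comment: unreliable raters simply end up contributing less to the inferred labels.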

  4. > “They also say they do not think that the problem is limited to MTurk.”

    I think this is important. Everyone who has ever run an “in-lab” experiment knows that satisficing also happens in the lab.

    I’ve had participants in a lab experiment take 20 seconds to complete a task we gave them 15 minutes for, and then spend the next 15 minutes napping.

    The fault was in our payment design. Although we did not allow them to do anything else during these 15 minutes, many participants preferred to put in little effort and leave with the show-up fee plus whatever they got from randomly completing the task, rather than “taking the bait” and “fighting” for some extra dollars.

    Between the nappers and those who took a little more time but still spent most of the 15 minutes staring into the distance, I wouldn’t be surprised if a quarter of the data we collected was from careless participants who just fulfilled the minimum requirements to get their show-up fee paid.

  5. I ran online studies long before mturk. Advertise the study in a few places (e.g. online groups interested in psychology) and you’ll get participants that aren’t being paid and therefore have little motivation other than the subject matter. I easily got about 500 participants a month that way with very little effort.
