Should we continue not to trust the Turk? Another reminder of the importance of measurement

From 2013: Don’t trust the Turk

From 2017 (link from Kevin Lewis), from Jesse Chandler and Gabriele Paolacci:

The Internet has enabled recruitment of large samples with specific characteristics. However, when researchers rely on participant self-report to determine eligibility, data quality depends on participant honesty. Across four studies on Amazon Mechanical Turk, we show that a substantial number of participants misrepresent theoretically relevant characteristics . . .

For some purposes you can learn a lot from these online samples, but it depends on context. Measurement is important, and it is underrated in statistics.

The trouble is that if you’re cruising along treating “p less than .05” as your criterion of success, then quality of measurement barely matters at all! Gather your data, grab your stars, get published, give your TED talk, and sell your purported expertise to the world. Statistics textbooks have lots about how to analyze your data, a little bit on random sampling and randomized experimentation, and next to nothing on gathering data with high reliability and validity.

13 thoughts on “Should we continue not to trust the Turk? Another reminder of the importance of measurement”

  1. Probably because detecting these problems requires empirical observation and common sense. They can’t be derived mathematically from first principles.

  2. Come on Andrew, the article is important for sure, but you left out the important conditions under which this lying happens — namely, when researchers use eligibility requirements to limit participation to a rare population and the respondents are able to deduce those requirements. Here is the rest of the sentence you cut off:

    “…to meet eligibility criteria explicit in the studies, inferred by a previous exclusion from the study or inferred in previous experiences with similar studies.”

    The authors also provide ways to reduce lying in MTurk surveys. So the takeaway implication of their findings is not to distrust MTurk, but to learn how to use it correctly. Of all people, I would have expected that, if you were going to blog about MTurk research, you would have taken a more balanced approach, discussing both the strengths and weaknesses of the data source. But so far I’ve only seen negative posts. Why not discuss an article like the following?

    Mullinix, Kevin J., Thomas J. Leeper, James N. Druckman, and Jeremy Freese. 2015. The generalizability of survey experiments. Journal of Experimental Political Science 2:109-38.

    • Justin:

      You ask, “Why not discuss an article like the following?” The quick answer is that nobody sent it to me! I have not read comprehensively within this subfield. Here’s the abstract of the paper you mention:

      Survey experiments have become a central methodology across the social sciences. Researchers can combine experiments’ causal power with the generalizability of population-based samples. Yet, due to the expense of population-based samples, much research relies on convenience samples (e.g. students, online opt-in samples). The emergence of affordable, but non-representative online samples has reinvigorated debates about the external validity of experiments. We conduct two studies of how experimental treatment effects obtained from convenience samples compare to effects produced by population samples. In Study 1, we compare effect estimates from four different types of convenience samples and a population-based sample. In Study 2, we analyze treatment effects obtained from 20 experiments implemented on a population-based sample and Amazon’s Mechanical Turk (MTurk). The results reveal considerable similarity between many treatment effects obtained from convenience and nationally representative population-based samples. While the results thus bolster confidence in the utility of convenience samples, we conclude with guidance for the use of a multitude of samples for advancing scientific knowledge.

      This seems reasonable to me. We can build trust in survey experiments by replication in different settings.

        • Here is a less vague statement from the article, Martha, and they even cite Andrew:

          “In sum, 29 (or 80.6%) of the 36 treatment effects in Figures 2 and 3 estimated from TESS are replicated by MTurk in the interpretation of the statistical significance and direction of treatment effects. Importantly, of the 7 experiments for which there is a significant effect in one sample, but a null result in the other, only one (Experiment 20) actually produced a significantly different effect size estimate (Gelman and Stern 2006). Across all tests, in no instance did the two samples produce significantly distinguishable effects in substantively opposite directions.”

        • I haven’t been able to get a copy of the Chandler and Paolacci paper yet, so I can’t check this out, but the sentence you quote seems pretty weird: it reads as if the Gelman and Stern 2006 reference is for Experiment 20, which seems highly unlikely. What seems more plausible is that the citation is really a reference to the phrase “significantly different effect size estimate,” but that phrase is not original to Gelman and Stern! So I’m guessing that the reference is to the subject of the Gelman-Stern paper: that “changes in statistical significance are often not themselves statistically significant,” which G-S clarify as “even large changes in significance levels can correspond to small, nonsignificant changes in the underlying quantities.”
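To make the Gelman-Stern point in the comment above concrete, here is a minimal sketch. The two estimates and standard errors are made-up numbers, not taken from the Mullinix et al. paper, and the calculation assumes independent estimates and a normal approximation.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_sided_p(estimate, se):
    """Two-sided p-value for H0: true effect = 0 (normal approximation)."""
    z = abs(estimate / se)
    return 2.0 * (1.0 - normal_cdf(z))


def difference_test(est_a, se_a, est_b, se_b):
    """Test whether two independent estimates differ from each other."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return diff, se_diff, two_sided_p(diff, se_diff)


# Hypothetical estimates: sample A gives 25 (s.e. 10), sample B gives 10 (s.e. 10).
p_a = two_sided_p(25.0, 10.0)   # "significant" at the .05 level
p_b = two_sided_p(10.0, 10.0)   # "not significant"
diff, se_diff, p_diff = difference_test(25.0, 10.0, 10.0, 10.0)

print(f"p_a = {p_a:.3f}, p_b = {p_b:.3f}")
print(f"difference = {diff:.1f} (s.e. {se_diff:.1f}), p = {p_diff:.3f}")
```

One sample clears the .05 threshold and the other does not, yet the difference between them is well within its own standard error. That is why counting "significant in one sample, null in the other" is a weak way to compare samples.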

        • I don’t think it is enough to show that effect sizes are not significantly different, because this can happen if the TESS and/or MTurk effects have large confidence intervals (which appears to be the case for at least some of the effects shown in figures 2 and 3). An equivalence test and reporting of standardized biases would be more informative.
          More generally, how much one can learn from the presence, absence, or size of selection bias for the treatments investigated in this paper depends on how representative those treatments are of the type of treatment one is interested in.
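The equivalence test suggested in the comment above can be sketched as two one-sided tests (TOST). This is an illustration under stated assumptions, not the analysis from the paper: the estimates, standard errors, and the equivalence margin of 0.2 are all hypothetical, and the estimates are treated as independent and approximately normal.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def difference_p(est_a, se_a, est_b, se_b):
    """Two-sided p-value for H0: the two effects are equal."""
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = abs(est_a - est_b) / se_diff
    return 2.0 * (1.0 - normal_cdf(z))


def tost_p(est_a, se_a, est_b, se_b, margin):
    """TOST p-value for H0: |true difference| >= margin.

    A small value supports equivalence within +/- margin.
    """
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    p_lower = 1.0 - normal_cdf((diff + margin) / se_diff)  # H0: diff <= -margin
    p_upper = normal_cdf((diff - margin) / se_diff)        # H0: diff >= +margin
    return max(p_lower, p_upper)


MARGIN = 0.2  # hypothetical: the largest difference we would call negligible

# Noisy pair: wide intervals, so "not significantly different" is cheap.
noisy_diff_p = difference_p(0.30, 0.20, 0.10, 0.20)
noisy_tost_p = tost_p(0.30, 0.20, 0.10, 0.20, MARGIN)

# Precise pair: narrow intervals, so equivalence can actually be shown.
precise_diff_p = difference_p(0.30, 0.03, 0.28, 0.03)
precise_tost_p = tost_p(0.30, 0.03, 0.28, 0.03, MARGIN)

print(f"noisy:   difference p = {noisy_diff_p:.3f}, TOST p = {noisy_tost_p:.3f}")
print(f"precise: difference p = {precise_diff_p:.3f}, TOST p = {precise_tost_p:.6f}")
```

Both pairs are "not significantly different," but only the precise pair supports a claim of equivalence. The choice of margin drives the result and has to be justified on substantive grounds.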

  3. Another important note about p-values and error terms…

    One could make an argument that p-values have caused the field of machine learning to overtake the field of statistics in terms of predictive modeling. The reason I say this is that too many statisticians are overly concerned with preserving false positive rates, even when that’s totally unrelated to the actual goal of the model. For example, if you head over to Cross Validated and do a quick search for “stepwise AIC”, you will see lots of people saying “Never do this because the p-values cannot be trusted”, even when the question is about a predictive model! Now, stepwise AIC is typically not the best option out there, but if your goal is to reduce predictive error when you have an extremely large number of covariates, it’s really not the worst! The worst is using non-penalized MLE. But because many statisticians are afraid of invalidating p-values, a knee-jerk reaction from many (not all) is to say no to model building.

    Meanwhile, ML researchers traditionally have not cared so much about p-values, and quickly realized that with lots of data and lots of model choices, you *do* want to do model building to reduce predictive error.

    Of course, things aren’t black and white. Plenty of statisticians don’t blindly follow the p-value, even if they are Frequentist in their work. And plenty of machine learning researchers are starting to get interested in testing null hypotheses. But in general, I think the influence of NHST thinking has slowed down the field of statistics in terms of accepting model-building as a necessary step in predictive modeling.

  4. “Statistics textbooks have…next to nothing on gathering data with high reliability and validity.”

    This seems an accurate statement in today’s world where the trend is toward low cost data science and machine learning. Important subdisciplines of statistics such as psychometrics, which explicitly considers issues related to reliability and validity, seem to have been relegated to the dustbin of history. This is true in spite of the many data quality watchdog entities out there such as the Information Quality International group or the (likely short-lived) organizational roles being created for Chief Data Officers.

    While I think your statement may not be as appropriate for RDBMSs (relational database mgmt systems) containing structured data — although even RDBMSs are far from immune wrt data quality concerns — it seems quite appropriate for the massive quantities of unstructured data that are being mined today. There are literally thousands of APIs which can change or be updated at a moment’s notice, rendering obsolete any machine learning algorithm built to extract information from them. Similarly, there are hundreds of thousands (or more) web scraping scripts that are, if anything, even more vulnerable to this type of continuous innovation, evolution and disruption.

  5. You mention that measurement is the most important thing that statistics books don’t cover much. I’m wondering: What is a list of papers (or books, chapters, etc.) that focus on measurement and reliability and other underrepresented things that I should read? What can I read to supplement the stuff that I’ve been taught in books and grad classes?

  6. Data cleaning to ensure that data have ‘high reliability and validity’ is a very, very difficult subject. Most commercial software does ‘profiling’ that tells you what the problems with the data are. If you naively believe that the software can tell you how many duplicates are in the data, or can tell you how to fill in missing data (or do the clean-up) so that edit rules are satisfied (e.g., a child under 16 cannot be married), then you are mistaken.
    The statistical agencies used manual methods for clean-up that evolved into principled methods of record linkage (Fellegi and Sunter, JASA, 1969) and modeling/edit/imputation (Fellegi and Holt, JASA, 1976). Developing methods and software for the clean-up of data is still a very active area of research, with record linkage methods primarily involving computer science (machine learning) and the edit part of edit/imputation primarily involving OR (set covering, integer programming).
    If you have 5% error in your data, how will you know? If you have 5% error in the data, what analyses can you do?
    You may be familiar with some books. Here is one. There are others.
    Herzog, T. N., Scheuren, F., and Winkler, W.E., (2007), Data Quality and Record Linkage Techniques, New York, N. Y.: Springer.

    When I introduced the current software, I taught two short courses at the University of London. Much of the research is by computer scientists.
    I also lectured on some of the methods when I was back at the Isaac Newton Institute in September 2016. We need more research/development by statisticians.
    Xiao-Li Meng, in particular, has pointed out that, if an administrative data source is not sufficiently cleaned up while still covering 96% of a population, then combining its information with a relatively high quality probability sample of data will yield worse results than not using the administrative source. See section 4 of his overview chapter.
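As a small illustration of the kind of edit rule mentioned in the comment above ("a child under 16 cannot be married"), here is a minimal sketch that only flags records failing a rule. The records, field names, and the `failed_edits` helper are hypothetical, and this is not the Fellegi-Holt method: it detects failures but does not decide which field is wrong or how to impute a replacement, which is the hard part.

```python
def failed_edits(record, edit_rules):
    """Return the names of all edit rules that this record violates."""
    return [name for name, is_violation in edit_rules.items() if is_violation(record)]


# Each rule maps a name to a function returning True when the record FAILS the edit.
EDIT_RULES = {
    "married_under_16": lambda r: (
        r["age"] is not None
        and r["age"] < 16
        and r["marital_status"] == "married"
    ),
}

# Hypothetical records for illustration only.
records = [
    {"id": 1, "age": 34, "marital_status": "married"},
    {"id": 2, "age": 12, "marital_status": "married"},   # fails the edit
    {"id": 3, "age": 15, "marital_status": "single"},
    {"id": 4, "age": None, "marital_status": "married"},  # missing age: not flagged here
]

flagged = {}
for rec in records:
    failures = failed_edits(rec, EDIT_RULES)
    if failures:
        flagged[rec["id"]] = failures

print(flagged)
```

Record 2 is flagged, but the check cannot say whether the age or the marital status is the erroneous field, and record 4 slips through because its age is missing. Those are exactly the cases where profiling alone falls short.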
