Noise-mining as standard practice in social science

The following example is interesting, not because it is particularly noteworthy but rather because it represents business as usual in much of social science: researchers trying their best, but hopelessly foiled by their use of crude psychological theories and cruder statistics, along with patterns of publication and publicity that motivate the selection and interpretation of patterns in noise.

Elio Campitelli writes:

The silliest study this week?

I realise that it’s a hard competition, but this has to be the silliest study I’ve read this week. Each group of participants read the same exact text with only one word changed and the researchers are “startled” to see that such a minuscule change did not alter the readers’ understanding of the story. From the Guardian article (the paper is yet to be published as I’m sending you this email):

Two years ago, Washington and Lee University professors Chris Gavaler and Dan Johnson published a paper in which they revealed that when readers were given a sci-fi story peopled by aliens and androids and set on a space ship, as opposed to a similar one set in reality, “the science fiction setting triggered poorer overall reading” and appeared to “predispose readers to a less effortful and comprehending mode of reading – or what we might term non-literary reading”.

But after critics suggested that merely changing elements of a mainstream story into sci-fi tropes did not make for a quality story, Gavaler and Johnson decided to revisit the research. This time, 204 participants were given one of two stories to read: both were called “Ada” and were identical apart from one word, to provide the strictest possible control. The “literary” version begins: “My daughter is standing behind the bar, polishing a wine glass against a white cloth.” The science-fiction variant begins: “My robot is standing behind the bar, polishing a wine glass against a white cloth.”

In what Gavaler and Johnson call “a significant departure” from their previous study, readers of both texts scored the same in comprehension, “both accumulatively and when divided into the comprehension subcategories of mind, world, and plot”.

The presence of the word “robot” did not reduce merit evaluation, effort reporting, or objective comprehension scores, they write; in their previous study, these had been reduced by the sci-fi setting. “This difference between studies is presumably a result of differences between our two science-fiction texts,” they say.

Gavaler said he was “pretty startled” by the result.

I mean, I wouldn’t dismiss out of hand the possibility of a one-word change having dramatic consequences (change “republican” to “democrat” in a paragraph describing a proposed policy, for example). But in this case it seems to me that the authors surfed the noise generated by the previous study into expecting a big change by just changing “daughter” to “robot” and nothing else.

I agree. Two things seem to be going on:

1. The researchers seem to have completely internalized the biases arising from the statistical significance filter, which leads to published estimates that are too high (as discussed in section 2.1 of this article), so they came into this new experiment expecting to see a huge and statistically significant effect (recall the 80% power lie); see the simulation sketch below.

2. Then they do the experiment and are gobsmacked to find nothing (like the 50 shades of gray story, but without the self-awareness).

The funny thing is that items 1 and 2 kinda cancel, and the researchers still end up with positive press!
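
To see how the significance filter inflates expectations, here is a minimal R simulation sketch; the true effect, standard error, and number of simulated studies are made-up numbers for illustration, not anything estimated from these experiments. The idea is to simulate many noisy estimates of a small effect, keep only the ones that clear p < 0.05, and look at what survives.

    # Minimal sketch of the statistical significance filter (made-up numbers).
    set.seed(123)
    true_effect <- 0.1   # small true effect
    se <- 0.2            # standard error of each simulated estimate
    n_sims <- 1e5        # number of simulated studies

    estimate <- rnorm(n_sims, mean = true_effect, sd = se)
    significant <- abs(estimate / se) > 1.96   # two-sided p < 0.05

    mean(significant)                               # power: about 0.08, nowhere near 80%
    mean(estimate[significant])                     # average "published" estimate
    mean(abs(estimate[significant])) / true_effect  # exaggeration ratio (type M error)

With these made-up numbers, only about 8% of the simulated studies come out statistically significant, and the ones that do overestimate the true effect by roughly a factor of five, so an expectation calibrated to past significant results will be far too optimistic.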

P.S. I looked up Chris Gavaler and he has a lot of interesting thoughts. Check out his blog! I feel bad that he got trapped in the vortex of bad statistics, and I don’t want this discussion of statistical fallacies to reflect negatively on his qualitative work.

14 thoughts on “Noise-mining as standard practice in social science”

  1. I couldn’t get access to the paper. However, if they released data (and code), it should be possible to find some significant effect in some direction.

    More generally, it seems like they were on the right track in the sense that they wanted to do a controlled manipulation. That’s a good start. You have to start learning somewhere. If they question their own conclusions, as they should, and push the logic a bit further, they could easily do a phenomenal piece of work. Many of our undergraduates in linguistics start with similar questions that are basically unanswerable. We let them do their studies the way they want to; they’re excited because they care about the problem, and eventually they learn a lot through their experiences. Surprise is a great educator.

    • Shravan:

      I agree. No harm in experimenting; it’s a good way to learn. From a statistical perspective, my point is that we should not be fooled by chance variation. Do the experiment, see what you learn, but don’t shift your models of the world too much just because you see a p-value less than or more than 0.05 or 0.1 or whatever.

      • Yes, yes, you know I agree :). I was trying to say that this is a great opportunity for the researchers to learn about the vagaries of chance, by questioning their conclusions, and then designing a better study (failing, etc. etc.). They could learn this way not to be fooled by randomness.

        • I agree that this is a learning opportunity for research. However, I have a hard time understanding how an experiment that changes one word and measures the resulting comprehension provides any usable knowledge about anything – regardless of its statistical properties. It is not my field, so I am probably missing some underlying theory that would make this a reasonable research project. But, if such a theory exists, I probably wouldn’t find it very enlightening, given the highly context-dependent nature of the experiment. Would the findings tell us anything about anything other than the experimental circumstances they used (if they could even do that)?

        • The first thing we do, let’s kill all the lawyers.
          The first thing we do, let’s hire all the lawyers.

          People’s responses to these two statements could tell you a lot about their psychology, I think.

        • Yes, agreed, you won’t learn a thing given the current vagueness of the theory behind this. One broader insight these researchers could gain is that one can have thoughts about what is the case in the world of language. But one can’t always get experimental evidence for or against one’s ideas. Here, the research question is so vague as to be unanswerable; but one could imagine drilling down to a more and more detailed process model. Nobody has managed to come up with such a process model yet in psycholinguistics (and there are many seriously smart people there), so these researchers probably won’t make much headway.

          Even understanding that having a research question doesn’t mean you can find an answer, no matter what clever manipulation you come up with, that would be a real satori moment IMHO.

        • “Even understanding that having a research question doesn’t mean you can find an answer, no matter what clever manipulation you come up with, that would be a real satori moment IMHO”

          +1

      • Are we specifically talking about undergraduates? In my experience, except in very rare cases, the student is usually not ready intellectually to have their question recast into a tractable problem. They’ve had one course on empirical methods and one stats course. They know how to do a simple self-paced reading study or an acceptability rating study. At most, they know how to do a t-test. Their understanding of statistics is still shaky (which is absolutely OK). They want to try out their newly acquired knowledge on a problem *they* are curious about and really care about.

        It is very debilitating psychologically for a new student to come to a professor with a question that excites them but is fundamentally unanswerable, and then be told that that’s not a viable plan (but I have done just that in extreme cases–see below).

        From my perspective, there is value in letting students go through the paces; after all, even the mechanics of running a self-paced reading study or acceptability rating study for the first time (designing items, interacting with subjects so as not to bias them) are a huge education in themselves. Should I randomize the items every time I run the experiment or only once? What kind of fillers do I need? How many? Can the participants figure out what the experiment is about? How do I mask that? How do I filter out inattentive or non-cooperative participants? Why do I need to randomize the assignment of subjects to different groups in a Latin square?

        Then one does the study. One suddenly discovers that there are all kinds of patterns in the data one didn’t think of. The usual reaction of students is to try to revise the story post hoc. That’s a great moment, because you can show them their analysis plan, which is all written up, and ask them what they think is going on with all these seemingly unexpected patterns. More often, students find nothing and want to know what could have gone wrong. That’s also a great moment for teaching, because now the problem is very personal and real for them. I only grade theses (also PhDs) on the ideas and implementation, not on the results.

        The undergrad thesis at Potsdam is a short document which is just intended to give the student a flavor for what research is like. Some of these undergrad students go on to become really great researchers.

        Having said that, there was one case where an undergrad wanted to run an experiment with 180 conditions or something like that, to answer all possible open questions in linguistics. IIRC I did bring him down to 12 or so, and even that taught him a good lesson! :) He went on to do really phenomenal work. Some of the best work from my lab came from him.

        So from my experience, it’s good to be hands-off in these initial stages of getting into experimental science. Basically, these researchers (the ones discussed in Andrew’s post) did their first (admittedly crazy) experiment; now if only someone (preferably they themselves) would question their assumptions and the non-specificity of the underlying theory. It’s silly stuff, but silly in a good and possibly useful way.

        • One thing that can at least make a start on the problem is to have students (even in an intro stats course) do a group project, where the group chooses the question to study, has to come up with a Project Proposal (which the instructor reads and sends back for revision as needed), then gathers data, etc.

          See the links under “Materials Related to Projects” at https://web.ma.utexas.edu/users/mks/M358KInstr/M358KInstructorMaterials.html
          for more details on how I have carried this out.

  2. I’m surprised you didn’t comment on the saddest part of the Guardian article, “The authors of a 2017 study which found that reading science fiction ‘makes you stupid’ have conducted a follow-up that found that it’s only bad sci-fi that has this effect: a well-written slice of sci-fi will be read just as thoroughly as a literary story.” [Emphasis added.] In other words, it’s not repudiating the earlier study, or acknowledging that it’s noise, but rather picking a new forked path to follow. Of course, this could be the Guardian, and not the researchers.

    • Dzhaughn,

      Also, p=0.20 seems like no evidence at all (or maybe evidence of zero effect!) while p=0.01 seems like very strong evidence. But the corresponding z-scores are qnorm(0.9) = 1.3 and qnorm(0.995) = 2.6. The difference between these is 1.3, which is consistent with pure noise variation, as can be seen when compared to the normal(0, sqrt(2)) sampling distribution of the difference between two independent z-scores under the null hypothesis of no difference.

      The 0.049 vs. 0.051 thing buries the lede.
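
      To make the arithmetic above concrete, here is a short R check, using the same two-sided p-value convention as the qnorm calls in the previous paragraph:

        z1 <- qnorm(1 - 0.20 / 2)     # p = 0.20  gives  z of about 1.28
        z2 <- qnorm(1 - 0.01 / 2)     # p = 0.01  gives  z of about 2.58
        d  <- z2 - z1                 # about 1.3

        # Under the null, the difference between two independent standard-normal
        # z-scores is distributed normal(0, sqrt(2)), so a gap of 1.3 is unremarkable:
        d / sqrt(2)                   # about 0.9 standard deviations
        2 * (1 - pnorm(d / sqrt(2)))  # two-sided p-value for the gap, about 0.36

      In other words, the difference between the “significant” and “non-significant” results is itself nowhere near statistically significant.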
