Wow, just wow. If you think Psychological Science was bad in the 2010-2015 era, you can’t imagine how bad it was back in 1999

Shane Frederick points us to this article from 1999, “Stereotype susceptibility: Identity salience and shifts in quantitative performance,” about which he writes:

This is one of the worst papers ever published in Psych Science (which is a big claim, I recognize). It is old, but really worth a look if you have never read it. It’s famous (like 1400 citations). And, mercifully, only 3 pages long.

I [Frederick] assign the paper to students each year to review. They almost all review it glowingly (i.e., uncritically).

That continues to surprise and disappoint me, but I don’t know if they think they are supposed to (a politeness norm that actually hurts them given that I’m the evaluator) or if they just lack the skills to “do” anything with the data and/or the many silly things reported in the paper? Both?

I took a look at this paper and, yeah, it’s bad. Their design doesn’t seem so bad (low sample size aside):

Forty-six Asian-American female undergraduates were run individually in a laboratory session. First, an experimenter blind to the manipulation asked them to fill out the appropriate manipulation questionnaire. In the female-identity-salient condition, participants (n = 14) were asked [some questions regarding living on single-sex or mixed floors in the dorm]. In the Asian-identity-salient condition, participants (n = 16) were asked [some questions about foreign languages and immigration]. In the control condition, participants (n = 16) were asked [various neutral questions]. After the questionnaire, participants were given a quantitative test that consisted of 12 math questions . . .

The main dependent variable was accuracy, which was the number of mathematical questions a participant answered correctly divided by the number of questions that the participant attempted to answer.

And here were the key results:

Participants in the Asian-identity-salient condition answered an average of 54% of the questions that they attempted correctly, participants in the control condition answered an average of 49% correctly, and participants in the female-identity-salient condition answered an average of 43% correctly. A linear contrast analysis testing our prediction that participants in the Asian-identity-salient condition scored the highest, participants in the control condition scored in the middle, and participants in the female-identity-salient condition scored the lowest revealed that this pattern was significant, t(43) = 1.86, p < .05, r = .27. . . .

The first thing you might notice is that a t-score of 1.86 is not usually associated with “p < .05"--in standard practice you'd need the t-score to be at least 1.96 to get that level of statistical significance--but that's really the least of our worries here. If you read through the paper, you'll see lots and lots of researcher degrees of freedom, also lots of comparisons of statistical significance to non-significance, which is a mistake, and even more so here, given that they’re giving themselves license to decide on an ad hoc basis whether to count each particular comparison as “significant” (t = 1.86), “the same, albeit less statistically significant” (t = 0.89), or “no significant differences” (they don’t give the t or F score on this one). This is perhaps the first time I’ve ever seen a t score less than 1 included in the nearly-statistically-significant category. This is stone-cold Calvinball, of which it’s been said, “There is only one permanent rule in Calvinball: players cannot play it the same way twice.”
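
To put a number on that: here's a quick back-of-envelope check (mine, not anything in the paper) of what t(43) = 1.86 actually gives you under the usual one-sided and two-sided conventions:

```python
# Back-of-envelope check of the reported "t(43) = 1.86, p < .05" (my sketch, not the paper's analysis).
from scipy import stats

t_value, df = 1.86, 43
p_one_sided = stats.t.sf(t_value, df)      # upper-tail probability
p_two_sided = 2 * p_one_sided              # standard two-sided p-value

print(f"one-sided p = {p_one_sided:.3f}")  # roughly 0.035: below .05 only if you allow a one-sided test
print(f"two-sided p = {p_two_sided:.3f}")  # roughly 0.07: not below .05 by the usual convention
```

So the reported "p < .05" only works if you grant yourself the one-sided test, a point that comes up again in the comments below.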

Here’s the final sentence of the paper:

The results presented here clearly indicate that test performance is both malleable and surprisingly susceptible to implicit sociocultural pressures.

Huh? They could’ve saved themselves a few bucks and not run any people at all in the study, just rolled some dice 46 times and come up with some stories.

But the authors were from Harvard. I guess you can get away with lots of things if you’re from Harvard.

Why do we say this paper is so bad?

There’s no reason to suspect the authors are bad people, and there’s no reason to think that the hypothesis they’re testing is wrong. If they could do a careful replication study with a few thousand students at multiple universities, the results could very well turn out to be consistent with their theories. Except for the narrow ambit of the study and the strong generalizations made from just two small groups of students, the design seems reasonable. I assume the experiments were described accurately, the data are real, and there were no pizzagate-style shenanigans going on.

But that’s my point. This paper is notably bad because nothing about it is notable. It’s everyday bad science, performed by researchers at a top university, supported by national research grants, published in a top journal, cited 1069 times when I last checked—and with conclusions that are unsupported by the data. (As I often say, if the theory is so great that it stands on its own, fine: just present the theory and perhaps some preliminary data representing a pilot study, but don’t do the mathematical equivalent of flipping a bunch of coins and then using the pattern of heads and tails to tell a story.)

Routine bad science using routine bad methods, the kind that fools Harvard scholars, journal reviewers, and 1600 or so later researchers.

From a scientific standpoint, things like pizzagate or that Cornell ESP study or that voodoo doll study (really) or Why We Sleep or beauty and sex ratio or ovulation and voting or air rage or himmicanes or ages ending in 9 or the critical positivity ratio or the collected works of Michael Lacour—these miss the point, as each of these stories has some special notable feature that makes it newsworthy. Each has some interesting story, but from a scientific standpoint each of these cases is boring, involving some ridiculous theory or some implausible overreach or some flat-out scientific misconduct.

The case described above, though, is fascinating in its utter ordinariness. Scientists just going about their job. Cargo cult at its purest, the blind peer-reviewing and citing the blind.

I guess the Platonic ideal of this would be a paper publishing two studies with two participants each, and still managing to squeeze out some claims of statistical significance. But two studies with N=46 and N=19, that’s pretty close to the no-data ideal.

Again, I’m sure these researchers were doing their best to apply the statistical tools they learned—and I can only assume that they took publication in this top journal as a signal that they were doing things right. Don’t hate the player, hate the game.

P.S. One more thing. I can see the temptation to say something nice about this paper. It’s on an important topic, their results are statistically significant in some way, three referees and a journal editor thought it was worth publishing in a top journal . . . how can we be so quick to dismiss it?

The short answer is that the methods used in this paper are the same methods used to prove that Cornell students have ESP, or that beautiful people have more girls, or embodied cognition, or all sorts of other silly things that the experts used to tell us we “have no choice but to accept that the major conclusions of these studies are true.”

To say that the statistical methods in this paper are worse than useless (useless would be making no claims at all; worse than useless is fooling yourself and others into believing strong and baseless claims) does not mean that the substantive theories in the paper are wrong. What it means is that the paper is providing no real evidence for its theories. Recall the all-important distinction between truth and evidence. Also recall the social pressure to say nice things, the attitude that by default we should believe a published or publicized study.

No. This can’t be the way to do science: coming up with theories and then purportedly testing them by coming up with random numbers and weaving a story based on statistical significance. It’s bad when this approach is used on purpose (“p-hacking”) and it’s bad when done in good faith. Not morally bad, just bad science, not a good way of learning about external reality.

57 thoughts on “Wow, just wow. If you think Psychological Science was bad in the 2010-2015 era, you can’t imagine how bad it was back in 1999”

  1. “The main dependent variable was accuracy, which was the number of mathematical questions a participant answered correctly divided by the number of questions that the participant attempted to answer.”

    Wow! So, a participant who only attempts the problem(s) he or she knows how to do and then gets it (them) right scores 100%. That seems like it might be tricky to interpret….
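
    A toy illustration of why that metric is tricky (hypothetical numbers, nothing from the study):

    ```python
    # Accuracy as defined in the paper: correct / attempted (hypothetical participants).
    def accuracy(correct, attempted):
        return correct / attempted

    print(accuracy(6, 12))  # 0.50 -- attempts every question, gets half right
    print(accuracy(1, 1))   # 1.00 -- attempts only the one question they are sure of
    ```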

  2. “Not morally bad, just bad science,…”

    I’m ambivalent about this. These researchers spent other people’s money that was given to them in good faith.

    • I’m not ambivalent at all. It’s morally bad to do this kind of stuff. It’s equivalent to the “snake oil salesman” of the late 1800’s selling bullshit medicines with harmful ingredients to unsuspecting frontier townspeople.

        • It’s morally bad if they *should have known better* and if a relatively simple search of say Paul Meehl’s papers from the 1960’s would have informed them, and there are many such similar papers from others, and if simple simulation studies within reach of even 1999 era computers could have shown them the folly of their ways, and so on and so forth… then yes. It’s like the snake oil salesman just listening to what the snake oil company says and purposefully doesn’t do any real external research to see if the snake oil is in fact doing what the company says.

          That there are whole fields built up of just such snake oil salesmen, and whole university departments devoted to promoting such snake oil and their salespeople is also a serious **moral** failing.

          The problem here is that although what I’m saying above seems obvious, it’s not at all accepted. Even Andrew shies away from coming out and saying “The Emperor Has No Clothes, and has been actively harming his citizens for fun and profit”

        • Do you suspect that such a call-out would indict colleagues? That’s made me very careful about what I say.
          I wonder about everyone else.

        • I personally find it takes a lot of effort to avoid working with this type of project if you’re in academia. My own personal collaborators have been chosen carefully and I don’t think it would indict them. But I left academia after my PhD precisely because I didn’t think it was morally acceptable to participate in these kinds of shenanigans and yet it appeared to be nearly required by granting agencies and dept promotional policies.

          I know there are MANY students these days who are opting out of academia after PhDs or postdocs for similar reasons.

          There are many parallels with Serpico in my opinion.

  3. More context from the last paragraph of the paper:

    “Finally, finding that academic performance can be helped as well as hindered through implicit shifts in identification raises important challenges to notions of academic performance and intelligence. Although there is considerable debate about the nature of intelligence (Fraser, 1995; Neisser et al., 1996), strong supporters of genetic differences in IQ assume that ability is fixed and can be quantified through testing (Herrnstein & Murray, 1994). The results presented here clearly indicate that test performance is both malleable and surprisingly susceptible to implicit sociocultural pressures.”

    The reference to “Herrnstein and Murray, 1994” refers to the book “The Bell Curve.” I get the impression that the authors tortured the data until it confessed to a refutation of the claims in the book. That was the goal, and that is why it has been cited so many times.

    I’m not endorsing the book, just sayin’.

    • “The results presented here clearly indicate”

      **clearly**! :)

      But amazingly after all the contrivance they still weasel-word the statement with “indicate” instead of “show”! What’s the difference between “clearly indicate” and “clearly show” or “clearly demonstrate”?

  4. This basically all makes sense to me, but slightly confused about this line:

    > The first thing you might notice is that a t-score of 1.86 is not usually associated with “p < .05"–in standard practice you'd need the t-score to be at least 1.96 to get that level of statistical significance–but that's really the least of our worries here.

    Isn't this normal practice for a one-sided hypothesis test? (And z=1.96 is simply the p < .05 bar for a two-sided test.) Is it simply that setting up a one-sided hypothesis to weaken the requirements for p < .05 is ill advised?

    To be clear, I get that this is FAR from the point. The issue isn't the precise t-score they computed, it's the entire setup. The "hypothesis test" they examine is gibberish, for all the listed reasons (& the endless researcher degrees of freedom involved), so a result of t=2.5 would hardly be much better. (I also note the weird choice to study accuracy with a denominator of "questions attempted"–if that's actually what they did, that doesn't make much sense, imagine if you structured a typical math test that way…).

    • Magpie:

      The point here is that if you’re working within the hypothesis testing framework (which I don’t like), it’s standard practice to use the 2-sided test. The article states, “The results presented here clearly indicate that test performance is both malleable and surprisingly susceptible to implicit sociocultural pressures,” which among other things implies that effects could go in any direction. As you say, this is the least of the problems, as once you talk about “malleable and surprisingly susceptible” etc., you’re way deep into piranha territory.

      • I think they are using an alpha of .1. They may also be doing one-tailed. Both are pretty standardly described in intro stats books for psychologists.

        • Elin:

          I think it would be more accurate to say that they looked at a lot of data summaries and used these in a flexible way to tell a story. That explains how they could characterize a t score of 0.89 as “less statistically significant”; if it didn’t fit their story they could’ve called it a null or just not mentioned it.

  5. Apologies if the answer to this question seems obvious, but I am not familiar with this problem. What was wrong with psychological science in 2010-2015? How did the problem start and end?

  6. This article is full of contradictions. What are you saying? Are the research paper’s methods flawed, and is that why it’s bad, or is it bad because its results are not “notable” enough (whatever that means)? You then make claims that a lot of notable studies are bad because they are too ridiculous or, reasonably but in no way an aid to your argument, unscientific? There’s no actual argument against the paper aside from an appeal to statistical significance and a low population sample, both of which you excuse in the face of the seemingly reasonable design of the study. Plus, excuse me if I’ve been told wrong, but I’m pretty confident stereotype threat is a real, well documented phenomenon. Of course that does not mean the study is good, but it can explain why it is able to get away with its supposed flaws.

    • Abyss:

      To answer your questions in order:

      1. What I’m saying is that I agree with Shane Frederick that the paper is bad.

      2. Yes, I’m saying the paper’s methods are flawed. I wouldn’t say “that’s why it’s bad”; rather, I’d say that the paper has both methodological and conceptual flaws.

      3. No, I don’t think the paper is bad because its results are not notable. What I wrote is, “This paper is notably bad because nothing about it is notable.” What I meant was that the most notable thing about this story is how commonplace the problems were with this paper. I was writing this in a paradoxical way so I can see how this could have been confusing. When I said “nothing about it is notable,” I wasn’t talking about whether it is making notable claims; rather, I was talking about how there was nothing notable about its failures, which are shared by many papers.

      4. You ask for an “actual argument against the paper.” I just don’t think they demonstrated anything they claimed to demonstrate. To put it another way, whatever positive claims they are making are based on statistical significance, which in turn is based on the idea that, if there were nothing going on and the data were pure noise, such low p-values could not be found. But that reasoning is incorrect. As we’ve known for many years (at least since the famous 2011 paper by Simmons, Nelson, and Simonsohn on researcher degrees of freedom and p-hacking) it’s easy to get apparently statistically significant p-values from pure noise; see the quick simulation sketched after this list. Thus, they’re offering no real evidence for their claims. The low sample size is not a problem on its own; it’s just a clue that results will be noisy, which allows researchers to find apparently large patterns from pure noise. Given all this, it’s not particularly relevant if the design of the study is seemingly reasonable. It can be seemingly reasonable but just too noisy for anything useful to be learned.

      5. Stereotype threat may well be a real, well documented phenomenon. This is a topic of debate that we don’t need to get into here. As always in this sort of study, the lack of evidence for a claim does not imply that the claim is false. If someone wants to write a review article about stereotype threat or just a personal statement of belief in the idea, that’s fine; my problem is with a claim of evidence when there is none.

      6. I think the main reason the article was published despite its fatal flaws is that, back in 1999, researchers were not so aware of the problems with this sort of approach to research in the face of uncertainty. As I wrote, “I’m sure these researchers were doing their best to apply the statistical tools they learned—and I can only assume that they took publication in this top journal as a signal that they were doing things right. Don’t hate the player, hate the game.” Also see the last paragraph of the above post.
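
      Here is the quick simulation mentioned under point 4 (my own sketch: the group sizes match the paper's conditions, but the outcomes are pure noise and the menu of analyses is made up purely for illustration):

      ```python
      # How often pure-noise data, sized like the paper's three conditions, yields
      # at least one "p < .05" somewhere in a small menu of plausible analyses.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      n_sims, hits = 5000, 0

      for _ in range(n_sims):
          asian = rng.normal(size=16)
          control = rng.normal(size=16)
          female = rng.normal(size=14)

          p_values = [
              stats.ttest_ind(asian, female).pvalue,          # predicted-high vs predicted-low
              stats.ttest_ind(asian, control).pvalue,         # predicted-high vs control
              stats.ttest_ind(control, female).pvalue,        # control vs predicted-low
              stats.f_oneway(asian, control, female).pvalue,  # overall ANOVA
          ]
          # being generous about direction: one-sided reporting roughly halves a two-sided p-value
          if min(p_values) / 2 < 0.05:
              hits += 1

      print(f"'p < .05' somewhere in {hits / n_sims:.0%} of pure-noise datasets")
      ```

      The exact percentage doesn't matter; the point is that it comes out far above the nominal 5%.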

    • It is not a well-documented topic at all: every preregistered study of this effect has failed, and the last meta-analysis (done by Witchers) said it does not exist.

  7. “As we’ve known for many years (at least since the famous 2011 paper by Simmons, Nelson, and Simonsohn on researcher degrees of freedom and p-hacking) it’s easy to get apparently statistically significant p-values from pure noise. Thus, they’re offering no real evidence for their claims.”

    You are implying that the authors engaged in p-hacking, but you don’t seem to have provided evidence for this, or is your position that as long as p-hacking is a *possible* explanation for the results in a paper, then we should simply presume guilt?

    You did mention that there were lots of researcher degrees of freedom available, but is there any evidence that they took advantage of those degrees of freedom in order to p-hack a significant result (e.g. that they performed unreported comparisons, or failed to perform multiple comparison corrections, etc)?

    • Andy:

      There’s no “guilt,” especially considering that this paper was published over a decade before Simmons et al., and the authors were using standard practices! As I wrote above, “It’s bad when this approach is used on purpose (‘p-hacking’) and it’s bad when done in good faith.” Actually I prefer the term “forking paths” to “p-hacking” for exactly the reason that you say, that p-hacking sounds intentional whereas forking paths can occur with no intent. For further discussion see my paper with Loken. As we discuss in that paper, these problems arise even if the researchers reported every comparison they did. No intent is required; all that is needed is that the authors used the standard approach to data analysis, which was to go through the data and look for interesting things. The background is that when variation is high, it will be easy to find interesting-looking results from pure noise. This doesn’t mean that the substantive claims in the paper are wrong (or that they’re right); it’s just that these sorts of p-values are easy to obtain from noise, even with researchers following standard (for 1999) practice and with no hiding of comparisons or malign intent. I think that speaking of “guilt” is really the wrong thing to do here.

      • “these problems arise even if the researchers reported every comparison they did. No intent is required; all that is needed is that the authors used the standard approach to data analysis, which was to go through the data and look for interesting things.”

        If they really went through the data to “look for interesting things”, then the implication is that they did additional *informal* comparisons or assessments, and that those were unreported. But again, you don’t seem to be offering evidence that they actually made this kind of forking-path error, as opposed to merely saying that it’s *possible* that they did so? You say it was standard practice, but even at the time researchers were well aware that adjusting their hypothesis based on the data could be problematic, even if it was more common than it should have been, and under-discussed, and possible to do it unintentionally, etc.

        “There’s no ‘guilt’… I think that speaking of ‘guilt’ is really the wrong thing to do here.”

        Well, you said the paper was “notably bad” and “bad science”, which suggests that the authors were guilty of committing a scientific error, though not necessarily intentionally. Whether we want to use the word “guilty” for this seems somewhat semantic, but “bad science” is a serious allegation, particularly if your only evidence for the forking-path claim is that they *might* have adjusted their hypothesis based on the data.

        (I’m reposting this comment, since it didn’t seem to go through last time)

        • By the way, even if you are only intending a counterfactual claim that they *would* have tested a different hypothesis if the data had been different (per your link), then this seems even less reasonable as a basis for alleging “bad science” or forking-path errors, since this is pure speculation, i.e. you really don’t know what they would have done in those counterfactuals (e.g. perhaps they would have included qualifiers in that case about having shifted their initial hypothesis based on the data, etc).

          It’s one thing to say that these kinds of possibilities are a reason to be skeptical of non-preregistered studies (which is true), but it’s another to simply assert scientific errors as fact in cases where the errors are really just being assumed or speculated about. And my perception is that you often do this when critiquing alleged errors in studies. That said, if you have evidence that they actually did commit these kinds of forking-path errors, I’m curious to hear it.

        • Andy:

          Thanks for following up. These are tricky questions that have confused generations of social scientists and statisticians (including myself), so it’s good to have the opportunity to clarify. Perhaps I should write a full post on the topic. But for now I will answer briefly:

          1. Regarding how the researchers did their analyses: All things are possible but I see no evidence that the researchers decided all their analyses before seeing the data, nor do I have any reason to believe that they did so, given that (a) they did different analyses for different parts of their study, (b) they found statistically significant p-values despite having very noisy data, and (c) it was standard practice (and continues to be standard practice, outside of preregistered studies) to decide on the analyses after seeing the data. If the raw data were available, it should not be too difficult to do a multiverse analysis and look at various possible results that could’ve been found with these data.

          But the real point here is that doing analysis after seeing the data is the default. It’s what I’ve done in almost every applied project I’ve ever worked on. It’s standard practice in just about every non-preregistered study out there. There’s nothing weird about me assuming that these researchers did what just about everybody else does!

          2. Actually, though, by talking about p-hacking or forking paths or whatever, we’re focusing on the wrong thing. Let’s put it this way: Suppose the researchers had done a preregistered analysis, deciding on all their coding and analyses before seeing their data. That’s fine, they could do that—it’s not really what I’d recommend, but they could do it—but, the point is, that would not make this into a good study. Had they preregistered, the most likely result is not that they’d’ve obtained these statistically significant results; rather, the most likely result is that they would not have obtained statistical significance, and they would’ve either had to report null results or dip into post-hoc analysis. Now, there’s nothing wrong with null results—good science can lead to null results—it’s just that the study wouldn’t really be advancing our understanding of psychology.

          To put it another way, this study was dead on arrival—or, I should say, dead before data collection. It’s just too noisy a study. The signal-to-noise ratio is too low. It would be like, ummm, here’s an analogy I’ve used before . . . suppose I decide to measure the speed of light by taking a block of wood and a match, weighing them, and then I use the match to set the wood on fire and I carefully measure all the heat released by the fire and I also very carefully collect all the ash and weigh it. I can then estimate the speed of light as c = sqrt(E/m), where E is the energy released and m is the loss of mass (the mass of the original wood and match minus the mass of the ash). In practice, this will not give me a useful estimate of c, because there’s too much noise in the measurements.
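
          To put some rough numbers on that analogy (mine, assumed purely for illustration):

          ```python
          # Back-of-envelope for the burning-block "measurement" of c (all numbers assumed).
          import math

          E = 16e6                          # J: chemical energy from burning roughly 1 kg of wood
          c_true = 3.0e8                    # m/s
          true_mass_loss = E / c_true**2    # mass actually converted to energy: about 1.8e-10 kg
          scale_error = 1e-6                # kg: an optimistic 1 milligram of weighing error

          c_hat = math.sqrt(E / (true_mass_loss + scale_error))
          print(f"mass-energy loss to detect: {true_mass_loss:.1e} kg")
          print(f"estimated c with 1 mg of error: {c_hat:.1e} m/s (true value {c_true:.1e})")
          ```

          The signal is a fraction of a microgram, so even an absurdly good scale leaves an estimate of c that is off by more than an order of magnitude.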

          What I’m saying is, the fundamental problem with this study is not the statistics, it’s the measurement, or, we could say, the measurement and the theory. So why talk about forking paths at all? I’m only talking about forking paths because the published p-values are presented as representing strong evidence. The point of forking paths is that this helps us understand how researchers can routinely obtain “statistically significant” p-values from such noisy data. In the period around 2010, Greg Francis, Uri Simonsohn, and other researchers wrote a lot about how this could happen. Here’s a particularly charming example from Nosek et al.

          3. Regarding the word “guilt”: I have no problem saying that the authors were doing bad science—or, to be more precise, using scientific procedures that had essentially no chance of improving our understanding—but I’d rather not call them “guilty” as if they committed a crime. Recall that honesty and transparency are not enough. And, to put it in contrapositive form, just cos someone did bad science, it doesn’t mean they were dishonest; it just means they were using methods that didn’t work. I’m not making “a serious allegation”; we’re not talking about fraud or anything; they were just unfortunately going about things wrong.

          You don’t have to be a bad person to do bad science, just as you don’t have to be a good person to do good science. Science is a product of individual researchers and research teams, and it’s also a product of society. It happens sometimes that a scientific subfield gets stuck in a bad place where bad science gets done. That’s too bad, and sometimes it takes awhile for the subfield to get unstuck. To say this is not to assign guilt or to make allegations, it’s just a description. See this paper with Simine Vazire for a discussion of how this happened in psychology.

          If we want to think in terms of meta-science, we could think of the null hypothesis significance testing paradigm as being useful in some settings and not in others. The key statistical point is that null hypothesis significance testing can work well in a high-signal, low-noise setting but not so well when signal is low and noise is high. Discussions of p-hacking and forking paths are separate from this concern. It’s important to understand p-hacking and forking paths, because these help us understand how researchers can consistently produce those magic “p less than 0.05” results, but then we need to step back and think more carefully about what is being studied and how it’s being measured.

          I didn’t say all this in the above post because I’ve said it in bits and pieces in other places over the years (for example, this paper from 2015). Not every blog post is self contained. So thanks again for giving me the opportunity to elaborate.

        • Your block of wood analogy would not just be noisy but biased, because mass would be lost due to smoke and CO2 leaving the burning block.

        • Adede:

          Yes, you’d have to capture all the smoke and CO2 too, and also measure the oxygen at the start. But, yeah, bias is part of noise. The noise in social science studies includes bias.

        • Thanks for the detailed reply. Here are a couple of followup comments/questions:

          “doing analysis after seeing the data is the default. It’s what I’ve done in almost every applied project I’ve ever worked on. It’s standard practice in just about every non-preregistered study out there. There’s nothing weird about me assuming that these researchers did what just about everybody else does!”

          You suggested above that they engaged in (possibly unintentional) p-hacking or forking-paths, so presumably you were saying they did more than just look at the data before analysis. In particular, it’s one thing to do the analysis after seeing the data, it’s another to meaningfully adjust your analysis approach based on the data and not make any mention of this in the study (particularly if some potential approaches were passed over due to appearing less promising from the data). Are you really suggesting that it’s standard practice to inflate statistical significance with these practices, and not make any acknowledgement of it in the study (and that you do this in your own studies)?

          “To put it another way, this study was dead on arrival—or, I should say, dead before data collection. It’s just too noisy a study. The signal-to-noise ratio is too low… What I’m saying is, the fundamental problem with this study is not the statistics, it’s the measurement… I’m only talking about forking paths because the published p-values are presented as representing strong evidence.”

          Presumably if the effect size had been large enough they could have convincingly detected a stat sig effect even with this “noisy” small sample-size study, and it’s difficult to know what the effect-size is going to be before conducting a novel experiment, so I’m confused how this is a basis for calling it a “bad” study? On the other hand, if they engaged in p-hacking or forking-paths then I can understand how that would be a concern, but it sounds like you’re saying these latter issues are not really your main objection to the study.

          “I’m not making “a serious allegation”; we’re not talking about fraud or anything; they were just unfortunately going about things wrong.”

          If a prominent researcher publicly accuses you of “bad science” and says your paper is “so bad”, then I’m surprised you don’t see that that’s a serious allegation, from the standpoint of the author’s reputation and status as a competent researcher. (even if it’s a bit mitigating that you’re saying these kinds of errors were more common back then).

          @jim: I noticed your reply just as I was about to post this comment. As for your selection-bias concerns, I’m not sure how big of an issue it was in this case, but the study would still be valuable even if it turned out that stereotype threat only applied to a particular subset of asian women (even if non-representative, they were still randomized into the experimental conditions). Also, I’m actually sympathetic to your prior that the effect is somewhat implausible, but that’s not really a basis for claiming the study is “bad science”.

        • Andy:

          1. When you write, “it’s one thing to do the analysis after seeing the data, it’s another to meaningfully adjust your analysis approach based on the data and not make any mention of this in the study . . . ,” your mistake is to think there is a pre-existing “analysis approach” that would be adjusted. It’s more that there’s a general hypothesis that is not precisely tied to any specific hypothesis test or set of hypothesis tests. Recall that they also compare significance to non-significance, which is itself a statistical error.

          When you write, “Are you really suggesting that it’s standard practice to inflate statistical significance with these practices, and not make any acknowledgement of it in the study (and that you do this in your own studies)?”, again, you’re making the mistake of thinking there’s a pre-existing or Platonic “statistical significance” that can be inflated. A more accurate description is that there is a pile of data which the authors see, and then they do some analyses. Some of these analyses can be anticipated ahead of time; others can’t; but unless we are told otherwise I strongly doubt there’s any pre-analysis plan. To put it another way, if they had had a specific pre-analysis plan all decided before seeing the data, I think they would’ve acknowledged this in the study.

          In my own studies, yes, I gather data and then do analyses, which include some analyses I’d planned to do ahead of time and some new analyses, but there’s no sharp dividing line, and even the analyses that I’d anticipated doing are not precisely specified beforehand. There are lots of data coding decisions too. Again, I refer you to our multiverse study which is of somebody else’s paper but which lays out just some of the many many researcher degrees of freedom that come up with real data. I’m not inflating statistical significance with these practices, because statistical significance is not the product of my analyses; I’m not doing hypothesis tests.

          Yes, had the effect size been large enough the study could’ve been reasonable. But there’s no way the effect size could realistically have been this large. This is also a point discussed by Greg Francis, Uri Simonsohn, and others, that in the presence of truly large effect sizes we’d expect to see lots of tiny p-values such as 0.0001 etc, with occasional large p-values—not the range of p-values between 0.005 and 0.05 that typically show up in social science studies. This also comes up when people attempt to replicate such studies. Again, yes, given that these studies are hopelessly noisy, it would be better to see some preregistered replications to reveal the problems (I again point you to the Nosek et al. paper on 50 shades of gray), but the real problem here is a feedback loop of poorly-thought-through hypotheses, noisy designs for data collection, and data coding and analysis procedures that allow statistical significance from any data. That third step of the loop just provides encouragement for more poorly-thought-through hypotheses, etc. There’s a reason there was a replication crisis in social psychology, and there’s a reason that the Simmons et al. paper and related work by others around that time were so important.
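
          A quick illustration of that p-value pattern (my sketch; the effect sizes and sample sizes are arbitrary, chosen only to contrast a genuinely large effect with a small one):

          ```python
          # Distribution of two-sample t-test p-values under a large vs a small true effect.
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(1)

          def simulate_p(effect, n=20, n_sims=10_000):
              a = rng.normal(effect, 1, size=(n_sims, n))
              b = rng.normal(0, 1, size=(n_sims, n))
              return stats.ttest_ind(a, b, axis=1).pvalue

          for effect in (1.5, 0.3):
              p = simulate_p(effect)
              print(f"effect {effect}: {np.mean(p < 0.001):.0%} of p-values below 0.001, "
                    f"{np.mean((p > 0.005) & (p < 0.05)):.0%} between 0.005 and 0.05")
          ```

          With a truly large effect most p-values come out tiny; a pile of results sitting between 0.005 and 0.05 is the signature of marginal, noisy studies.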

          2. You write, “If a prominent researcher publicly accuses you of “bad science” and says your paper is ‘so bad,’ then I’m surprised you don’t see that that’s a serious allegation, from the standpoint of the author’s reputation and status as a competent researcher.” Sure, I’ve published some mistakes. I’ve published 4 papers that were wrong enough that I published correction notices! One of these was a false theorem, another was a data coding problem that invalidated the empirical claims in the paper. The other two were smaller errors that had more minor effects on the claims in the papers. I wouldn’t say this makes me an incompetent researcher, but it does mean I made some mistakes that got published. It is what it is. The authors of the paper we’re discussing here had the bad luck to have been working in a fatally flawed research paradigm. It happens. To say this is not a condemnation of them. It’s like, ummm, I dunno, suppose there’s some subfield of cancer research studying a particular mechanism of cancer, and this subfield involves hundreds or thousands of researchers working for 20 years on this mechanism, and then it turns out that the whole thing was a dead end, that this particular phenomenon does not cause or influence cancer. For any given researcher I’d have to say, yeah, it’s too bad, but none of their ideas panned out, that they were working a nonexistent seam and there was actually no gold at all to be found on that particular mountain. But . . . that’s how science goes sometimes. In the grand scheme of things, they were doing their best. I’ve had some ideas that led nowhere. If this sort of dead end happens to represent a large chunk of someone’s career, them’s the breaks. I don’t condemn people for having this sort of bad luck. Not everyone can be at the right place at the right time when it comes to scientific progress. One of the reasons I’ve written so much on these topics is to help future researchers not get stuck in this way! And one reason I write about these older articles is to provide some historical perspective, as well as some encouragement that things are getting better. I do think it’s less likely nowadays that a top journal would publish a paper like this. Top journals make other errors (to see some examples, search this blog for “regression discontinuity”), but I do think that by recognizing these errors and looking at them carefully, we can help researchers do better and stop them from wasting so much of their time and effort.

        • 1. “Your mistake is to think there is a pre-existing ‘analysis approach’ that would be adjusted.”

          I think you misunderstood my point, since I wasn’t assuming there was necessarily a pre-existing analysis approach. For instance, even if you develop your analysis after seeing the data, if you informally try two different approaches/metrics and then rely on the one that showed a larger effect, then that’s a case of “adjust[ing] your analysis approach based on the data”, and you have some obligation to include mention of this in the study. And even if there was no formal significance test for this comparison, it can still effectively be a form of p-hacking. My sense is that this kind of thing was (and is) fairly common, and it’s a matter of degree whether it’s egregious enough to make the p-values worthless. In this case, it seems possible that they just tried the first analysis approach that seemed reasonable, or that they informally looked at a couple approaches but the results were quite similar (and for all we know they would have reported if one of the informal comparisons had been notably worse).

          Another example of this is the sex-ratio paper, where you accused the authors of “multiple comparison” errors, but as far as I can tell you didn’t seem to have any evidence that they committed this error (just a suspicion that they *might* have, since there were some potential degrees of freedom in their chosen metric). However, comparing the most beautiful subgroup with the remaining population seems like a pretty natural choice, so for all we know it was the only one they tried (or they informally tried a couple others and the results were quite similar). So is it really reasonable to criticize the study for having “statistical error[s]”, based on this kind of speculative objection? Again, it’s one thing to raise this as a general concern or possibility, but surely some more direct evidence is needed to lodge this allegation against a specific paper.

          Also, it’s confusing to me that on the one hand your post seemed to acknowledge that this kind of unintentional p-hacking is “bad” (even “when done in good faith”), but in your comments you almost seem to be suggesting that it’s completely standard practice (even for yourself) and not really intended as an objection to the paper.

          2. “you’re making the mistake of thinking there’s a pre-existing or Platonic “statistical significance” that can be inflated.”

          Again, I think you misunderstood the argument I was making. I’m not claiming a pre-existing “platonic” statistical significance. Rather, I’m suggesting that authors should strive to report sufficient detail about the comparisons that were performed (including informal ones), such that if the paper is carefully replicated to match the original study, the expected p-value distribution is similar. Whereas if you perform a large number of unreported comparisons (even informal ones) which distort the p-value distribution relative to what would be expected based on the methods section, then I think researchers have long been aware that this is problematic and effectively a kind of implicit p-hacking. Again, it’s a matter of degree how objectionable this is, but you seemed to be alleging that the authors did this to such an extreme degree in this paper that the p-values were basically meaningless.

          2. “Yes, had the effect size been large enough the study could’ve been reasonable. But there’s no way the effect size could realistically have been this large.”

          I don’t see how the authors could have known this prior to conducting the study, but even if it’s true that the effect size was somewhat unlikely to be big enough for stat sig, it could still have been useful to find a non-significant trend in the expected direction (which could have justified further research with a larger sample size). Also, assuming they didn’t commit multiple comparison errors (e.g. via substantial unreported forking paths), then they had sufficient sample size to detect modest effect sizes. So I really don’t see how small sample size is a fair basis for calling this “bad science”, and I’m surprised you seem to be focusing on this noisy-study/small-sample issue as your primary objection.

          3. “There’s a reason there was a replication crisis in social psychology”

          We are in agreement that these kinds of errors are common and contributed to the replication crisis; so I suspect the main source of disagreement is just to what extent these objections apply to these specific papers, and whether we should lodge accusations of “bad science” based on speculative theories about errors the authors might have committed. This seems particularly striking to me in the beauty sex-ratio paper, but also in the current post.

          4. “Sure, I’ve published some mistakes. I’ve published 4 papers that were wrong enough that I published correction notices!”

          Of course if there are objective errors in a paper it’s fair to point those out (e.g. comparing sig to non-sig). I was just pushing back against more speculative accusations, which seemed to be based in large part on the mere *possibility* or availability of forking paths, rather than direct evidence that the papers took advantage of them. In general, it seems like you are advocating for a norm of research-assessment that is going to cause many perfectly legitimate contemporary papers to be accused of bad science (e.g. due to *possible* forking paths rendering the p-values meaningless), even if the researchers are quite diligent about avoiding these issues (in light of widespread awareness of the replication crisis, etc). And in my observation, there is already an increasingly prevalent norm in which papers are unfairly accused of being low quality merely based on small sample size, even if they have stat sig results (and no concrete evidence of p-hacking or forking paths).

        • Andy:

          I have no reason to believe that anyone is adjusting their analysis. What I think they’re doing is gathering data, coding it, performing many analyses, drawing conclusions, and writing up the results—in that order. I don’t think it’s controversial for me to think that’s what people are doing. It’s how people do research! Yes, every now and then people do preregistered studies, and then they tell us about it. It seems very farfetched to think that this paper published in 1999 had a preregistered plan for coding/analysis/conclusions. This is not something that people were doing.

          In the sex ratio paper, I didn’t “accuse” the authors of anything. Their analysis (comparing level 5 to levels 1-4) would represent a multiple comparisons error, whether or not they did any other analyses. You can look at the data and see that this is the only large difference, and there’s no a priori reason to make this comparison. The authors don’t even try to offer such a reason; indeed, in other papers on the topic they used different measures. A linear regression would be much more natural.

          Again, the key problem with the sex ratio paper is not the multiple comparisons. The key problem is that the study is way way way underpowered. To detect a realistic effect size on the order of 0.1 percentage point difference in sex ratio, you’d need data from hundreds of thousands of births, not the 3000 that were in the study. The reason for talking about multiple comparisons at all is to help us understand how it is that those researchers were able to find statistically significant p-values from data that we know are essentially pure noise (in this case, we know that from straight mathematical arguments plus a basic knowledge of historical statistics on sex ratios). I don’t really care if they got p less than 0.05 by trying lots of analyses until they found something that worked; or if they informally looked at the data, saw an interesting pattern, and calculated the p-value, thus doing only one analysis; or if they had a preregistered plan to look at that particular comparison and for some reason they never informed anyone of this plan. For the goal of understanding sex ratios, it doesn’t matter because the data are essentially pure noise.
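
          For the record, the back-of-envelope calculation behind that claim goes roughly like this (my sketch; the 0.1-percentage-point effect and the 51.2% baseline are assumed round numbers):

          ```python
          # Standard error of a difference in the proportion of boys between two equal-sized groups of births.
          import math

          p = 0.512          # baseline proportion of boys
          effect = 0.001     # plausible difference: about 0.1 percentage points

          def se_diff(total_births):
              n_each = total_births / 2
              return math.sqrt(2 * p * (1 - p) / n_each)

          for n in (3_000, 300_000, 3_000_000):
              print(f"n = {n:>9,}: se = {se_diff(n):.4f}, effect / se = {effect / se_diff(n):.2f}")
          ```

          Even with hundreds of thousands of births the plausible effect is no bigger than its standard error; with 3000 it is hopeless.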

          Again, when people get statistically significant p-values from very noisy data, it makes sense to think that they did data coding and analyses contingent on the data. Especially in settings where the authors never claimed otherwise.

          Regarding the general question of how to handle multiple comparisons, my recommendation is not to preregister a single comparison, nor to pick the largest comparison and make a multiple comparisons correction. Rather, my recommendation is to look at all comparisons, fit them using a multilevel model, and display them graphically; see here and here. This is kind of separate from the problems with the papers discussed above, but I just wanted to let you know that I do have some positive recommendations here!

        • 1. “It seems very farfetched to think that this paper published in 1999 had a preregistered plan for coding/analysis/conclusions.”

          I explicitly said that this is *not* what I was assuming, e.g.: “I think you misunderstood my point, since I wasn’t assuming there was necessarily a pre-existing analysis approach.”

          2. “In the sex ratio paper, I didn’t “accuse” the authors of anything. Their analysis (comparing level 5 to levels 1-4) would represent a multiple comparisons error, whether or not they did any other analyses.”

          I don’t understand this claim. Suppose for simplicity that they only tried this one comparison, and never even considered comparing other permutations (e.g. perhaps the comparison with the extreme case seemed most promising to them, or they suspected non-linearities at the positive extreme). Are you really suggesting that this still would have been a multiple comparison error?

          It sounds like you think a linear regression would be more “natural”, but the allegation of a multiple comparison error would seem to imply that they at least informally checked other options without reporting doing so (or at least would have done so in some counterfactual scenario).

        • Andy:

          1. What I think was done in that 1999 paper is what is done in just about every other project without a preregistration plan, which is gathering data, coding it, performing many analyses, drawing conclusions, and writing up the results—in that order. That’s standard practice and I have no reason to think it wasn’t done in that particular paper.

          2. Yes, even if they only tried that one comparison in that sex ratio paper, I think it would be a multiple comparisons problem, and I think that helps us understand how they obtained a statistically significant p-value from what is essentially pure noise.

        • 1. “What I think was done in that 1999 paper is what is done in just about every other project without a preregistration plan, which is gathering data, coding it, performing many analyses, drawing conclusions, and writing up the results—in that order.”

          I’m not disputing that. However, following this approach tells us fairly little about the extent to which the authors performed informal or unreported comparisons, which could have inflated the statistical significance (relative to what might be expected from the study description). And the degree to which this occurred is crucial to assessing your claim that the p-values in these specific papers are meaningless due to forking-paths, etc.

          2. “Yes, even if they only tried that one comparison in that sex ratio paper, I think it would be a multiple comparisons problem”

          I’m pretty baffled by this, particularly since I specifically ruled out the counterfactual-tests loophole. Your view seems to effectively imply that we would need multiple-comparison corrections even if we only ever intended a single comparison, merely because there are related comparisons that some other researcher might find natural to perform.

          It would be one thing if you suggested that these degrees-of-freedom are evidence for *suspecting* unreported multiple comparisons, but I’m really surprised that you seem to be treating this error as basically unavoidable with the paper’s metric.

        • Andy:

          Ultimately what’s important for the science in these studies is what the data say, not what particular analyses the researchers happened to do or what particular path they took to get to those analyses. It doesn’t really matter to me whether the authors of a particular study did 100 analyses and picked the best one, or if they did 99 informal comparisons before deciding on their analysis, or if they didn’t do any informal comparisons but just looked at the data and started pulling out interesting things they found. Again, my concern is not to find the authors guilty of anything; I’m interested in what can be learned about psychology or biology from these data. And the answer is—just about nothing! Forking paths etc. are of interest not so much for scientific reasons here but rather for meta-science, to help us understand how the scientific community managed to fool itself.

        • Andy wrote:

          “but the study would still be valuable even if…”

          I strongly disagree. If I select 46 grains of sand from a beach with no criteria whatsoever, the analysis of any properties of those grains might yield accurate information about the beach. Or not. However, there’s no way to know, ever, if that information is accurate except to complete another more detailed study – rendering the initial study meaningless anyway.

          The low numbers alone doom the study. We already know the psychology of human beings is complex, so even a perfect result would be suspicious with such low numbers.

        • jim,

          More like analyzing 46 grains of sand that happened to get tracked home from the beach on my flip-flops. The only thing less illuminating than a tiny sample is a tiny convenience sample.

        • Andy:

          I was going to respond to this earlier but got called away. Andrew pointed out many things I was going to say that he would have said :). In particular:

          “So why talk about forking paths at all? I’m only talking about forking paths because the published p-values are presented as representing strong evidence. ”

          But Andrew is speaking in generalities and I’d like to highlight some of the specifics. First, we have a study that’s testing these hypotheses:

          1) when undergraduate Asian American females are reminded of the fact that they are female (and thus by implied stereotype bad at math) by some priming text, do they perform worse on math tests?

          2) when undergraduate Asian American females are reminded of the fact that they are Asian (and thus by implied stereotype good at math) by some priming text, do they perform better on math tests?

          There’s no indication of who these women are other than that they are undergraduate Asian American women, nor where they come from, their academic background or anything else. Presumably, they’re somewhere in the upper two thirds of “ability” among undergraduate Asian American females (because they’re in college), but that’s just a guess. They could be all engineering students or all Asian American Studies majors. So there’s already huge potential for selection bias.

          That potential is amplified dramatically when we see how many of these undergraduate Asian American females participated in the various questions on the first “experiment”:

          “In the female-identity-salient condition, participants (n = 14)”
          “In the Asian-identity-salient condition, participants (n = 16)”
          “In the control condition, participants (n = 16)”

          There were 16 or fewer undergraduate Asian American females in each prong of the experiment.

          They were tested by 12 problems / 20 min from the Canadian math competition for high school students. We have no information about the problems, or the relative ability of the women tested, even though they were asked to voluntarily provide their SAT math scores.

          Remember that we’re not trying to find out if undergraduate Asian American females perform better or worse on math tests. We’re trying to find out if they perform worse on math tests **because they were recently reminded of the fact that they are women**, and this reminding is presumed to automatically trigger some psychological response regarding stereotypes about women and math. And we’re trying to find out if they perform better on math tests **because they were recently reminded of the fact that they are Asian**, and this reminding is presumed to automatically trigger some psychological response regarding stereotypes about Asians and math.

          I personally believe this is a patently ridiculous idea. However, if the experimenters had carefully selected thousands of study participants to reflect a cross-section of Asian women in American society; if they had tested them all independently first with a significant math test, then later retested them with priming, using a math test of the same difficulty, I might be interested in the results.

          As it stands, however, the results hardly matter. Such a complicated psychological phenomenon as response to stereotypes, tested on an insignificant number of people with a basically unknown math test and no effort to select a representative cross-section of some particular group?

          I’m not that worried about judging the individuals. But would I, for example, stake millions of dollars on some policy that depended on this study being right? Noooo.

        • jim,

          In other words, it’s a research hypothesis not worth the effort of exploring on anything other than a no-cost convenience sample orders of magnitude too small to supply any meaningful evidence.

          I was vaguely aware in my undergraduate days that Psych classes routinely had to participate in little toy “studies” like this, presumably to illustrate the basics of performing experiments on human subjects. Until I started reading this blog I honestly had no idea such trivialities were actually published as though they were legitimate research.

        • ‘Psych classes routinely had to participate in little toy “studies” like this’

          Ha, funny you say that, I was going to suggest this “research” was probably done in a single 50 minute lecture session! :)

        • The policy doesn’t depend on the study being right. The policy is predetermined; the ‘study’ is justification for what they want to do anyway.

        • Bob:

          Sometimes yes. Other times, if the data summary comes out differently than expected, the researchers will draw the opposite conclusion. What’s important is not so much “X increases Y” but, rather, “X matters.” These scientists are not selling a treatment for Y, they’re selling X, where X is “evolutionary psychology” or “incentives matter” or “embodied cognition” or “nudge” or “ESP” or whatever. Any effect on anything in any direction is a win.

  8. Andrew,

    Since this is true

    > it’s easy to get apparently statistically significant p-values from pure noise

    how do I know if my study is actually telling me something real? If I try to regularize using a prior based on the literature, that won’t help much since the literature seems to think these effects are common and large. So how do I know when to believe my own study’s results?
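
    As a concrete illustration of the quoted claim, here is a minimal pure-noise simulation. The setup (two groups of 20 and five unrelated outcome measures) is invented for illustration and not taken from any particular study; it just mimics a researcher with several outcomes to look at.

    ```python
    # Pure-noise sketch: two groups of 20, five unrelated outcomes, no true effect
    # anywhere. Count how often at least one of the five two-sample t-tests comes
    # out "significant" at p < .05 just by chance.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_per_group, n_outcomes, n_sims = 20, 5, 5_000

    hits = 0
    for _ in range(n_sims):
        a = rng.normal(size=(n_per_group, n_outcomes))
        b = rng.normal(size=(n_per_group, n_outcomes))
        pvals = [stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(n_outcomes)]
        hits += min(pvals) < 0.05

    print(f"share of pure-noise datasets with at least one p < .05: {hits / n_sims:.2f}")
    # Roughly 1 - 0.95**5, about 0.23, and that is before counting flexible
    # exclusions, subgroups, interactions, and the other forking paths.
    ```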

    • Anon:

      High-quality data is a start. It’s tough to learn from noisy measurements. If there’s a lot of variability between people, you’ll want within-person comparisons, or else you’ll need to gather lots of data. Beyond that, you can replicate your study under different conditions. “When to believe” is a continuum; the more you learn, the more you can modify your understanding.

      • > “When to believe” is a continuum; the more you learn, the more you can modify your understanding.

        Can you give some anchor points? To give an example: Suppose embodied cognition seems reasonable to me, like it did in 2010, and suppose I’ve gotten 100 20-year-olds to walk down the hallway slightly slower on average after speaking “Bingo/Florida/AARP.” What should I think?

        • Anon:

          The first step here is to realize that there will always be some difference; it will never be zero. The next step is to estimate the variation between people and to get a sense of how small an average difference is detectable given your sample size. In the example of the famous embodied cognition experiment, any true difference in speed would have to be very large to be detectable. The next step is to replicate; to study how the difference varies under different conditions. If you look carefully at the literature you’ll find that this was not really done; or, to be more precise, it was done and it was found that the result did not replicate, but many researchers did not realize that because they looked at a different interaction in each study. What they did was, for each study they started from the perspective that there was something statistically significant to be found, and then they found it. So one important step here is to be open to the possibility that the data are too noisy to tell us anything. That sounds pretty obvious, but many researchers don’t seem to be willing or able to take that step!
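
          To make the “how small a difference is detectable” step concrete, here is a rough back-of-the-envelope sketch. The numbers (50 walkers per condition and a between-person standard deviation of 1 second in walking time) are assumptions picked for illustration, not estimates from the actual experiment.

          ```python
          # Rough power sketch with assumed numbers: 50 per group, between-person SD of
          # 1.0 second in time to walk the hallway. What mean difference is needed for
          # a two-sided test at alpha = .05 to have 80% power?
          import math
          from scipy import stats

          n_per_group = 50
          sd = 1.0                                   # assumed between-person SD, seconds
          alpha, power = 0.05, 0.80

          se_diff = sd * math.sqrt(2 / n_per_group)  # SE of the difference in means
          z_alpha = stats.norm.ppf(1 - alpha / 2)    # about 1.96
          z_power = stats.norm.ppf(power)            # about 0.84

          min_detectable = (z_alpha + z_power) * se_diff
          print(f"SE of the difference in means: {se_diff:.2f} s")
          print(f"difference needed for 80% power: {min_detectable:.2f} s "
                f"({min_detectable / sd:.2f} SDs)")
          # About 0.56 s here, more than half a standard deviation: a large effect to
          # expect from a subtle prime. Anything much smaller is mostly lost in the noise.
          ```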

    • There is a confluence of factors that affect whether or not one admits new facts / beliefs / theories into one’s corpus of belief. Many of these are domain- and situation-specific, so it would be impossible to sum it all up in these short comments! That said, a very common strategy employed in good science is to make “risky predictions”* and to concoct experiments to find out whether those predictions pan out. As more and more “risky” predictions result in success, one naturally (and rationally) starts to believe that one really has something going with the theory or whatever else one is testing.

      An alternative preferred by many (most?) researchers in some fields: gather some data for some vague reason (one that has multiple paths to map onto the data) and check for a direction of effect using significance testing. Now, which seems more persuasive to you: A) making risky predictions and confirming in the lab that they pan out, or B) a vague collection regime combined with a flimsy significance test for direction?

      *The “risky” part meaning that the prediction would be unlikely to pan out without the theory or whatever else you are testing; a toy numerical illustration follows below. Note: this is covered in great detail in the writings of Paul E. Meehl. One of his best is here and I recommend it to you: https://meehl.umn.edu/sites/meehl.umn.edu/files/files/113theoreticalrisks.pdf
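
      Here is a toy numerical sketch of that contrast, with every number invented for illustration. In a world where small “crud-factor” effects are scattered around zero whether or not the theory is any good, a merely directional prediction succeeds by luck about half the time, while a narrow point prediction almost never does.

      ```python
      # Toy illustration of the risky-prediction idea (all numbers assumed).
      # True effects are scattered around zero regardless of theory ("crud factor").
      # Compare a directional prediction ("X increases Y") with a narrow point
      # prediction ("the effect is 0.15, give or take 0.02").
      import numpy as np

      rng = np.random.default_rng(2)
      true_effects = rng.normal(loc=0.0, scale=0.2, size=100_000)

      directional_hit = (true_effects > 0).mean()
      risky_hit = (np.abs(true_effects - 0.15) < 0.02).mean()

      print(f"directional prediction right by luck:  {directional_hit:.2f}")
      print(f"narrow point prediction right by luck: {risky_hit:.3f}")
      ```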

  9. Andrew:

    I’ve learned a lot from your posts on psychology’s reproducibility crisis and the shoddy methods that permeate so much of the psych literature. What would you say to students who hope to become productive psych researchers without making the same mistakes as people like Cuddy, Kanazawa, and the many researchers who routinely produce papers like this one? Is a thorough schooling in stats the best preparation for hopeful future scientists? The message I’ve taken away from what I’ve learned about bad science can be summed up as “Don’t do what those guys did.” What would you add to that?

  10. Wow, major deja vu. Having been in the building where it happened, and vividly remembering liking this paper when it came out, let me begin by saying two nice things. First, if stereotype threat is a “thing,” then the idea of taking a population for whom it is possible to activate two cultural stereotypes, leaning in opposite performance directions, is genuinely clever. It was about 25 years ago, but I still remember hearing they were working on it and thinking what a great idea. Second, in terms of the statistical analysis, notice the dedicated reporting of effect sizes for every test. The training in the psych department was led by Rosenthal, and the weekly Rosenthal/Rubin Stats lunch was popular. Effect size, effect size, effect size was emphasized, a progressive stance in the mid-nineties when you could still get away with basically only putting (p < .05) after every claim in your paper.

    But yeah … reading it in 2021, it’s a highly optimistic presentation of relatively weak data. The idea was clever and the data “kind of sort of” supported it, and that was enough to bundle it up and send it off. The flaws seem awfully transparent and obvious now. For some historical perspective, however, check out the APA Task Force Guidelines from 1999 (Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594–604). Read it to understand the culture of the time. It would be interesting to debate whether the stereotype paper adequately tried to respond to the Task Force proposals (which were circulating).

    Perhaps a more interesting debate is whether the 1999 Task Force proposals sufficiently addressed the statistical issues that would drive the replication crisis that exploded into view 10 years later. There's a case for yes, and a case for no. The Task Force gave great advice, but also advocated for researcher flexibility under the assumption that science would be self-correcting with regard to erroneous claims.

  11. Hi Andrew,

    Psychological Science is still in bad shape when it comes to certain areas. For example, considering what we know about “priming” research, check out this replication of “accuracy primes” to counter misinformation on social media. It seems a very large sample was needed to replicate the priming effect, and even then with quite a large p-value (a tiny effect). The Bayesian analysis seems more informative. I guess it’s good that the journal Psychological Science now also publishes replications of earlier findings that may have been dubious?

    https://journals.sagepub.com/doi/full/10.1177/09567976211024535
