Bruised and battered, I couldn’t tell what I felt. I was ungeneralizable to myself.

One more rep.

The new thing you just have to read, if you’re following the recent back-and-forth on replication in psychology, is this post at Retraction Watch in which Nosek et al. respond to criticisms from Gilbert et al. regarding the famous replication project.

Gilbert et al. claimed that many of the replications in the replication project were not very good replications at all. Nosek et al. dispute that claim.

And, as I said, you’ll really want to read the details here. They’re fascinating, and they demonstrate how careful the replication team really was.

When reading all this debate, it could be natural as an outsider to want to wash your hands of the whole thing, to say that it’s all a “food fight,” why can’t scientists be more civil, etc. But . . . the topic is important. These people all care deeply about the methods and the substance of psychology research. It makes sense for them to argue and to get annoyed if they feel that important points are being missed. In that sense I have sympathy for all sides in this discussion, and I don’t begrudge anyone their emotions. It’s also good for observers such as Uri Simonsohn, Sanjay Srivastava, Dorothy Bishop, and myself to give our perspectives. Again, there are real issues at stake here, and there’s nothing wrong—nothing wrong at all—with people arguing about the details while at the same time being aware of the big picture.

Before sharing Nosek et al.’s amazing, amazing story, I’ll review where we are so far.

Background and overview

As most of you are aware (see here and here), there is a statistical crisis in science, most notably in social psychology research but also in other fields. For the past several years, top journals such as JPSP, Psych Science, and PPNAS have published lots of papers that have made strong claims based on weak evidence. Standard statistical practice is to take your data and work with it until you get a p-value of less than .05. Run a few experiments like that, attach them to a vaguely plausible (or even, in many cases, implausible) theory, and you got yourself a publication. Give it a bit more of a story and you might get yourself on Ted, NPR, Gladwell, and so forth.

The claims in all those wacky papers have been disputed in three mutually supporting ways:

1. Statistical analysis shows how it is possible—indeed, easy—to get statistical significance in an uncontrolled study in which rules for data inclusion, data coding, and data analysis are determined after the data have been seen. Simmons, Nelson, and Simonsohn called it “researcher degrees of freedom” and Eric Loken and I called it “the garden of forking paths.” It’s sometimes called “fishing” or “p-hacking” but I don’t like those terms as they can be taken to imply that researchers are actively cheating. (A small simulation sketch of this point appears just before the “Putting the evidence together” section below.)

Researchers do cheat, but we don’t have to get into that here. If someone reports a wrong p-value that just happens to be below .05 when the correct calculation would give a result above .05, or if someone claims that a p-value of .08 corresponds to a weak effect, or if someone treats the difference between a significant and a non-significant result as itself meaningful, I don’t really care if it’s cheating or just a pattern of sloppy work.

2. People try to replicate these studies and the replications don’t show the expected results. Sometimes these failed replications are declared to be successes (as in John Bargh’s notorious quote, “There are already at least two successful replications of that particular study . . . Both articles found the effect but with moderation by a second factor” [actually a different factor in each experiment]), other times they are declared to be failures (as in Bargh’s denial of the relevance of another failed replication which, unlike the others, was preregistered). The silliest of all these was Daryl Bem counting as successful replications several non-preregistered studies which were performed before his original experiment (anything’s legal in ESP research, I guess), and the saddest, from my perspective, came from the ovulation-and-clothing researchers who replicated their own experiment, failed to find the effect they were looking for, and then declared victory because they found a statistically significant interaction with outdoor temperature. That last one saddened me because Eric Loken and I repeatedly advised them to rethink their paradigm but they just fought fought fought and wouldn’t listen. Bargh I guess is beyond redemption, so much of his whole career is at stake, but I was really hoping those younger researchers would be able to break free of their statistical training. I feel so bad partly because this statistical significance stuff is how we all teach introductory statistics, so I, as a representative of the statistics profession, bear much of the blame for these researchers’ misconceptions.

Anyway, back to the main thread, which concerns the three reasons why it’s ok not to believe in power pose or so many of these other things that you used to read about in Psychological Science.

Here’s the final reason:

3. In many cases there is prior knowledge or substantive theory that the purported large effects are highly implausible. This is most obvious in the case of that ESP study or when there are measurable implications in the real world, for example in that paper that claimed that single women were 20 percentage points more likely to support Obama for president during certain times of the month, or in areas of education research where there is “the familiar, discouraging pattern . . . small-scale experimental efforts staffed by highly motivated people show effects. When they are subject to well-designed large-scale replications, those promising signs attenuate and often evaporate altogether.”

Item 3 rarely stands on its own—researchers can come up with theoretical justifications for just about anything, and indeed research is typically motivated by some theory. Even if I and others might be skeptical of a theory such as embodied cognition or himmicanes, that skepticism is in the eye of the beholder, and even a prior history of null findings (as with ESP) is no guarantee of future failure: again, the researchers studying these things have new ideas all the time. Just cos it wasn’t possible to detect a phenomenon or solve a problem in the past, that doesn’t mean we can’t make progress: scientists do, after all, discover new planets in the sky, cures for certain cancers, cold fusion, etc.

So if my only goal here were to make an ironclad case against certain psychology studies, I might very well omit item 3 as it could distract from my more incontestable arguments. My goal here, though, is scientific not rhetorical, and I do think that theory and prior information should and do inform our understanding of new claims. It’s certainly relevant that in none of these disputed cases is the theory strong enough on its own to hold up a claim. We’re disputing power pose and fat-arms-and-political-attitudes, not gravity, electromagnetism, or evolution.
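To make item 1 concrete, here is a minimal simulation sketch of the forking-paths point. It is not a model of any particular study; the outcomes, subgroups, and sample sizes are invented for illustration.

```python
# Even with no true effect anywhere, a handful of post-hoc analysis choices
# (which outcome to report, whether to look only at a subgroup) pushes the
# chance of finding *some* p < .05 well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group = 5000, 25
hits = 0

for _ in range(n_sims):
    # Null world: treatment and control drawn from the same distribution,
    # two outcome measures, and a binary covariate (say, sex).
    treat = rng.normal(size=(n_per_group, 2))
    ctrl = rng.normal(size=(n_per_group, 2))
    sex_t = rng.integers(0, 2, n_per_group)
    sex_c = rng.integers(0, 2, n_per_group)

    pvals = []
    for k in range(2):                      # forking path 1: pick either outcome
        pvals.append(stats.ttest_ind(treat[:, k], ctrl[:, k]).pvalue)
        for s in (0, 1):                    # forking path 2: or analyze one subgroup
            t_sub, c_sub = treat[sex_t == s, k], ctrl[sex_c == s, k]
            if len(t_sub) > 1 and len(c_sub) > 1:
                pvals.append(stats.ttest_ind(t_sub, c_sub).pvalue)

    hits += min(pvals) < 0.05               # report whichever comparison "worked"

print(f"rate of finding some p < .05 under the null: {hits / n_sims:.2f}")
# Well above 0.05 (roughly 0.2 with these particular forking paths).
```

The point is not that anyone runs exactly these comparisons; it’s that a few forking paths are enough to make a nominal “p less than .05” close to meaningless as evidence.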

Putting the evidence together

For many of these disputed research claims, statistical reasoning (item 1 above) is enough for me to declare Not Convinced and move on, but empirical replication (item 2) is also helpful in convincing people. For example, Brian Nosek was convinced by his own 50 Shades of Gray experiment. There’s nothing like having something happen to you to really make it real. And theory and prior experience (item 3) tells us that we should at least consider the possibility that these claimed effects are spurious.

OK, so here we are. 2016. We know the score. A bunch of statistics papers on why “p less than .05” implies so much less than we used to think, a bunch of failed replications of famous papers, a bunch of re-evaluations of famous papers revealing problems with the analysis, researcher degrees of freedom up the wazoo, miscalculated p-values, and claimed replications which, when looked at carefully, did not replicate the original claims at all.

This is not to say that all or even most of the social psychology papers in Psychological Science are unreplicable. Just that many of them are, as (probabilistically) shown either directly via failed replications or statistically through a careful inspection of the evidence.

Given everything written above, I think it’s unremarkable to claim that Psychological Science, PPNAS, etc., have been publishing a lot of papers with fatal statistical weaknesses. It’s sometimes framed as a problem of multiple comparisons but I think the deeper problem is that people are studying highly variable and context-dependent effects with noisy research designs and often with treatments that seem almost deliberately designed to be ineffective (for example, burying key cues inside of a word game; see here for a quick description).

So, I was somewhat distressed to read this from a recent note by Gilbert et al., taking no position on whether “some of the surprising results in psychology are theoretical nonsense, knife-­edged, p-­hacked, ungeneralizable, subject to publication bias, and otherwise unlikely to be replicable or true” (see P.S. here).

I could see the virtue of taking an agnostic position on any one of these disputed public claims: Maybe women really are three times more likely to wear red during days 6-14 of their cycle. Maybe elderly-related words really do make people walk more slowly. Maybe Cornell students really do have ESP. Maybe obesity really is contagious. Maybe himmicanes really are less dangerous than hurricanes. Maybe power pose really does help you. Any one of these claims might well be true: even if you study something in such a noisy way that your data are close to useless, even if your p-values mean nothing at all, you could still have a solid underlying theory and have got lucky with your data. So it might seem like a safe position to keep an open mind on any of these claims.

But to take no position on whether some of these “surprising results” have problems? That’s agnosticism taken to a bit of an extreme.

If they do take this view, I hope they’ll also take no position on the following claims which are supported just about as well from the available data: that women are less likely to wear red during days 6-14 of their cycle, that elderly-related words make people walk faster, that Cornell students have an anti-ESP which makes them consistently give bad forecasts (thus explaining that old hot-hand experiment), that obesity is anti-contagious and when one of your friends gets fat, you go on a diet, etc.

Let’s keep an open mind about all these things. I, for one, am looking forward to the Ted talks on the “coiled snake” pose and on the anti-contagion of obesity.

The new story

OK, now you should go here and read the story from Brian Nosek and Elizabeth Gilbert (no relation to the Daniel Gilbert of “Gilbert et al.” discussed above). They take one of the criticisms from Gilbert et al., who purported to show how unfaithful one of the replications was, and carefully and systematically describe the study, the replication, and why that criticism was at best sloppy and misinformed, and at worst a rabble-rousing, misleading bit of rhetoric. As I said, follow the link and read the story. It’s stunning.

In a way it doesn’t really matter, but given the headlines such as “Researchers overturn landmark study on the replicability of psychological science” (that was from Harvard’s press release; I was going to say I’d expect better from that institution where I’ve studied and taught, but it’s not fair to blame the journalist who wrote the press release; he was just doing his job), I’m glad Nosek and E. Gilbert went to the trouble to explain this to all of us.

P.S. I’m about as tired of writing about all this as you are of reading about it. But in this case I thought the overview (in particular, separating items 1, 2, and 3 above) would help. The statistical analysis and the empirical replication studies reinforce each other: the statistics explains how those gaudy p-values could be obtained even in the absence of any real and persistent effect, and the empirical replications are convincing to people who might not understand the statistics.

P.P.S. I just noticed that the Harvard press release featuring Gilbert et al. also says that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

100%, huh? Maybe just to be on the safe side you should call it 99.9% so you don’t have to believe that the Cornell ESP study replicates.

What a joke. Surely you can’t be serious. Why didn’t you just say “Statistically indistinguishable from 200%”—that would sound even better!

50 thoughts on “Bruised and battered, I couldn’t tell what I felt. I was ungeneralizable to myself.”

  1. Hi Andrew,

    I think what’s gotten lost in all this is where Gilbert et al. and Nosek et al. (OSF) substantially agree and disagree. In correspondence on Twitter, I’ve learned that Dan Gilbert at least agrees that the OSF project shows that many of the targeted studies do not replicate. There are some questions about how faithful the individual replications are, but there’s not much disagreement about that main point. I think OSF agree with that reading.

    OSF also argue, however, that they have an estimate of the reproducibility of psychology papers in general — their title is “Estimating the reproducibility of psychological science.” The main argument from Gilbert et al. is that, since OSF did not take a random sample of psych papers, they do not have an estimate of the reproducibility rate of the typical psych paper. I agree with Gilbert et al. on this, and I haven’t seen any serious counterarguments.

    • Dan:

      To play the Devil’s advocate I’ll say the following: Gilbert et al., in their latest reply, say that generalizing from a non-random sample is wrong and that this is “inarguable”. Now, I agree there are big problems with this type of generalization but the Devil’s advocate in me disagrees it is “inarguable”.

      1. Almost all information is informative to some extent. A silly example: take the claim “the population of swans is white.” I take a convenience sample from Australia and find 90% of swans are black. Surely I have learned something about the population. Namely, that the proportion of white swans is less than 100 percent. (I would be wrong to conclude that only 10 percent of swans are white, but correct to conclude the proportion is less than 100 percent.) Admittedly this is an extreme example but it is an argument.

      2. Random sampling is sufficient but not necessary for inference about a population. Suppose we are studying the impact of aspirin on headaches. Suppose eye color has no impact whatsoever on the effect of aspirin on headaches. Suppose we study this impact using a convenience sample of people with blue eyes. Given our assumptions the selection is uninformative. The convenience sample is as good as random for studying this causal relation — no statistical adjustment needed (a tiny simulation along these lines is sketched just below). Again, this is an extreme example based on very strong assumptions but it is an argument.
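      For concreteness, here is a throwaway simulation of that second point. All names and numbers are made up for illustration; the only thing it shows is that selecting on a variable assumed to be unrelated to the effect does not bias the estimate.

```python
# Simulated "population": aspirin shortens a headache by 2 hours for everyone,
# and eye color is (by assumption) unrelated to that effect.
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
eye_blue = rng.random(N) < 0.3               # eye color, unrelated to the effect
headache_hours = rng.normal(6.0, 1.0, N)     # baseline outcome
aspirin = rng.random(N) < 0.5                # randomized treatment
headache_hours -= 2.0 * aspirin              # true effect: -2 hours for everyone

def effect(mask):
    """Mean outcome difference, treated minus untreated, within the sample `mask`."""
    return headache_hours[mask & aspirin].mean() - headache_hours[mask & ~aspirin].mean()

print("full random sample estimate:", round(effect(np.ones(N, dtype=bool)), 2))
print("blue-eyes-only estimate:    ", round(effect(eye_blue), 2))
# Both print roughly -2.0: under the stated (strong) assumption, the
# convenience sample recovers the same causal effect as the full sample.
```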

      Heuristically, I feel these sorts of reasons explain why psychologists, of all people, are happy building a massive empirical edifice on convenience samples of psychology students. Either this, or we throw all these studies into the dustbin of history alongside Nosek et al., as Gilbert et al. seem to propose (or, at the very least, say the edifice is only about a bunch of students at a particular place and time, in which case I want my tax dollars back).

      Moreover, as a scientist, and as a matter of principle, I dislike absolute terms like “inarguable”.

      PS Gilbert et al. also claim the differences in “fidelity” are “inarguable”. Sure, there are differences, but the question is whether they are relevant. Evidently, this is arguable.

    • The sampling procedure is well specified (for a high-level view, see page 10 of https://osf.io/9h47z/ ), and while it is not uniformly random, it is possible to construct alternative weighting schemes to get a sense of the sensitivity to this.
      However, unless there is an identified potential source of bias in the sampling frame, I don’t see why not to use the uniform prior. I have not heard any constructive proposals in this regard: specify what such a bias might be and then look at the data and try to estimate it. Simply stating that the sampling is not uniform is not a comment that seems worth responding to in much detail.
      Having attempted to help in estimating the effects in replications that were about to drop out of the sample frame due to a difficulty in reproducing the analysis methods, I saw no clear reason to expect the bias to go one way or another in the examples I encountered. There are of course sources of such biases that have not been considered, and it would be very interesting to explore this.
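      For what it’s worth, the sensitivity check mentioned above could look something like the rough sketch below. The numbers are entirely invented (this is not the OSC data); the point is just to compare the unweighted replication rate with rates under alternative hypothetical weighting schemes.

```python
# Compare an unweighted replication rate with a reweighted one under a
# hypothetical alternative weighting scheme (all numbers invented).
import numpy as np

rng = np.random.default_rng(2)
n_studies = 100
replicated = rng.random(n_studies) < 0.36   # made-up success indicators

# Hypothetical weights, e.g. up-weighting studies that were less likely
# to make it into the sample frame.
weights = rng.uniform(0.5, 2.0, n_studies)

unweighted = replicated.mean()
reweighted = np.average(replicated, weights=weights)
print(f"unweighted rate: {unweighted:.2f}, reweighted rate: {reweighted:.2f}")
# If plausible weighting schemes move the rate a lot, the non-uniform sampling
# matters; if not, it probably doesn't.
```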

      • Hi Nicolas,

        You’re right. I misunderstood the sampling procedure. And though it’s not literally random, I agree with you that it seems close to uniform, and so the study probably gives a reasonable estimate of the reproducibility of papers published in the targeted journals in 2008. (I had thought that the OSC selected the most egregious studies for replication, but that’s not true at all.)

  2. Part of this problem is the lack of external stakeholders. A lot of these papers evoke grimaces and wariness when read by smart professionals *outside* the field.

    But so long as the people who sit on review committees, attend seminars, and make funding and publication decisions are all substantially “insiders,” it gets harder to notice that things are wrong because: (a) you see them every day and internalize them as routine, and (b) you have a conflict of interest.

    Not saying Psych is unique in having this “insider problem,” but I feel it is one of the reasons why things got so bad in Psych.

    • +1
      Do you agree that ‘External stakeholders’ might include patients, families of patients, or caregivers in a hospital? Who else? Commodity consumers?

  3. Also, Manski should be on everyone’s reading list. If half the wannabe-Josh-from-The-West-Wing policy people slaving away in government departments who read “The Signal and the Noise” had read Manski instead, we’d be in a very different world.

    And while I’m here, here’s a fine example of getting absolutely worthless exploratory speculation published in a major journal by being cutesy: https://twitter.com/robertstats/status/705464901611945984

  4. “I’m about as tired of writing about all this as you are of reading about it.”

    I’m very, very glad that you are beating the drums on this topic. Seems the scientific revolution isn’t finished yet — we’re still learning how to do science properly under some difficult circumstances. Thanks to people like you I feel optimistic that we’ll get there. A hundred years from now people may look back at this time as a crucial period in the development of the scientific method.

  5. (that was from Harvard’s press release; I was going to say I’d expect better from that institution where I’ve studied and taught, but it’s not fair to blame the journalist who wrote the press release; he was just doing his job)

    Just following orders as he lit up the ovens at Auschwitz–not fair to blame that individual. For future reference, someone who puts out false information is not a journalist but rather a PR flack (Columbia’s school of journalism would not be happy with you). Why an academic institution would feel the need to employ such a person raises questions about that institution’s adherence to its motto–Veritas should perhaps be changed to Plausibilis, or maybe Non Refuto.

    • Numeric,

      Generally I agree with you (see here) but in this case I’d give the P.R. guy a break. The question was technical, and I assume the writer of the press release was naively thinking that two tenured professors at Harvard must know what they’re talking about on a technical matter.

      • Maybe. Reminds me of the story of a man sitting in a remote railway station and asking the stationmaster whether the train he was waiting for was on time. The stationmaster said it was. Time passed and it was clear that the train was quite late, so he accosted the stationmaster and asked him why he had told him the train was on time when it was clearly late. The stationmaster looked at him and replied “Son, I’m not here to knock the railroad”.

        “It is difficult to get a man to understand something, when his salary depends upon his not understanding it!”–Upton Sinclair

        • Numeric:

          Yes, that sounds about right—but in this case I really doubt the writer of the press release has any idea he made a mistake. He’s probably not reading this blog, and even if someone does point him here, I doubt he has the inclination or training to evaluate the claim. At best, he’d go back to Gilbert et al. and ask whassup, and since they don’t seem to understand the statistics, it’s all hopeless.

          But, sure, this guy also has no motivation to learn that he made a mistake. So I suspect he’ll remain in blissful ignorance. To map it to your story, he has no idea if the train is late, or if it came early, or if it came on time. He has no idea where the track is. He’s sitting in an office 10 miles away, filing reports.

  6. In some ways, I feel as though Nosek et al. and Gilbert et al. are talking past each other. They both acknowledge that there are serious issues in psychology (and let’s be honest, other scientific fields, too). The fight over the “true” estimate of replicability in psychology seems to detract from the take-home message that Andrew and others have been making for some time about the importance of measurement, problems with p-values and the garden of forking paths, etc.

    Nosek et al.’s major contribution is raising awareness of the difficulties they faced in replicating results from popular (some might even say famous) published studies, as well as making replication a more accepted pursuit (instead of relegating it to a graduate-level assignment). Gilbert et al. are right that some failed replications are insufficient evidence of a replicability crisis (there could be one, we just don’t know with the available data).

    In short, let’s stay focused on the key issue at stake–how to make sense of what the available data is telling us about phenomena in psychology, politics, sociology, medicine…or whatever we’re interested in understanding.

    • Thanks for this comment.

      A quick read of Gilbert et al’s point 3 in their March 7 reply about challenges/problems of evaluating replication suggested to me “we just don’t know with the available data”.

      But then there was this comment from Andrew “he’d go back to Gilbert et al. and ask whassup, and since they don’t seem to understand the statistics, it’s all hopeless.”

      Now, I will have to decide how much time I want to invest in this “talking past each other” …

      I do worry about the conflating of several questions: is there consistency of evidence across the two studies (e.g., prior probabilities re-weighted similarly, or ranks of likelihoods similar)? Was the initial author correct in their assessment of uncertainty, or in the decision to draw others’ attention to the finding (what level of CI overlap is enough?)? Is the result generalizable enough over a given degree of change in fidelity (assessed by what?)? And what is an acceptable level of risk of truly false claims?

  7. What problems does Gilbert acknowledge? He is claiming a 100% replicability rate, which is mind-blowingly ludicrous. It is also not possible to both have an abuse of p-value hypothesis testing AND not have a problem with replication at a far greater rate than Gilbert is claiming (which, again, he claims is ZERO). It is one thing to point out problems with the replication tests; it is another to engage in an utter denial of reality.

    • And the problem is exacerbated when someone attempts to convince everyone else that their pink skies and blue ponies are real and that logical reasoning and evidence be damned.

      • Look, I agree that a claim of 100% replicability is silly and not in keeping with what we know about sampling from a population. I don’t have access to the press release (the server is down at the moment), but Gilbert et al. don’t claim a failure rate of zero in their Science article or their post-publication response.

        That being said, the issue isn’t really about replication per se, anyway. What difference does it make if some of the junk science studies that Andrew regularly blogs about were replicable? Would that make us feel better? Probably not. Instead, what we really want is triangulation–different methods and resulting data that point to the same general conclusion. If we design sound studies to test some of these “junk” studies, I don’t necessarily care that the exact same procedures are followed, just that best practices are in place in terms of measurement, stimulus materials, samples, and analysis. I’m fairly confident that most of these studies would fail on this point.

    • re: 100% claim

      The press release pdf is here
      http://projects.iq.harvard.edu/files/psychology-replications/files/harvard_press_release.pdf?m=1456973687

      The summary section of the press release, attributed to a Harvard staff writer, says “the study actually shows that the replication rate in psychology is quite high – indeed, it is statistically indistinguishable from 100%”

      Which I suspect is the result of a flawed attempt to summarize this segment from the longer press release text: “The methods of many of the replication studies turn out to be remarkably different from the originals and, according to Gilbert, King, Pettigrew, and Wilson, these “infidelities” had two important consequences. First, they introduced statistical error into the data which led the OSC to significantly underestimate how many of their replications should have failed by chance alone. When this error is taken into account, the number of failures in their data is no greater than one would expect if all 100 of the original findings had been true.”

      Now if the researchers did read and OK the press release before it was released, then this falls back on them. (If I were the researcher and knew the study was bound to get a lot of eyeballs, I would certainly insist on fact-checking the press release from my own university before release.)

        • Do they intentionally misrepresent Nosek et al.’s arguments, evidence, and positions? Or are they just ignorant of their own lack of understanding of statistics?

  8. Here’s a question:

    Some social science studies don’t replicate because the purported effect never existed. Other studies don’t replicate because human behavior sometimes varies over time and/or space.

    What are some clues for distinguishing the former from the latter?

  9. “P.P.S. I just noticed that the Harvard press release featuring Gilbert et al. also says that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.””

    If the sample mean is 90% and the standard error is 20% and we use a linear model, then the replication rate would be statistically indistinguishable even from 125%!
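    Spelling out that arithmetic (the 90% point estimate and 20% standard error are hypothetical numbers, not anything from the actual study):

```python
# 95% normal-approximation interval for a made-up estimate of 90% with SE 20%.
est, se = 0.90, 0.20
lo, hi = est - 1.96 * se, est + 1.96 * se
print(f"95% CI: [{lo:.1%}, {hi:.1%}]")   # [50.8%, 129.2%]
# The interval contains 100% -- and also 125%, and other impossible values,
# which is why "statistically indistinguishable from 100%" is not reassuring.
```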

  10. The economists appear to be getting on the replication bandwagon (http://www.economist.com/news/science-and-technology/21693904-microeconomists-claims-be-doing-real-science-turn-out-be-true-far)

    “But as economics adopts the experimental procedures of the natural sciences, it might also suffer from their drawbacks. In a paper just published in Science, Colin Camerer of the California Institute of Technology and a group of colleagues from universities around the world decided to check. They repeated 18 laboratory experiments in economics whose results had been published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014.

    For 11 of the 18 papers (ie, 61% of them) Dr Camerer and his colleagues found a broadly similar effect to whatever the original authors had reported. That is below the 92% replication rate they would have expected had all the original studies been as statistically robust as the authors claimed—but by the standards of medicine, psychology and genetics it is still impressive.”

    But what qualifies as “a broadly similar effect”?

  11. A partial answer from the abstract to the Science article:

    “We find a significant effect in the same direction as the original study for 11 replications (61%); on average the replicated effect size is 66% of the original. The reproducibility rate varies between 67% and 78% for four additional reproducibility indicators, including a prediction market measure of peer beliefs.”

  12. This whole business is a nightmarish joke.

    If psychological science wants to be a science, its work needs to be quantifiable (at least in the Galilean view of science that persists to this day; some views of “science” fall outside that orthodoxy).

    In what sense do most (not all) scientific constructs submit to quantification? To use Gilbert’s cash cow, what is “happiness”? How is happiness quantified (forget how it is conceptualized; I will not live long enough to cover that topic!)? What are the “units” of happiness (assuming the construct has a clear sense and a clear reference)? Where do those units fall on, say, Stevens’s (1946) scales of measurement? Do two units of happiness sum to a mathematically meaningful and computationally tractable (interval? ratio?) “two” on a happiness scale (LOL)?

    Read Michell (1999) to see why the psychological use of mathematical description is a professional embarrassment. We do not have concepts; we have marks on Likert scales. And we have not even touched the logically prior (note: misuse of the word “prior”) issue of what happiness IS. What (and where) is the substantive theory that unites happiness behavior(s) into a conceptually meaningful category? Is it a natural kind or a psychological kind (e.g., Danziger, 1997)? Etc., etc., etc.

    Until these issues are addressed (or even recognized as needing sustained attention) we will continue to be awash in “data” lacking a clear role in the scientific process.

    It is embarrassing to be associated with a field that insists on being taken as a science when that claim largely consists in the facile equivalence of “method = science” — thereby conflating necessity with sufficiency.

  13. I am a typo generator. Strike the second use of the word “that” from the first parenthetical comment in the previous post. I am reasonably sure I missed others.

      • Not to be a total dork — but what is an FB page? Facebook?

        I do not have or use that social app (or any, for that matter — save email and occasional [i.e., this one] posts), so I am unsure what exactly FB refers to.

        I don’t care where you post it. I take guidance here from a great line from the movie Mr. Majestyk. In one scene, Charles Bronson (Majestyk) meets the villain Frank Renda (Sollozzo from The Godfather) in a diner. Renda has been abusing Bronson the entire film (over the rights to grow and sell melons!). Bronson looks at Renda who is sitting at a table in the diner. Bronson suddenly slugs him in the mouth and says “No use trying to get on your good side”.

        That line pretty much sums up my feelings about academic psychology.

        • Yes. I consider it an informed attribute. But my family — particularly my wife — fails to find any wisdom in my ways. This is exacerbated by my refusal to get a cell phone. Luddite through and through.
