Well, today we find our heroes flying along smoothly…

Posted on September 24, 2024 12:41 PM by Jessica Hullman

This is Jessica. I hadn’t planned to be down on open science research again so soon, but I seem to keep finding myself presented with messes associated with it. After a 7+ month investigation instigated by a Matters Arising critique by Bak-Coleman and Devezer, Nature Human Behaviour retracted the “feel-good open science story” paper “High replicability of newly discovered social-behavioural findings is achievable” by Protzko et al. From the retraction notice:

The concerns relate to lack of transparency and misstatement of the hypotheses and predictions the reported meta-study was designed to test; lack of preregistration for measures and analyses supporting the titular claim (against statements asserting preregistration in the published article); selection of outcome measures and analyses with knowledge of the data; and incomplete reporting of data and analyses.

This is obviously not a good look for open science. The paper’s authors include the Executive Director of the Center for Open Science, who has consistently advocated for preregistration because authors pass off exploratory hypotheses as confirmatory. Another author is a member of the Data Colada team that has outed others’ questionable research transgressions and helped popularize the ideas that selective reporting and harking threaten the validity of claimed results in psych.

I once thought I did know all about it

If seeing this paper retracted makes you uncomfortable, I don’t blame you. It makes me uncomfortable too. My views on mainstream open science research and advocacy were much more positive a year ago before I encountered all this.

As a full disclosure, late in the investigation I was asked to be a reviewer, probably because I’d shown interest by blogging about it. Initially it was the extreme irony of this situation that made me take notice, but after I started looking through the files myself I felt compelled to post about all that was not adding up. When asked to officially participate in the investigation, I agreed but with some major hesitation. I knew that to be comfortable weighing in on the question of retraction, I’d want to think through many possible defenses for how the paper presents its points. That would mean spending more time, beyond what I’d already spent going through the OSF to write one of my blog posts, to sort through the paper’s arguments and consider whether they could possibly hold up. None of this is at all connected to my main gig in computer science.

But ultimately I said yes out of a sense of duty, figuring that as an outsider to this community with no real alliances with the open science movement or any of the authors involved, it would be relatively easy for me to be honest.

The final version of the Matters Arising, now published by the journal, summarizes a number of core issues: the lack of justification, given the study design and missing pre-registration, for implying a causal relationship or even discussing an association between rigor-enhancing practices and the replicability rate the authors observe; the inconsistencies between the replicability definition and those in the literature; the over-interpretation of the statistical power estimate, etc. Hard to get beyond this barrage of points.

Since the rain falls, the wind it blows, and the sun shines

What’s funny though is that I somehow still had sort of expected this to be a difficult call. Maybe I was susceptible to the tendency to want to give such esteemed authors, several of whom have done some work I really respect, the benefit of the doubt. I was obviously aware going into the investigation of the lack of preregistration for the main analyses that they claimed to have preregistered. But I tried to have an open enough mind that I wouldn’t miss any possible value that the paper could still have for readers despite that flaw.

Unfortunately, as I re-read the Protzko et al. paper to consider what, if anything, one could learn about the role of rigor-enhancing practices in their results, I quickly found myself unable to resolve a fundamental issue related to how they establish that the replication rate they observe is high in the first place. The reference set of effects they mean when they use terms like “original discoveries” is not consistent throughout the paper, including in their calculations of expected power and replicability, which they use to establish their claim of “high replicability.” Sometimes these terms refer to effects from the pilots and sometimes to effects from the confirmatory studies. As a result of the way the authors set up their claims, referring to rigor-enhancing practices characterizing the whole process, they would need the rigor-enhancing practices to apply to both the confirmatory studies and the pilot studies.

But the paper text and other materials contradict themselves about how the practices apply across these two sets of studies. For example, I spent some time looking for the pilot preregistrations (which the paper also claims exist), but found only a handful, suggesting that the paper also can’t back up its claims about preregistration there. Given this contradiction between what they say about their design (and the lack of info on the pilots) and the logic they set up to make one of their central points, I didn’t see how the paper could redeem itself, even if we decide to be optimistic about the other issues. Retraction was clearly the right decision. You can read some comments related to what I wrote in my review here.

What I still don’t get is how the authors felt okay about the final product. I encourage you to try reading the paper yourself. Figuring out how to pull an open science win out of the evidence they had required someone to put some real effort into massaging the mess of details into a story. It was frustrating as a reader of the paper trying to match the reported values to the set of effects or processes they used. The term bait-and-switch came to mind multiple times as I tried to trace the claims back to the data. Reading an academic paper (especially one advocating for the importance of rigor) shouldn’t remind one of witnessing a con, but the more time I spent with the paper, the more I was left with that impression. It’s worth noting that the lack of sufficient detail about the pilots was brought up at length in Tal Yarkoni’s review of the original submission, as well as Malte Elson’s review for NHB. The authors were made aware of these issues, and made a choice not to be up front about what happened there.

It is true that everyone makes mistakes, and I would bet that most professors or researchers can relate to having been involved in a paper where the story just doesn’t come together the way it needs to, e.g., because you realized things along the way about the limitations of how you set up the problem for saying much about anything. Sometimes these papers do get published, because some subset of the authors convinces themselves the problems aren’t that big. And sometimes even when one sees the problems, it’s hard to back out for peer pressure reasons.

But even then, there’s still a difference between finding oneself in such a situation and crowing all over the place about the paper as if it is a piece of work that delivers some valuable truth. What’s puzzled me from the start is that this paper was not only published, it was widely shared by the authors as a kind of victory lap for open science.

Don’t you know that your creator is running out of ideas

So while I came into this whole experience relatively open-minded about open science, my views have been colored less positively after learning about this paper and seeing certain other open science advocates defend it. I personally stopped seeing the value of most behavioral experiments a few years ago, because I could no longer get beyond the chasm between the inferences we want to draw and the processes we are limited to when we design them. But I guess I interpreted this as more of a personal tic. Preregistration, open data and methods, better power analysis etc. practices might not be enough to make me feel excited about behavioral experiments, but I assumed that the work open science advocates were doing to encourage these practices was doing some good. I hadn’t really considered that open science could be doing harm, beyond maybe encouraging a different set of rigor signalling games.

This experience has changed my view, from “live and let live if people find it helpful” to “this is not helpful,” given that producing evidence to change policy (or logical justifications presented as sufficient for policy without empirical evidence) appears to be a goal of open science research like this. Preregister if you find it helpful. Make your materials open because you should. But don’t expect these practices to transform your results into solid science, and don’t trust people that try to tell you it’s as easy as adopting a few simple rituals. I’m now doubtful that the flurry of research on fixing the so-called replication crisis is truly interested in engaging deeply with concepts like statistical power or replicability. I’m left wondering how many other empirical pro-open science papers are rhetorical feats to “keep up the momentum” regardless of what can actually be concluded from the data.

P.S. On a lighter note related to the title of this post (or not so light if you remember how the quote ends), remember Rocky and Bullwinkle? My dad used to always try to get us to watch re-runs when they came on TV. The other references in the post (also from my dad’s era) are from a Bert Jansch song.

34 thoughts on “Well, today we find our heroes flying along smoothly…”

Andrew on September 24, 2024 at 1:51 pm said:

Jessica:

Thanks for posting. Often this sort of story comes up but we don’t learn what happens next. I have a few thoughts.

1. My general problem with retractions is that they are done so rarely, and they involve so much process with so many bottlenecks, that I don’t see them as anything like a scalable solution to problems of scientific publication. That’s why a few years ago I wrote No Retractions, Only Corrections: A manifesto. In this case, you were able to achieve a reaction, from some combination of you and your colleagues putting in a lot of work, the authors of the paper being reasonable and not fighting it, and something special about this paper, which was in the area of science reform.

2. A related point: You mention some problems with the paper in question, including “the lack of justification, given the study design and missing pre-registration, for implying a causal relationship . . . inconsistencies between the replicability definition and those in the literature . . . over-interpretation of the statistical power estimate . . . contradiction between what they say about their design (and the lack of info on the pilots) and the logic they set up to make one of their central points.” These all seem like problems! Also they’re problems in lots and lots of published research. To put it another way, if lack of justification of causal claims, missing preregistration, inconsistencies, overinterpretation of statistical power calculations, and internal contradictions were enough to get a paper retracted, then they’d be retracting zillions of papers every month. I’m not saying this paper should not have been retracted, just that it’s being held to a higher standard, which, I dunno, maybe fair enough, you have to start somewhere.

3. You write, “What’s puzzled me from the start is that this paper was not only published, it was widely shared by the authors as a kind of victory lap.” This sort of thing continues to annoy me but I wouldn’t say it puzzles me anymore. I’ve just seen it happen over and over. An example last year was that hopelessly flawed GIGO nudge meta-analysis (or, for more formal critiques, see here and here) which was published in PNAS (I guess Psychological Science wasn’t available) and then was promoted by proponents of nudging in one of their victory laps. That paper was never retracted—some minor corrections were made, which is better than nothing, but the result still made strong claims that were not supported by the evidence. Again, this is just one of many many examples. It seems to me that the usual attitude is that if a paper is published by a legit journal, that its findings are considered to be victory-lap-worthy. Annoying but unfortunately no longer puzzling.

4. Regarding your statement, “Preregister if you find it helpful. Make your materials open because you should. But don’t expect these practices to transform your results into solid science”: I agree, indeed I have a post scheduled about why I like preregistration and it’s not about p-values. One way to say this is that preregistration is like many tools for methodological rigor: They can make good science better, but they don’t turn bad science into good science. We could say the same thing about, say, Bayesian inference. Bayesian inference can be great too—there’s a reason I wrote a book about it!—and it can help you quantify uncertainty and extract more information from data, but it can’t turn bad science into good science. If you were to do a Bayesian analysis of some bit of junk science like the ovulation and voting or himmicanes or beauty and sex ratio or whatever, at best the analysis could tell you to give up, that you don’t have the right data to answer your question. It would not be a machine to produce scientific results. Similarly with preregistration, which can protect us from some bad things but does not in itself produce science.

5. This reminds me of something that we’ve discussed before, that the science reform movement promises different things to different people. At some level this sort of thing is inevitable in a world where different people have different goals but still want to work together. For some people such as me (and maybe you, in the past), the science reform movement is a plus because it opens up a space for criticism in science, not just in theory but actual criticism of published claims, including those made by prominent people and supported by powerful institutions. For others, I think the science reform movement has been viewed as a way to make science more replicable. The problem, as we’ve discussed many times (for example, Honesty and Transparency are not Enough), is that doing a preregistration or avoiding questionable research practices do not in themselves create good science; what they do is to make it more difficult to get apparently strong results from bad science, and so they should improve science indirectly, by reducing the motivation to run sloppy studies. But I fear that science reform has sometimes been sold or perceived as a set of steps that will allow you to do solid science without reassessing the problems with theory and measurement that led to the replication crisis in the first place.

6. Defenders of preregistration have responded to the above points by saying something like, “Sure, preregistration will not alone fix science. It’s not intended to. It’s a specific tool that solves specific problems.” Fair enough. I just think that a lot of confusion remains on this point; indeed, my own reasons for preregistration in my own work are not quite the reasons that science reformers talk about. As with many tools (including Bayesian inference, and statistical analysis more generally), a tool can have many benefits.

7. Regarding Rocky and Bullwinkle references on the blog, see here. And let me assure you that these cartoons remain appealing to the zoomer generation.

- Jessica Hullman on September 24, 2024 at 2:10 pm said:
  
  I agree with your points.
  
  >In this case, you were able to achieve a reaction, from some combination of you and your colleagues putting in a lot of work, the authors of the paper being reasonable and not fighting it, and something special about this paper, which was in the area of science reform.
  
  >I’m not saying this paper should not have been retracted, just that it’s being held to a higher standard, which, I dunno, maybe fair enough, you have to start somewhere.
  
  I agree – retraction is rare, not a scalable solution, and this paper got a lot more scrutiny than your average sloppy paper.
  
  Did it deserve to be singled out, given that we probably aren’t going to ever see retraction become mainstream? I guess that’s up to everyone to decide for themselves. For me personally, before becoming so familiar with all the details, I think I might have said no. But knowing what I know now (including that some of the core issues had been brought to the authors’ attention years ago), I don’t think singling this paper out is unjust.
  
  Oh, and long live Rocky and Bullwinkle I guess!
  
  - Anonymous on September 24, 2024 at 3:19 pm said:
    
    I agree with most of the criticisms but it makes me feel quite uncomfortable that many issues raised during the review process (and deemed not important enough by the journal back then) are subsequently used by that same journal as reason for retraction. You focus on those issues having been brought to the authors’ attention, but they were equally brought to the journal’s attention.
    
    - Jessica Hullman on September 24, 2024 at 8:31 pm said:
      
      Yes, this is a great point.
    - Anonymous on September 25, 2024 at 6:16 am said:
      
      There is this paper if I’m not mistaken that resubmitted previously published papers to the same journal years later or something like that. I think they found that a substantial percentage was rejected.
      
      I am not sure if I remember it all correctly, but the point I am trying to make is that there may be a large subjective and/or personal and/or chance part in/of peer-review (which is also why I don’t agree with and/or like it).
Jacob on September 24, 2024 at 2:33 pm said:

I see some folks with long publication lists in which the modal publication has a very long author list. Makes me feel jealous! I have a relatively shorter pub list with relatively short author lists. I tend to think that for career advancement, while most agree in principle that 1 solo-authored paper != 1 large collaboration paper, the first impression left by a long publication list on the CV looms large even as more rational thinking later on an article-by-article basis may involve some discounting based on quality, number of authors, and so on.

But then I see a paper like this where there are some significant problems and 17 authors. Who is doing what? I wonder whether some authors are included less for their work on the overall argument and writing in the manuscript, but because they basically donated data to the effort. I’m not saying that’s wrong, but I wonder if you’re publishing a gazillion things and you got a free ride on this study headed to NHB, you just don’t think that hard about what was really preregistered and so on. Who wants to be the co-author who tells 16 others that we’ve got to slam the brakes on this thing? Seems very easy for a somewhat-innocent bystander to get caught up in some iffy scientific outputs unless you’re detail-oriented and not afraid to rock the boat.

Unrelated, but I also wonder whether this was an adversarial collaboration and the failure to make that clear to readers contributed to this messy end result. Bulk of the author list are science reformers who were inspired by their desire to take down Bem’s parapsych stuff and saw this as another opportunity to show how good practices are the antidote to beliefs in the supernatural. That story either wasn’t supported by the parapsych guy or proved too complicated to tell, so they note that their studies seem to replicate well. (To be honest, the procedures themselves are sufficiently complicated that I struggle to follow along at times, especially given that part of the procedure proved unrelated to the claims made in the paper.)

Also, since when were exploratory pilot studies expected to be preregistered? Maybe I missed the boat on that one. I see the claim in the paper that they had this requirement of the pilots, but I didn’t think this was standard operating procedure in most labs as they imply.

Joe on September 24, 2024 at 3:40 pm said:

I don’t think I’ve ever heard of the authors of a retracted paper being invited to revise and resubmit, but that’s apparently what the journal asked them to do. Any thoughts? Their plans (including a statement that seems to suggest they are fighting at least some of the objections) are on osf:

https://osf.io/4k5sf

- Jessica Hullman on September 24, 2024 at 5:09 pm said:
  
  Some people prefer to never say never I guess.
  
  “Also, our findings spurred interest in what occurred in the pilot phase of the project. We are likewise quite interested to learn more about the pilot phase.”
  
  The need to disclose what happened during those pilots was brought up literally years ago by Tal Yarkoni in a pretty damning review. Now they are interested? I guess they just needed a retraction as that extra little push.
  
  - HP on September 25, 2024 at 12:32 pm said:
    
    Is Tal Yarkoni’s full review of Protzko et al available publicly? I believe you quoted from it in a previous post, but perhaps that was a different, public critique Yarkoni made.
    
    - Jessica Hullman on September 25, 2024 at 12:53 pm said:
      
      https://osf.io/rnvxk
      See manuscript/Decline Effect Appeal Letterfinal.docx
- Anonymous on September 24, 2024 at 5:46 pm said:
  
  It makes little sense because if they could have cleared these concerns you would think they would have had enough time to do so. Seems more likely they’re being allowed to submit the decline effect study as planned and framing it as an R&R.
  
- Berna Devezer on September 25, 2024 at 6:39 pm said:
  
  That’s not what the journal asked the authors to do. Rather, the editors left the door open for them to write a new paper, I imagine one which they originally designed the study for, that is, a decline effect paper. This is a clear case of retraction. Not a revise and resubmit. The authors know this very well as well. It was a show of good faith by the editors to invite a “new submission,” not an apology for the retraction. The authors’ letter misrepresents the facts of the retraction and the results of the investigation. This was not a case of an “embarrassing mistake”. The inferences did not match the study design. The original intent of the study was obscured in the paper. Certain results (that did not support the flashy conclusions) were either held back or relegated to the supplements. And more. All of this can be verified if anyone is willing to take the time to sort through the project documentation on OSF. Our MA and Jessica’s posts as well as Stephanie Lee’s piece on The Chronicle should make all of this clear enough, regardless of the authors’ version of the story.
  
  - Jessica Hullman on September 25, 2024 at 8:12 pm said:
    
    Thank you Berna. This is indeed very important to understand about this case. The level of misrepresentation that led to the final manuscript is not the kind of thing you do “by accident.”
    
  - Anonymous on September 25, 2024 at 11:38 pm said:
    
    Quote from above: “The original intent of the study was obscured in the paper.”
    
    This quote reminded me of a discussion back in 2013 on some google group where I posted an idea about pilot studies and replication.
    
    I later posted a more detailed version of this general idea on this blog in 2017 (see comment about this somewhere else on this page).
    
    I think Mr. Nosek might be talking about the project where all the fuss is now about when he replied to my idea and post there but I am not sure. He mentions Schooler there, who is a co-author on this now retracted paper if I am understanding it all correctly. I am linking to that discussion on the google group in case it might be something related to the project and might provide some possibly relevant and useful information for those trying to find out what happened exactly:
    
    https://groups.google.com/g/openscienceframework/c/2nhHMdGGhrw
    
    - Anonymous on September 26, 2024 at 1:31 am said:
      
      Quote from above: “This quote reminded me of a discussion back in 2013 on some google group where I posted an idea about pilot studies and replication.”
      
      It’s not about pilot studies per se, but about the more general idea of small groups of researchers replicating each other’s work. In that 2013 post I thought about creating some sort of new “starting” point given the replication crisis and related problematic issues by having researchers come up with a top 5 of studies to “start” with.
      
      After that follow-up studies, possibly based on pilot studies, by each researcher would then be replicated by all the groups, and you could do this multiple times to create “rounds” of theory testing and (re-)formulation. I think that’s a more accurate description of that 2013 post.
      
      I jumped back and forth between the 2013 and 2017 post, and in the latter the word “pilot study” was used and I think that stuck in my head when writing the comment about the 2013 post.
      
      Anyway, that should teach me to not post something before my 1st cup of coffee in the morning. I am having that one as we speak, so let’s hope my descriptions are more accurate now. But please possibly verify and read things yourself when interested.
    - Andrew on September 26, 2024 at 7:43 am said:
      
      Anon:
      
      As I recall from this Stanford conference I went to a few years ago, Jonathan Schooler is ESP-curious. That doesn’t mean he can’t do good science—Alan Turing was himself an ESP believer, once upon a time!—but for someone to be ESP-curious in this day and age suggests that, at the very least, he has a view of scientific evidence that is different from that of most scientists and most readers of this blog.
    - Anonymous on September 26, 2024 at 8:40 am said:
      
      Quote from above: “As I recall from this Stanford conference I went to a few years ago, Jonathan Schooler is ESP-curious.”
      
      What’s up with all these ESP psychologists!? First we had Bem in 2011 with “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect” and now this mess!?
      
      Perhaps Nosek and Schooler and their co-authors could write a paper about it all, maybe even referring back to their own now retracted paper or using it as an example, titled: “Retrospective Pre-Registration For Prospective Replication” or something like that…
    - Andrew on September 26, 2024 at 9:30 am said:
      
      Anon:
      
      Selection. I’m pretty sure that most psychologists are not ESP-curious. The rare ones who are get attention.
    - Jessica Hullman on September 26, 2024 at 11:16 am said:
      
      Hi Anonymous,
      
      Thanks for sharing the google group post. It wouldn’t surprise me if among them the authors had different perspectives on what they were doing.
      
      What is important to note about the paper they published is that they make claims about the entire process using rigor-enhancing practices, including the pilots. From the paper:
      
      “Each lab engaged in pilot testing of new effects based on their laboratory’s business-as-usual practices. These practices could involve collecting data with different sample providers and with any sample size the lab saw fit. All pilots were required to have their materials, procedure, hypotheses, analysis plan and exclusions preregistered prior to data collection.”
      
      The Operations Manual provided to all authors similarly makes clear that preregistration of pilots is expected: “A study must first be registered with OSF before any data collection can begin (including pilot studies).”
      
      I suspect they claim the rigor-enhancing practices apply to the entire process in their manuscript because otherwise, they can only say something like “if you use rigor-enhancing practices to do a high powered confirmation study, followed by high powered replications, the replication rate is relatively high.” How would this be a generalizable scientific contribution? All it tells us is that some psychologists somewhere can identify some real effects. But who really believes that it is impossible for any psychology research to ever identify any real effect?
      
      What we really want to know then is how these effects were discovered, so we can know how much we should expect this result to generalize. The authors claim that the effects were identified using rigor-enhancing practices like preregistration. This makes the paper’s message seem simple and important: just follow these practices, and you don’t need to worry about your results not replicating.
      
      But there is lots of contradictory evidence about how these pilots were done. Like I say above in my post, I couldn’t find most of these pilot preregistrations they claim exist. The authors have been far from forthcoming in making that info available, despite prior reviewers pointing out how important it is to understand their results.
      
      And then the earliest version of the manuscript (Deciphering the Decline Effect P6_JEP.docx) also describes the piloting process differently:
      
      “It was up to the original labs if they wanted to pre-register all aspects of all pilot studies.”
      
      And Nosek himself implied recently on social media that it was up to the lab how the pilot effects were chosen: https://bsky.app/profile/briannosek.bsky.social/post/3kpcowih4cb2i
      
      So which is it? If the piloting process allowed for p-hacking and they were honest about this, that would be awkward, as they’d be arguing that as long as you do a high powered confirmation study of your p-hacked result at some point, you’re golden. It makes no sense.
      
      So there must be something different about these pilot effects, and it does not appear that the key difference is rigor-enhancing practices. So … these effects are special in some other way, perhaps because they were chosen knowing that they would later be replicated. The authors try to suggest this isn’t the case with a post-hoc survey, but it’s not convincing.
      
      This is why it’s irritating to see the authors claiming that the paper was retracted simply because they mistakenly said they preregistered a couple of the main analyses involving replicability. There’s a lot more misrepresentation than that.
    - Jessica Hullman on September 26, 2024 at 1:21 pm said:
      
      Argh, I said “real effects” above in my reply, implying the kind of dichotomous black/white thinking I generally try to avoid slipping into. Should have said something like “All it tells us is that some psychologists somewhere can identify some significant effects using NHST that are also significant and in the same direction when the study is repeated with high N.”
    - Anonymous on September 26, 2024 at 1:36 pm said:
      
      Hi Jessica Hullman,
      
      I worry they mix things up, and in doing so make it seem like something it isn’t. I commented on this on the “more on possibly rigor-enhancing practices in quantitative psychology research” post as well. I am the same “Anonymous” that posted the two comments about “rigor enhancing practices” there on Nov. 21, 2023, 10:45 am and the reply to that on Nov. 25, 2023, 1:30 pm.
      
      I browsed through the preprint again as a result of your reply here, and a few sentences stood out for me:
      
      1) “Based on pilot experimentation, each of the four labs submitted four new candidate discoveries for a self-confirmatory test and four replications, (…)”
      
      2) “Each of the 16 ostensible discoveries were obtained through pilot and exploratory research conducted independently in each laboratory”
      
      3) “It is likely that we observed high replicability because of the rigor-enhancing methodological standards adopted in both the original research leading to discovery and the rigor in replication.”
      
      I think here’s what happened. All labs performed exploratory research, using pilot studies or whatever you want to call it, and/or perhaps even already replicated findings themselves before the “self-confirmatory” test and the replications by the other labs.
      
      Maybe they used “rigor-enhancing” methods in that exploratory/pilot study phase and maybe they didn’t. Who knows. Maybe they even self-replicated already in that phase. Who knows. Maybe they investigated dozens of things and performed hundreds of pilot studies. Who knows.
      
      In my interpretation of the preprint the official “rigor-enhancing” methods only come into play at the “self-confirmatory” and “replication by other labs” phases, at least “officially” from what is written in the preprint. If that is indeed what is written in the preprint problems arise for me with quote 3). The words, and interpretation of, “original research leading to discovery” might be crucial in this all. As is what actually happened in the “exploratory”/pilot test phase.
      
      If the words “original research leading to discovery” are interpreted in a way so that it includes the “self-confirmatory” phase where “rigor enhancing” methods were “officially” used then quote 3) makes sense to me technically. But that leaves out everything that may have happened during “original research leading to discovery” before the “self-confirmatory” phase (which may already have included self-replication, who knows).
      
      If the words “original research leading to discovery” are interpreted in a way that does not include the “self-confirmatory” phase where “rigor enhancing” methods were “officially” used then quote 3) makes no sense to me technically.
      
      I’m extremely tired now so please verify and check things. I wanted to comment and try and contribute, but reading this stuff costs me so much energy it’s unbelievable. That’s part of why I don’t read stuff like this anymore. But I am happy to try and read again, or some more, or answer questions you might have at a later time if you want me to try. I’ve written a title of a manuscript I posted on SSRN somewhere in the comments here, and if you search that there you can find my e-mail address should you want to contact me. Thank you, and the authors of the paper (Berna Devezer and Joseph Bak-Coleman if I am not mistaken), and other people for writing about the problematic issues of the now retracted paper.
  - Joe on September 26, 2024 1:46 AM at 1:46 am said:
    
    Thanks for the clarification. I was just going on what Protzko had written on Twitter. But it’s a pretty fine line between “a journal leaving the door open for them to write a new paper” and “revise and resubmit”.
    
    I still find a discrepancy between what you and Jessica say about the paper and the journal leaving the door open to submit a new version of it. If the journal accepts your and Jessica’s characterization, I think they should also accept that the authors just can’t be trusted with this particular paper (and to be honest, maybe others as well), no matter how famous they may be.
    
    I don’t know what the proper response for something like this may be. Maybe there really is value in the study, but it just seems tainted to me at this point.
    
    - Jessica Hullman on September 26, 2024 10:29 AM at 10:29 am said:
      
      I agree it’s hard to imagine what this new manuscript would even look like. The paper got retracted in part because it tried to pull a seemingly compelling but unsupported story out of the data they had. If they have to be honest, such as by reporting on what they originally preregistered, it’s not clear what the contribution is.
      
      Fwiw, as an associate editor at a journal, I have never rejected a paper with no option to start over and try again. I think of it as a polite way to say “You need to rethink this entirely.”
    - Berna Devezer on September 26, 2024 1:53 PM at 1:53 pm said:
      
      I agree with Jessica and think of it this way: The data and the preregistration are publicly available. Since anyone can use that data and preregistration to analyze it properly and write it up the appropriate way, why shouldn’t the authors be able to? It would not be a revised version of this paper, for sure. It’s important to maintain trust and good faith in any social system even if it may create vulnerabilities. One of the roles of the editors is to incentivize honest science and my view is this is what they’re doing here. A sincere show of good faith and incentive to do it the right way.
Marginal Revelation on September 24, 2024 7:00 PM at 7:00 pm said:

Small if true

Anonymous on September 25, 2024 4:41 AM at 4:41 am said:

Quote from above: “I hadn’t really considered that open science could be doing harm, beyond maybe encouraging a different set of rigor signalling games.”

That’s one of the things I have been wondering about more and more as well, partly because of a paper by Edwards and Roy (2017) titled “Academic research in the 21st century: Maintaining scientific integrity in a climate of perverse incentives and hypercompetition” (Environmental Engineering Science, 34, 51-61). Table 1 contains a list of several “incentives” proposed in academia and/or the scientific process in the past which may have resulted in lots of problems and issues.

I worry about many proposals of recent years, and I worry most about folks not being critical enough and not wondering enough how things might have negative effects (in the future). Should people want to read some more about such questions related to recent proposals, I incorporated some of them in a manuscript that can be found on SSRN titled “Psychological Science Replicates Just Fine, Thanks”.

The general idea of pilot studies, people replicating each other’s work, etc. might still be interesting to further think about and has also been mentioned in a certain way on this blog back in 2017 in the comments. Should people want to ponder some more about this more general idea, I am linking to the comment here:

https://statmodeling.stat.columbia.edu/2017/12/17/stranger-than-fiction/#comment-628652

- Jessica Hullman on September 26, 2024 10:54 PM at 10:54 pm said:
  
  Thanks for the comment. Seems you’ve been thinking about some of the same things that others of us involved in this situation have also come to realize: that there are certain parallels between the sloppy work that is the target of replication crisis criticism and the work being advanced by those doing the criticizing, which make the two camps harder to separate than appears at first glance.
  
Anonymous on September 26, 2024 6:30 AM at 6:30 am said:

To anyone interested in this story, I also recommend Bak-Coleman’s blog: https://joebakcoleman.com/blog/2024/protzko/

What a mess. I can only express my thanks to you Dr Hullman, and then to Dr Bak-Coleman and Dr Devezer for helping correct the record. I can’t imagine the time and mental pressure of putting this all together and doing so carefully.

It is really shocking to see so many ‘big’ open science figures continue to defend the original paper and minimize the errors and problems. Not to mention the outing of whistleblowers and concerned researchers. If those working in open science aren’t concerned about doing science well, but just being buddies, then that’s just another hard pill to swallow.

- Jessica Hullman on September 26, 2024 12:13 PM at 12:13 pm said:
  
  Thanks for posting. Joe does not mince words. This definitely resonates with me:
  
  “Whatever your thoughts, ask yourself: How many changes, updates, switched analyses, outcomes, rewrites would it take to call this more than just an “embarrassing mistake”? How many mistakes are we allowed until we cannot shrug them of brazenly? At what point should we be held accountable for allowing mistakes to accumulate over years despite repeated feedback? How many “oopsies” would it take before we interpret their “aw shucks” posts in the last few days as lies intended to cover up their actions?”
  
Anonymous on September 26, 2024 9:26 AM at 9:26 am said:

Has anyone thought about the option that Nosek and his co-authors are merely attempting to provide evidence to justify the title of a different paper Nosek co-authored, “Preregistration Is Hard, And Worthwhile” (Trends in Cognitive Sciences, Volume 23, Issue 10, 815–818)?

I haven’t read the paper, and I don’t know whether it, and/or the title, is sarcastic or not, but perhaps that could be the case. I mean, if preregistration is really that hard, it would make sense that Nosek could make this clear because he has been involved in it all for so long now. Maybe he will present a possible solution for all the hard parts of/in preregistration. Or maybe he already did in the paper, who knows. I haven’t read the paper so I don’t know why preregistration would even be that hard.

- Anonymous on September 26, 2024 4:13 PM at 4:13 pm said:
  
  Quote from above: “Has anyone thought about the option that Nosek and his co-authors are merely attempting to provide evidence to justify (…)”
  
  Yes.
  
  This.
  
  I think it might just be this.
  
Anoop on October 17, 2024 10:07 PM at 10:07 pm said:

I still cannot wrap my head around the retraction of this paper. It is just mind-boggling that some of the very people who kick-started the preregistration and open science movement, and who were literally preaching to everyone about how to do science ‘correctly,’ can have so many glaring problems of exactly the kind they warned about. What makes it even worse is that this is THE paper that was supposed to show us the benefits of preregistration and transparency. Some of the comments here, like ‘oh well, these are the same problems with a lot of research, and hence a lot of papers should get retracted,’ completely misses the point. This paper and its authors are supposed to lead by example, and if they have issues, they should be the first to own up to their mistakes. Honestly, I don’t think I can keep a straight face listening to Brian Nosek speak about preregistration and transparency anymore!

- Andrew on October 17, 2024 10:56 PM at 10:56 pm said:
  
  Anoop:
  
  When this came up before, my response was that the science-reform movement is an awkward alliance between old-school researchers (like that ESP researcher who was part of the recently retracted paper) who had the impression that preregistration would allow them to get bullet-proof results and radicals like Uri Simonsohn who had the impression that preregistration would root out the bad science.
  
  The recently retracted paper was the result of a years-long study that bounced back and forth between the three goals of: (a) validating the paranormal hypotheses favored by some of the team, (b) shooting down bad work that would not replicate, and (c) providing evidence that the science reform movement itself had beneficial results. Goals (a) and (b) were in conflict, so it makes sense that they settled on goal (c).
  
  I’m not saying that this was their master plan all along; it looks more like they kept kicking the can down the road, keeping the coalition together as long as they could. And even now after the retraction they appear to be holding it together, politically speaking. I imagine that the lords of psychology and their favored proteges will still be able to publish in Science/Nature/PNAS, go on NPR/Freakonomics/Gladwell/Ted, etc.
  
  And from their point of view, it’s all about improving science. Even though some of them think this improvement will be achieved by increasing public confidence in existing work, while others of them think this improvement will come about by revealing the emptiness of so much that has been published and promoted.
  
  Credit to the science reform reform movement for holding the science reform movement up to the light.
  
- AAAnonymous on July 4, 2026 5:59 AM at 5:59 am said:
  
  Quote from above: “Some of the comments here, like ‘oh well, these are the same problems with a lot of research, and hence a lot of papers should get retracted,’ completely misses the point. This paper and its authors are supposed to lead by example, and if they have issues, they should be the first to own up to their mistakes. Honestly, I don’t think I can keep a straight face listening to Brian Nosek speak about preregistration and transparency anymore!”
  
  Partly as a result of this whole retracted paper saga, I started wondering and pondering and reading some more papers by certain people about certain things. Pre-registration doesn’t seem that hard to me, and I think this whole retracted paper-thing is very odd. Anyway, this all interacted and combined with some earlier things I noticed with certain research where certain people were involved. And I kept thinking about the Levelt et al. (2012) report about the Stapel fraud case in which the term “sloppy science” seems to have been introduced. I wonder whether this case of the retracted paper can be seen as a further example of “sloppy science” (to say the least) by (social?) psychological scientists. Perhaps someone should tell some people that certain things are perhaps not meant to be replicated!
  
  Anyway, this all resulted in a manuscript I have written titled “Pre-registration, grocery lists, and particular pre-registration issues” which can be found on SSRN. The manuscript mentions some other research and findings of certain particular people, has a specific focus on pre-registration and associated proposals and projects, and points out some issues that might be seen as being noteworthy. I hope the manuscript might be interesting or useful, also in light of this retracted paper saga. Perhaps there are similarities, or perhaps there is an overlap, concerning certain things. I reasoned this here might now be a good spot to mention it.
  

Statistical Modeling, Causal Inference, and Social Science

34 thoughts on “Well, today we find our heroes flying along smoothly…”