This is Jessica. Many researchers are thinking about what we should do about scientific peer review now that AI makes producing papers so much easier. Submission numbers keep getting higher: in the past week, I saw reports that the most recent ACL submission cycle got 17k+ submissions, up from ~10k last cycle. TMLR went from getting 500 submissions every 60 days or so to getting the same number every 19 days. There are simply not enough human reviewers to handle the surge, at least not without a dip in quality. The noisier the review system gets, the greater the incentive to submit sloppy papers, because you might get lucky. This is the so-called “review death spiral.”
It is a hard problem. Quotas on submissions per author are one avenue forward, which TMLR just announced it would adopt. Not surprisingly, many reviewers are also turning to AI to help. The question becomes how to design AI review protocols to help reduce some of the noise, through preliminary filtering or flagging or helping guide human attention to parts of a paper that are most likely to be problematic.
But what sorts of checks should an AI review assistant run on a paper? It’s useful to separate basic integrity violations AI could flag (is there evidence of plagiarism, fake citations, or missing code/data to reproduce main results?), which are comparatively less controversial, from “epistemic filters” (does the paper pass replicability checks, robustness checks, preregistration checks, statistical significance checks, etc.?). There’s a temptation to blur these things in proposing how to apply AI to review. It’s easy to assume that the metascientists have already established that practices like replicability or preregistration are truth-indicating and we can just implement them at scale (and indeed, ML researchers are citing open science and other reform arguments to back their proposals).
But if there’s one lesson to be learned from the aftermath of the replication crisis, it’s that there is no small, stable, non-conflicting set of detectable signals of good science that will find the good stuff and reject the bad. There are heuristics that can be useful prompts for deliberation – get in the habit of preregistering, make sure you can replicate your results, test the sensitivity of your results to choices you made along the way – but things get weird when we start treating them like universal requirements. Authors shift attention away from unrewarded signals, like better theory or exploratory work, and become preoccupied with rigor signaling through their methods. The result is not necessarily more thoughtfulness.
And so even if the AI review tools we create are simply intended to inform human reviewers about what checks a paper passed, what we implement will have important policy implications by incentivizing more work like that in the future. I don’t think we are in a good position to predict what happens if suddenly we require multiverse robustness or statistical significance in a field like machine learning, which has in many ways been all about iterative improvement and “frictionless reproducibility” rather than individual results passing all the robustness checks.
The answer is not to avoid using AI in review until we can find a non-gameable set of credibility qualities to have AI focus on, as some have recently argued (though I agree with the linked paper that we need more rigor in how we go about motivating review tools). Non-gameability sounds nice, but any automated review policy that allocates attention will be gameable, because ensuring good science is not so simple as finding the right checklist. The relevant question is instead what assumptions and downstream incentives we are willing to tolerate. To this end, at the very least we should get in the habit of spelling out the assumptions we’re making, so that the trade-offs of focusing on particular proxies become explicit.
I wrote up this view recently in a paper called “Stop Treating Metascientific Heuristics as Quality Filters in AI Review.” Here’s the abstract:
AI-implemented checks for reproducibility, robustness, preregistration, claim scope, and other intended proxies for scientific credibility can extend human reviewers’ capabilities. However, treating metascientific heuristics–whose theoretical grounding remains contested or incomplete–as necessary and sufficient signals for filtering out bad science is counterproductive to scientific progress. The emerging literature blurs the line between integrity filtering, based on necessary but insufficient signals of validity like reproducibility of stated results or lack of fake citations, and epistemic filtering, which uses machine-detectable signals to judge scientific quality. Drawing on critical metascience, we show that commonly proposed signals of research quality are insufficiently justified as general indicators of scientific value. The answer is not necessarily to ban AI in review, given the deluge of submissions venues are facing. Instead, in recognition of how any use of automated signals–even when deployed with human oversight–will shape attention and create incentives upstream, developers of AI review tools should explicitly specify their assumptions about how proxy signals inform on scientific quality in the context of specific review decisions. This approach treats AI review contributions as contestable decision policies that will shape future research, acknowledging the value-laden nature of scientific judgment and surfacing relevant tradeoffs.
Rather than arguing for or against any particular proxies, I’m more interested in the methodological and philosophical mindset we should bring to the new questions raised by AI review. To demonstrate what I mean by more explicit motivation, I analyze an example review decision problem and set of detectable signals in the appendix, drawing on an analysis of how statistical significance and exact replication success relate to signal-to-noise ratios measured under error from a recent paper by Erik van Zwet, Andrew, and Witold Więcek. The takeaway is that the value of a proxy will depend on how you define the latent state you care about (e.g., whether the direction of an effect was correctly estimated, how big the true signal-to-noise ratio is), what you assume about the generating process (i.e., how the proxy noisily reflects the latent state), and what you assume about the decision-maker’s choice of actions and utility function. By suggesting this approach, I am *not* suggesting that one can validate a new review tool’s utility before it’s been deployed. The point is that there will be trade-offs no matter what, and the best we can do is be concrete about the kinds of assumptions that have to hold for proxies to be useful in review, so the community can debate what risks they are willing to accept.
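As a rough illustration of that dependence, here is a toy simulation. It is a sketch under made-up assumptions, not the analysis from the paper's appendix: true signal-to-noise ratios are drawn from a normal distribution with standard deviation `tau`, the observed z-statistic adds standard normal noise, the proxy is |z| > 1.96, and the cutoff of 2.8 for a "high" signal-to-noise ratio and the gain/loss payoffs are arbitrary. All function names are hypothetical.

```python
import random

def simulate(tau, n=100_000, seed=0):
    """Simulate n studies under an assumed (hypothetical) generating process."""
    rng = random.Random(seed)
    studies = []
    for _ in range(n):
        snr = rng.gauss(0.0, tau)      # latent true signal-to-noise ratio
        z = snr + rng.gauss(0.0, 1.0)  # observed z-statistic, measured with error
        studies.append((snr, z))
    return studies

def sign_correct(snr, z):
    """Latent state A: the estimated direction of the effect is right."""
    return snr * z > 0

def high_snr(snr, z, cut=2.8):
    """Latent state B: the true signal-to-noise ratio exceeds an arbitrary cutoff."""
    return abs(snr) > cut

def proxy_value(studies, latent_good, z_cut=1.96):
    """Return P(good | significant) and P(good | not significant)."""
    sig = [latent_good(snr, z) for snr, z in studies if abs(z) > z_cut]
    non = [latent_good(snr, z) for snr, z in studies if abs(z) <= z_cut]
    return sum(sig) / len(sig), sum(non) / len(non)

def policy_utility(studies, accept, latent_good, gain=1.0, loss=1.0):
    """Average utility per submission: +gain for an accepted good paper,
    -loss for an accepted bad one, 0 for a rejection."""
    total = 0.0
    for snr, z in studies:
        if accept(z):
            total += gain if latent_good(snr, z) else -loss
    return total / len(studies)

def significant(z):
    return abs(z) > 1.96

def accept_all(z):
    return True

if __name__ == "__main__":
    for tau in (0.5, 1.0, 2.0):
        studies = simulate(tau)
        print("tau =", tau)
        print("  sign correct (sig, non-sig):", proxy_value(studies, sign_correct))
        print("  high SNR     (sig, non-sig):", proxy_value(studies, high_snr))
        for loss in (1.0, 4.0):
            u_filter = policy_utility(studies, significant, sign_correct, loss=loss)
            u_all = policy_utility(studies, accept_all, sign_correct, loss=loss)
            print("  loss =", loss, "filter:", u_filter, "accept all:", u_all)
```

In this toy setup, whether filtering on significance beats accepting everything depends on the payoffs: with equal gain and loss, accepting everything does better, while a larger penalty for accepting a wrong-signed result can favor the filter. Changing `tau`, the latent-state definition, or the utility changes the answer, which is the sense in which these assumptions need to be spelled out.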
In this sense, my argument is very much along the same lines as Devezer et al.’s argument that those proposing reform procedures should adopt more formal methodology to avoid unwarranted overgeneralization. Once checks become part of review infrastructure, they stop being neutral diagnostics and become policy levers. Let’s start treating them as such in research on AI review.
1) I just don’t understand the peer review thing, with or without the help of A.I. Just post a paper, and the scientific community, or parts of it, will sort things out and people can focus on what criteria they think are most useful. This could, for instance, be the criterion of having direct personal experience with using the work of a certain scientist and finding out that this work has been useful for one’s own work and has proven over time to regularly be useful. Or, it could be focusing on those factors one thinks are most important, at least as a starting point (e.g. availability of a public pre-registration).
Anyway, here are some thoughts about peer review I have mentioned on here several times that make at least a little sense to me:
https://statmodeling.stat.columbia.edu/2025/08/03/you-can-cite-peer-reviewed-research-in-support-of-almost-any-claim-no-matter-how-absurd/#comment-2401742
2) I recently read some things about the difficulty to determine the quality of research. And I have read about the possible problems with focusing on certain factors seen as being indicators of quality or seen as desirable processes or characteristics. Both seem in line with your possible interests and current manuscript, so I reasoned it might be useful to mention here for possible future thought or writing.
Higgs, M. D. & Gelman, A. (2021) “Research on registered report research”. A quote from that paper:
“Does the scientific community agree on the most important characteristics of research quality or research credibility, as well as on how to measure them? Individual researchers have different standards and tend to focus on different aspects of quality, not to mention the differences that exist among disciplines. This is just something we should have in mind.” (p. 979)
Klonsky, D. (2025) “Campbell’s Law Explains the Replication Crisis: Pre-Registration Badges Are History Repeating”. A quote from that paper:
“Pre-registration mandates are positioned as an antidote. However, I argue that such efforts, perhaps best exemplified by pre-registration badges (PRBs), are history repeating: Another useful tool has been converted into an indicator of strong science and a goal in and of itself. This, too, will distort its use and harm psychological science in unanticipated ways.”
I thought these were interesting as well:
Tennant, J. P. & Ross-Hellauer, T. (2020) “The limitations to our understanding of peer review”. From the abstract:
“Peer review is embedded in the core of our knowledge generation systems, perceived as a method for establishing quality or scholarly legitimacy for research, while also often distributing academic prestige and standing on individuals. Despite its critical importance, it curiously remains poorly understood in a number of dimensions. In order to address this, we have analysed peer review to assess where the major gaps in our theoretical and empirical understanding of it lie.”
Heesen, R. & Bright, L. K. (2021) “Is Peer Review a Good Idea?”. From the abstract:
“Prepublication peer review should be abolished. We consider the effects that such a change will have on the social structure of science, paying particular attention to the changed incentive structure and the likely effects on the behaviour of individual scientists. We evaluate these changes from the perspective of epistemic consequentialism. We find that where the effects of abolishing prepublication peer review can be evaluated with a reasonable level of confidence based on presently available evidence, they are either positive or neutral. We conclude that on present evidence abolishing peer review weakly dominates the status quo.”
From your manuscript: “Indeed, a growing critical metascience literature argues that many reform proposals–despite their good intentions–have been subject to the same kinds of overclaiming that they aim to critique, (…)”.
I came across the papers I mentioned in earlier comments when writing a manuscript that ties in with the quote above. Here’s a portion:
“Soderberg et al. (2021) investigated Registered Reports (RRs) and non-RR comparison papers by having reviewers “(…) peer review two published articles and evaluate qualities such as creativity and rigour on rating scales (…)” (p. 995). Higgs and Gelman (2021) commented on Soderberg et al. (2021) and note that “It’s hard to tease everything apart, and this is a great example of nuances that researchers studying new research practices need to consider.” (p. 978). I looked at the study details of Soderberg et al. (2021) to try and understand how things were investigated.
(…)
“It might be noteworthy, if I am understanding things correctly, that a substantial part of the RRs used in the study by Soderberg et al. (2021) do not include a link to the pre-registration (see Table 1, p. 993). If this is correct, one could wonder whether the reviewers looked for such a link when reading words like “preregistered” in the excerpts. Or one could wonder whether the reviewers assumed such a link is present somewhere else in the actual paper, but just not in the specific excerpt they were reading. Or one could wonder whether the reviewers read a word like “preregistered” and did not even think about, or look for, a link to pre-registration information. The availability of a link to pre-registration information might be a factor that some take into account in judging whether a study is “rigorous” or “high quality” or “preregistered”. And some might wonder whether reported “preregistered analyses” in such studies without a link to publicly accessible pre-registrations even deserve to be called “preregistered”. For some, this might all be crucial in judging the rigour and quality of a study, which might tie in nicely with Higgs and Gelman (2021) who write: “Individual researchers have different standards and tend to focus on different aspects of quality, not to mention the differences that exist among disciplines. This is just something we should have in mind.” (p. 979). This all might provide a useful example of how attention might be directed to one thing, and not to another thing.”
(…)
“Perhaps there is a real risk that sub-optimal, severely flawed, or bad research is nonetheless used to introduce, underline, or expand certain projects or initiatives. Perhaps this ties in with what Higgs and Gelman (2021) mention: “Soderberg et al. will likely be cited often, as support for various arguments regarding the effectiveness of registered reports in research, and most notably for the increase they found in research quality.” (p. 978). Even just a single, perhaps flawed, study can heavily influence things, and can facilitate taking a certain step. That step might be a step in the wrong direction, and might not easily be corrected.”
Quote from your manuscript: “We observe a blurring of a distinction between using AI for integrity filtering to using AI for epistemic filtering in emerging work on AI review.”
Perhaps the same possible issue is also present in certain proposals for improvements themselves, such as the case of so-called “Registered Reports”. Or, at least to me, things seem to overlap. For example, this is what I have written about Registered Reports, citing the Heesen and Bright (2021) and Tennant and Ross-Hellauer (2020) papers mentioned earlier:
“The above mentioned issues might also be a useful reminder with respect to other proposals and processes involving pre-registration and associated projects. For example, Registered Reports is a format where “Reviewers evaluate the importance of the research question and the quality of the methodology used to evaluate that question.” (Soderberg et al., 2021, p. 990) before results are known, and in most cases before data collection. It can be argued that scientists should decide themselves what the importance of certain research is, and to what work they direct their attention (see Heesen & Bright, 2021, pp. 649-650), and from that perspective, explicitly asking reviewers to evaluate the importance of the research question (see Gerpott et al., 2024, p. 3; Soderberg et al., 2021, p. 990) might not be desirable. It might further be the case that reviewer characteristics heavily influence this evaluation in undesirable ways (see Lee et al., 2013, pp. 8-10; Tennant & Ross-Hellauer, 2020, p. 8).”
Are you actually aware of empirical cases where the peer reviewer rejection of a Registered Report actually results in a scientist deciding to address a substantively different question in their research programme? It seems to me that such a thing is probably unlikely, given that presumably it is as easy to get a Registered Report (self?)published somewhere as it is for any other research; not to mention that the scientist can of course argue back against reviewers. It seems to me that you might be assuming that the peer review of Registered Reports does more work vis-a-vis the evolution of science than it actually does, at least in the majority of cases.
I reason that similar issues concerning peer review and its direct or indirect influence on which papers and topics get published might play an even bigger role in Registered Reports, for several reasons. One is that the influence of journals and peer reviewers is present at an earlier stage compared to “normal” research: at the so-called “stage 1” phase, where in certain, or even most, cases with Registered Reports the research has not even been performed.
You can’t technically publish a “Registered Report” elsewhere, if I am understanding things correctly, because it’s intertwined with the journal and peer review. That’s where I see most problems or possible future issues. I think out loud about such things in “Pre-registration, grocery lists, and particular pre-registration issues”, which can be found on SSRN. Here’s a further section about Registered Reports:
“Considering some of the issues concerning journals, editors, and peer review mentioned earlier, it might also be useful to wonder whether explicitly incorporating peer review in early stages of research and manuscript submission in the Registered Reports format (see Chambers & Tzavella, 2022, p. 29) might somehow influence authors and their research. Might the Registered Reports format be especially susceptible to coercive citation (see Chorus & Waltman, 2016, p. 2; Fong & Wilhite, 2017, p. 5)? Or, could this format result in even more pressure, or ways in which, to please the reviewer (see Binswanger, 2014, p. 56)? And maybe the Registered Reports format can involve other, or completely new, processes that may not be beneficial. For instance, could the Registered Reports format, and incorporating Registered Reports into conventional publishing and review (see Gerpott et al., 2024, p. 6), provide authors and peer reviewers with new ways to sabotage, manipulate, and delay certain things (e.g. see Anderson et al., 2007, p. 453; Scheel et al., 2021, p. 9; Smith, 2006, p. 180)? If Registered Reports are connected to funding models (see Chambers & Tzavella, 2022, p. 38; Gerpott et al., 2024, p. 6), peer-communities (see Eder & Frings, 2021), or training and accreditation for editors and journals (see Chambers & Tzavella, 2022, p. 39), could such things result in even more influence of a small group of people (see Anderson et al., 2007, p. 452; Heesen & Bright, 2021, p. 647; Martin, 1992, pp. 88-90; Tennant & Ross-Hellauer, 2020, p. 3)? And could Registered Reports, and associated initiatives, contribute to even more (future) research bureaucracy (see Binswanger, 2014, pp. 65-70), “technosolutionism” and monetization of platforms and services (see Andrews, 2020)?”
I can’t reply directly to your response, but as far as I can tell your response says “no, I have no evidence, but here are some more hypotheticals”.
Very clear and very useful. Thanks Jessica
Like NHST and interpreting arbitrary regression coefficients, institutionalized peer review was widely adopted ~1950 despite no evidence it actually works.
See that mind-body healing thread. You can’t measure healing after only 28 minutes, so the premise of the paper can be immediately rejected. But peer review did not do so, and that’s the rule rather than an exception.
It makes no difference to me whether something was peer reviewed, what do people think they are getting out of it?
I am by no means an expert, but I think the idea of “institutionalized peer review” is too ambiguous. It appears (https://pmc.ncbi.nlm.nih.gov/articles/PMC9014922/) that it was developed in the 19th century (perhaps not “institutionalized” at that point), and in any event, I think it was preceded by editorial decision-making, a different sort of peer review. With the proliferation of scholarly activity, I suspect that a clearer division of labor between editors and reviewers emerged. Widespread adoption of peer review “worked” in terms of the needs of a large-scale publishing industry. Whether it ever “worked” in terms of publication quality is a different matter. Now that AI poses new challenges for scientific publication, I think we will see new types of “review” emerge. Whether they “work” will be decided on the basis of whether they meet the needs of a massive increase in submitted research and a corresponding massive decrease in the quality of much of that, not on the basis of whether they enhance the quality of scholarly work.
I can agree with your last sentence – peer review tells me little. But I think we are at a loss for replacing it with something better. Despite a number of attempts (preregistration, pre- and post- publication review, etc.), none of these seem capable of keeping up with the scale of what is changing. I’m in the process of writing a book (about analysis with AI) – how quaint it seems! What do you put in a book, when enhanced search can easily exceed what I can write? And who will read a book rather than using AI to replace it – or use AI to summarize it if the book is somehow ‘required reading?’ By the time I formulate answers to such questions, the world will already be different.
Dale –
Yes. Finding flaws in peer review is easy but implementing better alternatives is complicated.
Peer review* is already the replacement, for direct replications. It would have been nice if looking at symbols and pictures was a good heuristic for actually repeating the study, but unfortunately it doesn’t work.
* This, once again, is referring to the institutionalized/bureaucratic version. Getting feedback from others isn’t at issue.
I wonder how direct replications will perform in the AI era. It should be easy to ask an AI to create a replication, make up the data to ‘verify’ the original results, and then pronounce the original study replicated. It seems far easier to game the system than to do it legitimately.
I’d say it’s already been trivial to generate realistic fake data with minimal programming skills for decades, so the bar is a bit lower with AI but that doesn’t change much.
And of course the best replications are those performed by people/groups in competition who disagree with each other.
If there will be a conspiracy using AI to repeatedly fabricate results across multiple groups, then the final test is whether these groups can ever predict anything or perform some kind of useful feat.
Anoneuoid –
This is fairly tangential to Jessica’s focus on AI review, but it seems relevant to the broader conversation.
My understanding is that there’s research showing a strong majority of authors report that, on balance, peer review improves their papers by resulting in clearer exposition, stronger limitations sections, etc., and only a small minority say it adds zero value to their papers (let alone research).
I’d guess most authors would also say that peer review can be a pain in the ass, and that the extra work sometimes (often?) is not worth the benefit. And of course, it is important to note that improving a paper isn’t the same thing as improving the underlying research.
But invoking the Nirvana fallacy doesn’t get us anywhere good. The fact that you personally don’t care whether a paper was peer reviewed doesn’t mean peer review didn’t improve the paper. And the fact that bad research passes peer review doesn’t justify the counterfactual you’re implying – that peer review never results in better papers than would have been published otherwise.
Of course it’s a mistake to assume that peer review guarantees flawless research. But I see a lot of people using the flaws of peer review as part of a larger critique of the institutions of science and scientific expertise, where ‘do your own research’ becomes a replacement for peer review. In the end, for all the legitimate criticisms of peer review and institutional science, that’s a losing tradeoff, imo. It doesn’t have to be a binary. And criticism of peer review should be contextualized within the larger societal frame of how it links to celebrity contrarians like Bobby Kennedy steering public health narratives, and the broader Joe Roganizing of public scientific discourse.
Quote from above: “My understanding is that there’s research showing a strong majority of authors report that, on balance, peer review improves their papers (…)”
I am not sure authors’ views on things are most relevant or important here. In my view it should be about whether or not peer review actually improves papers (however difficult that may be to measure or determine). The point is that the views or feelings of the scientists about it should perhaps not be the most important.
To perhaps make my case clearer, here’s something I did not end up using in my manuscript mentioned elsewhere in the comments here, but can use now. In a paper by Bakker et al. (2020) titled “Ensuring the quality and specificity of preregistrations” the following can be read:
“Note that these Transparency Scores were called “Restriction Scores” in the preregistration as it concerns descriptions that restrict opportunities for researcher degrees of freedom, but as these descriptions entail transparency about the research process, we use “Transparency Scores.” We thank a reviewer for this suggestion.”
Well, I, as a reader, do in fact NOT thank a reviewer for this suggestion. At the time of reading it all, the new name seemed less appropriate to me for what is actually being captured or measured, and the differences between the names in the final paper and the pre-registration were annoying and made things more complicated. But please check and verify, it’s been a while since I read it all.
“My understanding is that there’s research showing a strong majority of authors report that, on balance, peer review improves their papers by resulting in clearer exposition, stronger limitations sections, etc., and only a small minority say it adds zero value to their papers (let alone research).”
I agree with AAAnonymous here in that it’s not clear that this is much of a signal: for fatally (or near-fatally) flawed research (e.g. choosing the best model via some algorithm focused on prediction and then presenting it as if it were a priori and causal), I would expect a peer reviewer ultimately accepting the work to offer some improvements but not to reject the paradigm. This seems the most likely scenario if a large part of a broad research programme is stuck inside a broken model. Hence this result where authors predominantly say that peer review helps. Peer reviewers rejecting the whole premise typically don’t get invited back.
Anon and AAAnonymous –
Both of these replies rest on an unfalsifiable counterfactual. We can’t prove that authors are correct when they say peer review improved their papers, but I’d say it’s more probable that they’re right about their own experience than that the strong majority are mistaken.
And we can’t prove whether a flawed paradigm is being polished rather than improved. Those are inherent unknowns.
The real question is whether we want to discard potential incremental improvement inside an imperfect paradigm.
Until we find a perfect paradigm, rejecting incremental improvement in favor of philosophical purity is its own epistemic choice. Maybe that choice is justified, but it shouldn’t be ignored that we’re making epistemic tradeoffs.
And in the meantime, it’s a hard reality that celebrity contrarians like Bobby and Joe leverage a broad critique of peer review to inflict real damage. Is that damage outweighed by some benefits? I’m open to the argument if someone were to lay it out in some detail, but I will admit to a strong “prior” that the net balance is heavily negative.
Joshua:
I don’t think it’s quite accurate to describe Bobby and Joe as “contrarians,” given that they are supporting the views promulgated by the U.S. government and major media moguls.