This is Jessica. There’s been some debate among computer scientists about what policies conferences should adopt for papers with hallucinated references. An independent analysis turned up at least 53 NeurIPS 2025 papers that were accepted (and presumably presented) at the conference in December but which had at least one hallucinated reference.
The question is, what should the default policy be if a paper is found to have at least one hallucinated reference? Should we conclude that these papers should have been rejected, and retract them? Should we instead let authors correct them? Going forward, should we desk reject papers with at least one hallucinated reference? What exactly can be concluded about the quality of the rest of the paper if you find at least one hallucinated reference?
The NeurIPS board statement suggests leadership is uncertain what to do about these papers:
“The usage of LLMs in papers at AI conferences is rapidly evolving, and NeurIPS is actively monitoring developments. In previous years, we piloted policies regarding the use of LLMs, and in 2025, reviewers were instructed to flag hallucinations. Regarding the findings of this specific work, we emphasize that significantly more effort is required to determine the implications. Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex (a formatted reference). As always, NeurIPS is committed to evolving the review and authorship process to best ensure scientific rigor and to identify ways that LLMs can be used to enhance author and reviewer capabilities.”
To make things concrete, consider a hallucinated reference to be a citation listed in the references section of the paper where the average reader cannot (in a reasonable amount of time) determine the identity of the cited paper well enough to track it down. That is, even if the hallucination is a transformation of what was originally a valid citation, the transformation is severe enough that it’s not obvious what the paper is. Hallucinated references are distinguishable from more minor errors like syntax issues or other errors that affect the citation but don’t prevent you from still easily tracking down what was intended.
I think we should be asking ourselves: What would we do if we found there was hallucinated evidence, such as experiment results? And we should treat these papers with hallucinated references as equivalently problematic. It doesn’t matter how many hallucinated references. It doesn’t matter how “valid” most of the paper is, or the probability that the main conclusions are correct conditional on finding a hallucinated reference. As most of us learn in primary school, a key reason authors cite relevant prior work is to help establish support for claims they make. If we don’t necessarily require that those references link to real research, then what are we even doing?
For the NeurIPS board to say that “the content of the papers themselves are not necessarily invalidated” suggests that they think some degree of fictionalized evidence is tolerable, if it happened through honest mistake. A friend recently relayed to me such a horror story, in which, in a last minute rush before a deadline, they gave an LLM the full correct citations for their paper and prompted it to fix some minor formatting issues to conform with the required format. They submitted the results in time, only to find that the model had added a single citation to a non-existent paper, listing them as an author alongside some renowned researchers in the field. Imagine anyone you look up to (much less one of the big wigs you associate yourself with in the fake citation!) reading your paper and discovering this. Yikes.
So yes, these kinds of mistakes can happen. But I disagree with the board that it matters whether the hallucinations were accidental and most of the paper is ok. Sure, the proofs might be correct, or the paper’s experimental results unaffected. But if authors are using LLMs to help with their citations and are not building time into their process to check the results, it seems fair to conclude that either 1) they don’t understand the errors LLMs tend to make very well, or 2) they don’t consider it a priority to get the facts right. In most cases we can rule out #1 with ML researchers, suggesting that not everyone is on the same page about the importance of not making things up. When you imply that hallucinated references do not necessarily affect the validity of the paper, you signal tolerance for some amount of hallucinated evidence. You tempt authors to keep taking their chances with how much responsibility they can offload to models, rather than encouraging them to retain, regardless of tool use, a sense of personal accountability for the factuality of what they submit.
Ultimately, I don’t think it matters that much whether NeurIPS allows authors of the affected 2025 papers to correct or retract. It would not surprise me if leadership decides to err on the side of the authors and let them correct, given that policies about LLM usage are evolving rapidly. What does matter is that they signal a lack of tolerance going forward, and this is where they missed an opportunity.
“2) they don’t consider it a priority to get the facts right.”
Shoving the title of a paper into Google Scholar to see if it actually exists is not a big ask. I don’t get it; it’s a level of laziness and carelessness that I’d normally ascribe to coasting college kids, not professional researchers.
“it’s a level of laziness and carelessness that I’d normally ascribe to coasting college kids, not professional researchers.”
Really. Like I said in another thread “Can you spell “Deloitte””? Or “lawyer”.
I am, of course, a hardliner here, and think papers with AI-ish errors should be flat out rejected. The one counterexample, where AI perhaps should be explicitly allowed, is for researchers who are not native speakers of English. For second-language speakers the distance between fluent completely functional spoken usage and decent writing ability is enormous. But they must carefully read and check their papers for AI stupidities.
Those of us who are responsible honest computer scientists should be screaming, over and over and over again: “LLMs are random text generators that do not check their output.”
But I don’t hear a lot of people saying that.
There’s a seriously great article in the 18 December 2025 issue of Science (page 1240): “Scientific production in the era of large language models*”. Among other things, they “… gauged writing complexity with the additive inverse of the Flesch Reading Ease score.”
Long story short: LLMs allow you to write more papers with more complex sentences (even if you are a native English speaker!). But the probability of a paper being accepted goes UP with sentence complexity for non-LLM-assisted papers and goes DOWN with sentence complexity for LLM-assisted papers. They checked this out using other methods as well, and got the same result.
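For anyone unfamiliar with the complexity measure quoted above, here is a minimal sketch of the Flesch Reading Ease score and its additive inverse. The formula constants are the standard ones for the score; the syllable counter is a crude vowel-group heuristic and the function names are my own assumptions, so this should not be read as the article's actual pipeline.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption): one syllable per run of vowels, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease: higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def writing_complexity(text: str) -> float:
    """Additive inverse of the reading-ease score: higher means more complex."""
    return -flesch_reading_ease(text)
```

Longer sentences and longer words both push the reading-ease score down, so they push this complexity measure up.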
Native English speakers publish (to arXiv or the like) more papers when they use LLMs, but the LLM isn’t necessarily your friend if you let it write long sentences.
(Perhaps reviewers smell BS when sentences are overly long and complex. Maybe. I’d think that LLMs could be trained to prefer shorter sentences, fixing this problem. But as the LLM types have found, there are tradeoffs in stylistic preference issues.)
Whatever. Your take on this article will be different from mine, so go read it.
*: Behind paywall, sigh: https://www.science.org/doi/10.1126/science.adw3000
Did I say “seriously great”? Yep. I did. And I was right.
“This creates a risk for the scientific enterprise, as a deluge of superficially convincing but scientifically underwhelming research could saturate the literature. If this occurs, it will cause the community to waste valuable time separating genuine insights from a morass of unimportant and potentially misleading work.”
Sheesh. I’d be seriously proud to have written that. Sheesh. That’s my style, in spades. I’m gonna sue them for misappropriation of stylistic expression.
I disagree about the non-native speaker exception. For context, I am a non-native speaker reading a lot of research proposals pre-submission. The common issue is somebody using an LLM to translate their proposal to English without actually having the capacity to verify the translation. The level of English provided by an LLM is too complex for them to verify because (my hypothesis) it is based on formal literary, formal scientific, and informal English, while non-native speaker scientists “only” know formal scientific English. This leads to some almost hilarious errors, even when the researcher did their best to check if everything is correct. Instead of making an exception for LLM use from non-native speakers, we could just collectively decide not to care about small language issues as much if the scientific intention is clear.
If a paper is included as a citation I assume the author has read it or at least referred to it in the main body of text. Imaginary papers have no place in a citation list and seem to me to be close to fraud. I do not see any difference between a hallucinated reference and one I make up myself.
I once made up a citation as a joke in the comments section of a blog post, something along the lines of:
Smith, J. & Wesson, P. (1999). Self-defense in modern society. Journal of Applied Proctology. 123:145.
and had several complaints that other readers could not find the article.
It is hard enough to check a real reference sometimes, and one can find the reference has no relationship to what was said in the paper. Hallucinated references will just make things more difficult.
I’m with you David.
“close to fraud” – I would say it is 100% outright fraud. It is a made-up citation, regardless of the intentions of the author. Completely unacceptable in any academic setting.
I am certainly not going to recommend the death penalty for authors who cite reference hallucinations. Nor do I think they should be excused. It is a serious breach of credibility. Such a paper should be rejected. I don’t mind inviting the authors to correct and resubmit, but I think it is appropriate to clearly put the burden of proof on such authors to erase any doubts about other issues with the paper. I do believe that an expert witness who cites hallucinations – and is discovered – would have a similar extra credibility burden to meet. Their career as an expert witness would probably be damaged, perhaps ended, but it is up to them to convince clients and courts otherwise.
> What does matter is that they signal a lack of tolerance going forward, and this is where they missed an opportunity.
I think this is the sticking point. In contrast, AISTATS (a much smaller conference that I am nonetheless fond of) did boot papers of reviewers who used LLMs, in violation of its policy against LLM use in reviews. From their earlier email:
– Note on LLM-generated reviews. The conference had a strict policy related to LLM-generated reviews. Offloading reviewing to an LLM was strictly forbidden, and this was communicated to reviewers at all stages of the process. Before submitting the review, reviewers had to explicitly acknowledge that they follow the Code of Conduct, and would face consequences if not. If an author who was nominated as the reciprocal reviewer of a submission got caught for violating the policy, the Program Chairs have decided to bump down the decision of that submission by one level (e.g., spotlight -> poster or poster -> reject). As a result, 15 papers were bumped down from poster to reject. Details on this will be provided in the PC’s opening remarks at the conference.
There just need to be consequences.
Angela:
Wow, that’s crazy that their rule is to bump down by one level. So if you have the status to have a “spotlight paper,” your paper doesn’t get rejected.
This is kind of like the legal system. If you’re rich and you do a crime, the punishment, if any, is to make you less rich. If you’re middle-class and you do a crime, the punishment is to make you poor. If you’re poor and you do a crime, the punishment is to put you in jail. Not always, of course, but that’s the general way it works. You get bumped down by one level.
I think that at the poster session the demoted “spotlight” papers should have a sign: “I broke the rules but I had a spotlight paper so I still get to present here.”
Better yet, let’s just bring back dunce caps. They could be distributed along with the conference badge.
I’ve never had to wear a dunce cap, but my chemistry teacher in 10th grade used to get mad at us, and sometimes he’d make me sit in the back of the room and face the wall.
A third option to those you present is that, in many cases, authors believe referencing other works is not an important part of the scientific process and are happy to delegate it to an LLM. I can sympathise to an extent; it’s expected to follow this silly narrative in health where you cite some burden of disease study to explain why diseases are bad. It’s intellectually empty.
Perhaps the root problem is that many researchers think citations are often a pro forma exercise rather than an integral part of the scientific communication process, and we should reconsider the types of statements that require citations, judging the egregiousness of the hallucination accordingly.
Citing hallucinated references is basically admitting that *none* of the authors even glanced at the references cited — in other words admitting that they’re defrauding the reader with the claim “here are references that are useful for understanding our work.” To me it seems clear that this should be met by rejection, plus rejection of all the PI’s submissions.
Raghu:
I think the claim is that the error was introduced in the editing process. I had a paper once–actually, one of my most influential works–that had a major error in its second sentence! It’s not that my coauthor and I never read the paper. It’s that the error got introduced in the copy editing stage and we didn’t carefully read the copy-edited version. I don’t remember the details, but I think the problem is that, instead of the copy editor making a bunch of suggestions for us to add if we wanted to, the copy editor just made the changes directly. That’s always dangerous, because it can be hard to reread something you already wrote; there’s a tendency to read what you meant to write, not what’s right there.
This NeurIPS case seems to be worse in that it was a computer program, not a human copy editor, that introduced the errors, but I guess the same principle applies. If you want to use a computer program to suggest changes to your writing, don’t have the program just spit out a new version; instead, have it spit out a list of suggested changes that you can put in one at a time, yourself.
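If a tool has already handed back a whole new version, one way to recover that list of changes is to diff the two versions yourself. Here is a minimal sketch using Python's standard `difflib`, applied to a reference list; the function name and the placeholder reference strings in the usage note are hypothetical, and this is one possible approach rather than a recommended workflow.

```python
import difflib


def list_reference_changes(original: list[str], edited: list[str]) -> list[str]:
    """Return one line per reference that was removed from or added to the list."""
    changes = []
    matcher = difflib.SequenceMatcher(a=original, b=edited, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # untouched entries need no review
        if tag in ("delete", "replace"):
            changes.extend(f"REMOVED: {ref}" for ref in original[i1:i2])
        if tag in ("insert", "replace"):
            changes.extend(f"ADDED: {ref}" for ref in edited[j1:j2])
    return changes
```

Calling `list_reference_changes(["Placeholder ref 1"], ["Placeholder ref 1", "Placeholder ref 2"])` reports the second entry as added, which is exactly the kind of silent insertion you would want to see before submitting.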
I agree that it’s not the authors’ fault if an editor, or as in the example from Jessica’s post an LLM itself, alters a citation without the authors knowing, but I would think that this is extremely rare. I suspect that in the vast majority of cases the authors are simply telling the LLM, “write this text and provide references.”
In your example of a (human) editor making changes: yes, this is terrible, but it must be extremely rare for an editor to make changes to *references* rather than text.
In Jessica’s example, I don’t have a lot of sympathy for the authors. Being “in a last minute rush before a deadline” isn’t a valid excuse.
Raghu:
Yeah, these people should just cut out the middleman and tell the chatbot, “Get my student a job.”
Jessica:
Also you make a good point in the title of your post, “Machine learning research is not serious research and therefore . . .” I’m reminded of when there’s a major error in the data analysis for a published paper and the authors always say that this does not affect the conclusions. This kind of thing makes me wonder why they did the data analysis at all! Similarly, if the references don’t matter, why include them in the first place? I always tell students that when they’re writing a paper, they should only include things that actually matter.
I don’t think machine learning research is not serious, nor do I think psychology research is not serious–but I do think that researchers who express the attitude that errors don’t matter are revealing that they don’t see their own research as being serious. They may be completely serious about the conclusions of their paper, but they’re not serious about the research itself.
Or, for another example, I remain annoyed that Columbia University seems to have made zero effort to track down who faked their U.S. News numbers and how the faking was done. The apparent complete lack of concern gives me the impression that the effort was never serious. Had it been serious, they’d be mad about what happened, no? It’s kind of suspicious when someone’s a victim of a crime but doesn’t even attempt to pretend to be trying to find “the real killer.”
I find incorrect references all the time. If the authors aren’t going to check their references anyway I’d prefer it to fail fast (hallucinated, rather than wasting my time looking through an existing paper).
Also, hallucinated references can be made essentially nonexistent nowadays. Tell the llm to download and read the actual paper, give an exact quote, and have another agent check it. I didn’t check when this data was collected, but only the laziest or unskilled users should have this problem anymore.
Anon:
Telling me to pass every paper thru a filter to find the fake stuff before I read it . . . that’s nuts. It’s the equivalent of living in a world in which I have to check every $10 bill that I get to see if it’s counterfeit. Or checking every time I get on the bus to make sure the bus driver isn’t a kidnapper. Or check every box of cereal I buy to make sure it isn’t poisoned. Etc. I want to live a normal life where I can take the bus, buy groceries, and read published papers without having to first go through filters. That’s not lazy, it’s normal life in a peaceful society. Talk about blaming the victim. Sheesh.
And, yeah, I get it, you did a successful troll–you got an angry reaction! Congratulations. Also, stop doing this. If you have something to contribute, that’s great. If you’re gonna troll and call people lazy because they read papers without running them through a filter, no.
I’m not trolling, just describing a process I use for my own research all the time.
My use case isn’t trying to publish papers. But, if I did want to have an llm write my paper it’s extremely unlikely there would be outright hallucinated references. It requires little skill or effort.
However, I will say if you try to do literature searches without a proper llm setup, you will not be able to compete going forward. That’s now like having the library order it, print it out, and then going to pick it up VS using the internet and clicking a link. I can do what took me months to aggregate using the old way in a few hours, leaving only the most interesting part of interpreting the info for myself.
It’s critical that people start learning these tools properly, because they are going to require societal support otherwise.
Far from a troll, in the short term, it’s really against my own interests to even share this info, rather than maintain the information asymmetry. The second order societal effects are obviously going to be severe for anyone unwilling or unable to use these effectively though. I think the brains of many here are just refusing to believe this is really happening.
Anon:
Saying that’s what you do is fine. Saying that those of us who don’t do it are “only the laziest or unskilled users” is the problem.
Let me clarify: If someone’s going to have an llm write their paper, and it includes hallucinated citations, they are either lazy or so unskilled it’s a problem. There is really no longer any reason we should be seeing those hallucinations. So when you do see one, ignore the entire paper like you would if it was riddled with typos.
Ahhh, my bad. I misread your comment. I thought you were saying it was the readers of the papers who were lazy and unskilled. You’re saying this of the authors.
Sorry for incorrectly accusing you of trolling.
Anoneuoid:
I’ve noticed the “Ask Me About My Insistence That People Must Learn To Use AI, Lest They End Up Needing Societal Support” t-shirt you’re wearing.
Ok, I’ll bite.
Isn’t that exactly what a lot of utopian AI people want to see? In the future they’re trying to create, there will be a small number of people who have the skills to use AI really effectively, and their labour productivity will be off the charts—maybe 50x or 100x that of a typical knowledge worker now. (As you say, you can already condense months of work into a few hours.) These people will generate massive amounts of value for the economy and will be compensated accordingly by the market. Many people will lack those skills, and there are only so many non-automatable manual labour jobs out there, so lots of us won’t really have anything useful to do in the new economy.
But that’s fine—we’ll just take some of the surplus from the added productivity of the folks using AI for work, and turn it into a universal basic income. So those of us who’ve been automated out of the economy can spend our days painting, gardening, taking our kids to the zoo, etc. Our existence will be comfortable but not extravagant, while folks like you who’ve done the work of learning to harness AI will live like kings.
As an empirical matter, one may have doubts about whether the technology’s really going to play out that way. But a lot of folks seem to think that future sounds pretty nice. Or at least tolerable.
So let’s have it straight: what’s your beef? The floor is yours.
@Mark
I don’t think it will play out with the government giving a UBI and everything being a utopia. Instead people who won’t or can’t figure out the new way will just try to keep doing the same old thing, get competed out of jobs, call for ai regulation that prevents top-tier open source options from catching up*, then blame “capitalism” or something else.
* Then it is centralized in the hands of a few. This is already happening where Claude (which is not open source, it’s a corporate decision) starts psychoanalyzing people after it messes up their code (because some chat bot user committed suicide). The result is the bot becomes “distracted” due to the system reminders, and only corporate clients can turn off this feature, which makes it less effective.
Anoneuoid:
Right, it was pretty clear already that you don’t see this playing out in a utopian way. But what’s the *specific* thing you’re worried about? When you say “societal support”, I expect that to most people’s ears it sounds like you mean a social safety net à la UBI. But now you say UBI’s not going to happen and the threat is “AI regulation”. So exactly what concessions do you think are going to be made to accommodate today’s lazy knowledge workers, and why are they bad?
Wealth and power will be even more concentrated, and then those few hands will also control the UBI payouts. What do you anticipate happens next?
Ah, so your concern is that people who don’t pick up the new skills will be vulnerable to exploitation, which is bad for everyone except the powerful few. Makes sense, thanks for clarifying.
Hope you’re going to get a third agent to check the results of the second agent as the second one might well be hallucinating too. You do know they just generate text that looks like it could be an answer to a prompt, don’t you? They don’t actually understand questions and answers and truth and meaning.
Do any of the tools you use for your research always understand the one truth and fundamental meaning of every answer they produce for you (humans included)?
I don’t care about “understanding”, it just needs to rarely make mistakes. The hallucinations can be assumed independent of each other, so it’s very unlikely two agents will hallucinate the exact same sentence.
LLMs are more like a new kind of creature, and each project takes on its own personality based on the data and various doc files. I don’t know if it understands (or what that word really means) but it solves problems in order to overcome obstacles to achieving goals and teaches me things (while I also “teach” it by giving ideas for whats important to document).
Adding a third check is also trivial, just costs more tokens. I actually do have one last QC run just before it outputs the final report as well. I will see irrelevant quotes (that don’t really address the question) every now and then but never outright hallucinated references. In those cases I point it out and tell it to log the mistake made to not do something similar again.
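To put rough numbers on the independence argument above, here is a small worked sketch. The 5% miss rate is a made-up illustrative figure, not a measurement, and independence is the assumption stated in the comment; if the checkers' errors are positively correlated (for example, because they share the same underlying model), the real figure would be higher than this product.

```python
def prob_slips_through(p_miss: float, n_checks: int) -> float:
    """Chance a fabricated item passes every check, assuming independent checks.

    p_miss is the (hypothetical) probability that a single check fails to
    catch the fabrication; under independence the probabilities multiply.
    """
    if not 0.0 <= p_miss <= 1.0:
        raise ValueError("p_miss must be a probability between 0 and 1")
    if n_checks < 0:
        raise ValueError("n_checks must be non-negative")
    return p_miss ** n_checks


# Illustrative only: each extra independent check multiplies the risk by p_miss.
for n in (1, 2, 3):
    print(n, prob_slips_through(0.05, n))
```

The point of the sketch is only that each added check multiplies the residual risk by the single-check miss rate, which is why a third pass "just costs more tokens" in this framing.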
I asked a Claude chatbot how to prevent spurious citations. Short answer: Good luck with that! A prompt that insists that all citations be verified to make sure they can actually be downloaded might be the most helpful way to go. I don’t like to include lengthy chatbot transcripts but I’m making an exception in this case. Please forgive me! Here’s the transcript:
me
—
If I were using you to help with writing a scientific paper, what prompt would be effective in preventing you from inserting spurious citations?
chatbot
——–
Great question. This is a real problem—language models can generate plausible-sounding citations that don’t exist. Here are some effective prompt strategies:
Be explicit about the citation standard you want: “Only cite sources you’re certain exist. If you’re unsure whether a paper, author, or publication is real, say so explicitly rather than guessing. It’s better to say ‘I’m not confident this exists’ than to risk a false citation.”
Request citations with justification: “When you cite a source, include a brief note about why you’re confident it exists—for instance, ‘This is a well-known paper by X’ or ‘This is from a major journal in the field.’ If you can’t explain why you’re confident, don’t cite it.”
Ask for specificity about uncertainty: “Flag any citations where you’re not highly confident in the exact title, author names, or publication details. Mark these with [UNCERTAIN] or similar, so I can verify them before including them.”
Use a bracketing approach: “Before finalizing any citations, list all sources separately and mark each as [HIGH CONFIDENCE], [MEDIUM CONFIDENCE], or [LOW CONFIDENCE] based on how certain you are they exist and are correctly described.”
Most importantly—make verification your own step: “I will verify every citation independently before submitting this paper. Flag citations that are less common or that you generated based on patterns rather than specific memorized sources.”
The reality is that I can hallucinate citations, and no prompt entirely prevents it. The most reliable approach is treating my citations as drafts to verify, especially for less well-known papers. I’m most reliable with major papers and well-established journals, but even then, spot-checking is worthwhile for an important document.
me
—
in your heading “Ask for specificity about uncertainty”, that implies that you can assess the quality of your responses. However, I am under the impression that chatbots/LLMs cannot do so.
chatbot
——–
You’ve caught an important inconsistency in what I said. You’re right to push back.
I can’t actually assess the quality or accuracy of my own responses in any reliable way. I don’t have introspective access to how I generated something or confidence scores about whether it’s correct. I can’t genuinely tell you whether a citation I produced is one I “remembered” from training data versus one I constructed from patterns that happen to sound plausible.
What I can do is follow instructions to add metadata or flags—like literally writing “[UNCERTAIN]” before something—but that’s just me following your format request, not a genuine assessment. If you ask me to flag uncertain citations, I might flag some, but I could also confidently present hallucinated citations without flagging them, and I wouldn’t necessarily know the difference.
So that prompt suggestion was misleading on my part. I was suggesting something that sounds like it would work but probably won’t deliver what you need, because it assumes I have self-knowledge I don’t actually have.
The more honest framing: prompts that ask me to be cautious might make me slightly more conservative in some cases (maybe I’ll add more hedging language), but they can’t give me genuine uncertainty assessment. The only reliable mitigation is really on your end—treating my citations as unverified drafts and checking them independently, especially for important papers.
Thanks for catching that. It’s a good reminder that I should be clearer about the limits of what I can actually do, rather than suggesting solutions that only work if I have capabilities I don’t possess.
me
—
Would it be possible to insist that you verify supposed citations to make sure they are actually downloadable on-line?
chatbot
——–
Yes, that’s actually a much better approach than what I suggested before—and I can do it using the web search and web fetch tools available to me.
You could prompt something like: “For every citation you include, verify it by searching for it online. Only include citations where you can confirm the source is actually findable. If you cannot locate a source through web search, do not cite it—instead note that you were unable to verify it exists.”
In practice, this would mean I’d search for papers by title and author, try to find them on Google Scholar, institutional repositories, journal websites, or preprint servers like arXiv. If I can’t actually locate and access (or at least confirm exists at) a real URL, I wouldn’t include it.
The limitations:
Some real papers might not be freely available online or might be paywalled behind journal sites I can’t access
Some legitimate sources (older papers, niche publications) might not have strong online presence
This approach would slow down the process considerably since every citation requires a lookup
The genuine advantage: This actually does work as a filter. If I cite something that doesn’t exist, I’ll likely fail to find it online. So you’d catch most hallucinations.
So yes—making citation verification a hard requirement with web search is probably the most practical prompt-based solution. It’s not perfect (it’ll miss some real but hard-to-access sources), but it does prevent me from confidently including citations that are entirely fabricated, since I’d have to actually locate them.
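The filter described in the transcript can be sketched in a few lines. This is a minimal illustration, not a working verifier: `can_locate` is a hypothetical callable standing in for whatever lookup you trust (a search tool, a DOI resolver, a librarian), and no real search API is called here.

```python
from typing import Callable, Iterable


def split_by_verification(
    citations: Iterable[str],
    can_locate: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Separate citations into those the lookup could find and those it could not.

    Anything in the second list is NOT proven fake: real but paywalled or
    obscure sources can land there too, so it is a to-do list for a human.
    """
    verified: list[str] = []
    unverified: list[str] = []
    for citation in citations:
        if can_locate(citation):
            verified.append(citation)
        else:
            unverified.append(citation)
    return verified, unverified
```

Keeping the unverified entries in their own list, instead of silently dropping them, matches the limitation noted in the transcript: a failed lookup means "check this by hand," not "this is fabricated."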
“making citation verification a hard requirement with web search is probably the most practical …” solution.
That was a lot of work to get to the obvious and correct answer. Sheesh. If you include a citation in a paper you write, you have to have actually read that citation. To actually read a citation, the citation has to be real. This isn’t rocket science, it’s basic common sense. Which people seem to lose when they use LLMs.
By the way, this blog’s fearless leader frowns on posts of the form “I asked a chatbot…”
Not only obvious, but something that the latest generation of models has already been doing for some time.
My take on this LLM exchange is that indeed there are ways to intelligently use LLMs so that the risk of fake references is reduced (not eliminated). So I don’t agree with your short take “Short answer: Good luck with that!” Yes, all the methods have issues, but there are ways to minimize these. In many exchanges I have where AIs have made errors, when I ask how to avoid these I’ve gotten similar responses. First, there is no way to guarantee it won’t happen again, but there are several things that can be done to reduce the possibility. I’d agree that there should never be a reference to a fictional paper, but let’s be real – how many true references that are cited were really read by the authors and really used in their work? Many references are listed to satisfy potential reviewers and/or editors. Often the papers were not carefully read – or if read carefully contain passages not consistent with the study referencing these.
I agree that these cases are different than citing something that does not exist. But I think the singular focus on that issue is an incomplete and exaggerated picture.
Sidenote: early in my career I had a published book review on my vitae (it was my first publication). I got a job teaching at the university that published the journal in which my book review appeared. Except that it never appeared! When I got the job, the chair of the department informed me that the book review had never been published but I had not been informed of that fact. It was due to political influence of the author of the book which my review had shown to be a waste of time (the book, that is). That was my hallucination.
Using my system (partially described above), it’s much more likely to point out inconsistencies in the paper (discussion says 5% but table 5a says 10%) than hallucinate a reference.
You can’t just ask it to recite things from memory and expect it to work, that is misusing the tool.
I am still having a hard time following you, especially in these long unthreaded comments. What is the reference to table 5a? And is your comment a reaction to what I said or something above? I, like Joshua, have been wondering about your recent comments – they have been particularly hard to comprehend. Perhaps I am aging more quickly than I feared, but I’m having these troubles.
From what I can tell, I think I agree with the gist of your sentiment here. There are ways to work with LLMs that avoid the worst sort of problems and there are ways to invite these problems. Using them effectively does require some knowledge and ability, so just replacing human effort with AI is likely to end up badly. But I think it would be foolish to dismiss AI as a waste of time for someone with some knowledge and critical thinking ability (and I believe? you are also saying this).
When used correctly, a modern llm is more likely to point out human mistakes than hallucinate a reference. 5% vs 10% is an example of the type of error that an llm would find in a paper.
This entire hallucination issue is now caused by a human misusing the tool.
PS: Maybe I’m too used to interacting with llms now, and they are already better than humans at extracting the intended meaning from a minimum of text.
Your last comment also seems a bit incomplete, but I understand what you say about LLMs and extracting meaning from text and want to comment on that. I think AI meeting minutes are a good example. I’ve had AIs summarize meetings I’ve been in – in particular, a lengthy 2 hour conversation I had with a friend. The AI summary was quite good – as you say, better than many humans. At the same time, there were some inaccuracies – comments wrongly attributed, exaggerations of some statements, and a few hallucinations (comments not actually made). It is of no consequence since I was in the meeting and the summary is not being used by anyone. But it is easy to imagine the potential damage such mistakes could lead to – for example, if used to summarize a Signal chat regarding a military operation. While it might be hoped that nobody would use an AI summary of a meeting for anything important, I think that is an ill-advised conclusion. I think 2 things are likely to happen, both of which are bad: AI summaries/meeting minutes will be used where they should not, leading to misinformation, mis-attribution, potential dangers. At the same time, since most meetings are of little consequence, AI minutes will contain such errors but nobody will care much because the meetings were not important to begin with.
It remains true that if you attend a meeting, use an AI to produce minutes, and then check these to ensure they are correct, the AI can be a useful tool. But if I had to predict the relative frequency for these efficient vs dangerous uses of AI, I’m not sure what I’d say.
I would consider that misusing the tool.
1) Save the original audio/video, then have two agents summarize (while also giving them access to the original audio, not just a transcript). It’s very unlikely both will agree on the same hallucination.
2) Then use either the same or another agent to look into any discrepancies.
You should now get essentially zero hallucinations. Do not expect them to recite from memory, the llm needs to check the source data for exact details.
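The two steps above can be sketched in Python under some loud simplifying assumptions: each agent's summary has already been reduced to a list of short claims, and "checking the source" is approximated by a case-insensitive substring match against a transcript. The function name `cross_check` and the sample data are hypothetical, and real summaries paraphrase, so a real checker would need something better than substring matching.

```python
def cross_check(claims_a, claims_b, source_text):
    """Compare two independently produced claim lists against the source.

    Step 1: find where the two summaries agree and where they differ.
    Step 2: look up every claim in the source instead of trusting either
    summary. Agreement alone is not treated as proof.
    """
    source = source_text.lower()

    def in_source(claim):
        return claim.lower() in source

    agreed = sorted(set(claims_a) & set(claims_b))
    disputed = sorted(set(claims_a) ^ set(claims_b))
    return {
        "agreed": agreed,
        "disputed_supported": [c for c in disputed if in_source(c)],
        "disputed_unsupported": [c for c in disputed if not in_source(c)],
        # Claims both agents made but the source does not back up.
        "agreed_unsupported": [c for c in agreed if not in_source(c)],
    }


transcript = (
    "Alice said the budget is due Friday. "
    "Bob said the launch moved to March."
)
summary_a = ["the budget is due Friday", "the launch moved to March"]
summary_b = ["the budget is due Friday", "the launch moved to April"]

report = cross_check(summary_a, summary_b, transcript)
```

Here the disagreement about the launch month is resolved by the transcript, not by either summary. The `agreed_unsupported` bucket is the important one for errors that both agents share, since comparing the two summaries with each other would never surface those.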
I don’t think it is worth continuing the discussion further except that your last comment raises some interesting thoughts for me. If you use 2 agents to recount an event, such as a shooting (and I mean human agents here), they are quite likely to disagree. And if you have video to look at, you may be able to choose the right account or you may have an ambiguous finding that leaves 2 possible descriptions or even adds a third to the mix. Similarly, with LLMs I think you encounter the same issue. You may get two different summaries, and if you have a transcript that may resolve the difference or provide a new summary. At this point, you might as well conclude that it is easier and better to just attend the meeting than do all that work! Such a conclusion, it seems to me, would satisfy a number of commentators in this thread. But I think the practical difficulties of such a “correct” use of the LLM make it more likely that a single summary will be (mis)used instead. Wrong, but predictable.
1. I believe you that this works for hallucinated references and such, but there are hallucinations that are systematic rather than random and so treating the behavior of the two agents as uncorrelated is incorrect.
2. There are more effective and less effective users of LLM, but unless there’s like a broad community consensus or official guideline on effective LLM usage, which there isn’t, labeling things as “user error” or “misuse” is unhelpful. There are things you are doing now which will be understood to be “misuse” in 6 months.
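3. You say “Maybe I’m too used to interacting with llms now, and they are already better than humans at extracting the intended meaning from a minimum of text.”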
You are uniquely one of the most ineffective people I’ve ever encountered at extracting the intended meaning from text, either with a minimum of text or after a protracted discussion. So I don’t exactly trust this evaluation.
Anoneuoid –
Seems to me your strategy can likely run into a recursion problem.
The two step method may not eliminate hallucinations because that method would hinge on assumptions that don’t hold for LLMs. IOW, they don’t have stable memory or an internal representation of “reality.” Each generation is a fresh statistical prediction, not a retrieval from a grounded store of facts. And hallucinations aren’t random. They arise from shared training biases, shared gaps in knowledge, and a shared lack of understanding reality.
In the second step, with two summaries that may already contain distortions, the next round may hallucinate a compromise between errors, fabricate explanations, or introduce new inaccuracies. And the iterations aren’t independent of each other and each pass is conditioned on the distortions introduced in the previous one. You could have a recursive loop where each round amplifies noise rather than canceling it. Without grounding in what actually happened, you may just compound the problem.
Seems to me that iterative passes would only reduce error if each round were anchored to a rock‑solid reference to what really took place. LLMs can’t do that.
Ah, I see Dale and somebody hit on similar points. I’ll consider that a win since they’re both smarter than me.
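The independence concern raised in the last few comments can be made concrete with a small calculation. As a simplifying assumption (nothing here was measured in this thread), suppose each of two agents makes a given error with probability $p$, and the two error events have correlation $\rho$. Then, from the definition of correlation for two Bernoulli variables with the same mean:

```latex
\[
P(\text{both agents make the same error}) \;=\; p^2 + \rho\, p\,(1-p)
\]
% Illustrative numbers only, with p = 0.05:
%   rho = 0   (independent errors):    0.05^2                     = 0.0025
%   rho = 0.5 (partly shared causes):  0.0025 + 0.5(0.05)(0.95)   = 0.02625
%   rho = 1   (fully shared causes):   0.0025 + 0.05(0.95)        = 0.05
```

With independent errors, a second agent cuts the chance of an undetected shared error from 5% to 0.25%. With $\rho = 0.5$ it only drops to about 2.6%, and with $\rho = 1$ the second agent adds nothing. How large $\rho$ actually is for any given pair of models is an empirical question this sketch does not answer.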
This is really easy to test empirically for yourselves. There is no need to speculate and wait for someone to publish a paper (which will be out of date by the time you read it anyway).
For those less tech savvy, I highly recommend buying a $200 laptop off eBay, then asking a site like perplexity.ai for the steps to put Linux Mint and Claude Code on there. Once it’s running, use Claude Code to explain itself to you. It’s not just the model (which can be easily swapped out), it’s model + harness + tools.
But, my ethical duty on this is fulfilled. Choose to keep believing what you want.
Anoneuoid
I hate to keep harping on this, but I have often found your comments of value despite often disagreeing with you. But this is yet another recent comment that just seems incomplete to me. You say this is easy to test for yourself, but I can’t figure out what “this” is. What exactly are you referring to? This is a long chain of comments with several different themes, and I just can’t figure out what you are referring to.
Unpaywalled preprint of the Science paper: https://arxiv.org/abs/2601.13187
Thank you for doing my homework for me!
I think it mainly shows disregard for the communal aspects of the scientific enterprise.
It is not much different from academics that only publish and never review.
These people lack the desire to actually interact with their peers and engage with their field.
Which is especially sad when said papers are part of a CONFERENCE. But also somewhat understandable given what NeurIPS has become and its scale – becoming more akin to Nature than to a community.
I guess this falls under a broader critique about the structures and institutions of Science, the requirements from CS students and the subsequent pace of research derived from it, as well as the other perverse incentives of academia. These all result in academics who think more and more about status and accomplishments and less about academia as a social enterprise.
If AI researchers haven’t by now been able to develop AI tools capable of automatically identifying hallucinated references, maybe the AI hype is a bit overblown?
It strikes me that this situation is one that NeurIPS has largely boxed itself into. On the one hand, NeurIPS is still ostensibly a conference intended to share and promote progress in machine learning research. But on the other hand, much of that research has been in service of products that are marketed chiefly in terms of their ability to mimic human performance in complex domains like generating long strings of coherent text. Indeed, I think that the prominence that NeurIPS has achieved has come from its record of prior success in promoting research that has led to big investments in text-generating products. But the future of NeurIPS as a research conference appears to depend on attendees *not* using the very products that NeurIPS promotes.
Or just using them responsibly! They can still be a big part of the research workflow, just don’t trust them to do things they aren’t designed to do!
Fair enough! I admit I wrote my comment just after responding to a survey from our dean on “integrating AI” into research/teaching/service, which did not put me in an optimistic mood.
But as you say, the kind of work showcased at NeurIPS has led to lots of valuable tools (that I use) for understanding the statistical properties of complex materials/tasks. And I have colleagues who use chatbots as “intuition pumps” to great effect when working on ideas or code.
Ultimately, I share your frustration with NeurIPS’ mealy-mouthed approach to falsifying references. That frustration is merely enhanced by the hypocrisy I sense in NeurIPS wanting to “have its cake *without* eating it”, that is, to promote research that attracts big money while at the same time discouraging people from using the products of that research.
I like your implication that the ML community needs to exercise some self-restraint with the very things they are selling the world. It’s the classic dealer’s dilemma – unless you’re strong-willed enough to resist embroiling yourself in the temptations you’re packaging and selling, you’re going to quickly end up out of business.
Just around the corner at 3 Quarks Daily I have an article that discusses the phenomenon of LLMs in a linguistics context and an institutional context: Of Grammar and Truth: Language Models and Norms, Truth and the World, https://3quarksdaily.com/3quarksdaily/2026/01/of-grammar-and-truth-language-models-and-norms-truth-and-the-world.html
I establish the linguistics context by starting out with the problem that the linguist Daniel Everett encountered when he tried to convert the Pirahã to Christianity. The grammar of Pirahã requires you to specify whether you’ve seen a phenomenon yourself, are reporting something someone told you, or encountered it in a dream. Since Everett had not seen Jesus himself, didn’t know anyone who did, and hadn’t encountered him in a dream, he had no standing for talking about this Jesus fellow. He was talking nonsense. This aspect of language is common in languages of preliterate cultures and is known as evidentiality.
I then move on to the idea that AI is a new kind of cultural technology, comparable to collective entities like corporations. This idea is being developed by people like Alison Gopnik, Henry Farrell, Ted Underwood and others.
What we’re witnessing in this post is the process by which we are attempting to institutionalize this cultural technology, to develop the norms and practices that allow us to use it perspicuously and safely.
I think your LLM might be giving you the wrong version of why the Pirahã rejected Everett’s Jesus, though it may be giving you the version an LLM might have most likely picked from zzorbing The Internet. That is, the most common/superficial version. Did you not consider other versions of the story where they reject Jesus because he won’t let them drink and have more than one woman?
Are you serious? Have you read Everett? I have. The story my LLM gave me is the one I gave to it from having read Everett, and chatted with him on FB. For that matter, did you read the whole article?
If you had, you’d know that Claude didn’t generate that story from a simple prompt. Rather, I had a dialog with ChatGPT that started with evidentials and went through hallucinations, Frankfurt’s bullshit, and other matters. That dialog went on for over 8000K words. I then uploaded the whole thing to Claude and asked it to turn it into a story. Why Claude? Because I like its writing style better than ChatGPT’s.
I have read your article. I have read Everett, and exchanged emails with him a number of times during, I think, the early 00s or late 90s. He has always been quite open that way. The view I am quoting comes from his book. “But the Pirahas do not want to live like Americans. We like to drink. We like more than one woman. We don’t want Jesus.” A bit less esoteric than your version where it has something to do with did you meet this Jesus guy?
Everett is not always the best advocate for his own views. And, sorry, long dialogues with ChatGPT are like long dialogues with a magic 8 ball. Not sure they count.
But you are saying you didn’t just rely on the crap the LLM regurgitated. So that is good.
For most of these papers the citations are irrelevant for the scientific content of the paper in the sense that they are not using the citations to refer to a previous theorem or result.
Are you suggesting that it should not matter if the citations are hallucinations? I hope not. While I agree that citations are highly overrated and abused (they are potentially useful, as Andrew often says, but in practice I think they fall far short of their potential), falsifying sources indicates a serious lack of research ethics and should damage a researcher’s credibility. At a minimum it should cause rejection of the paper in question – but it sounds like you might disagree with this?