This post is from Bob
Even though the post is from me, the story is partly about Andrew, but mostly about how LLMs are getting better at math. Rather than telling everyone about LLMs and having a theoretical discussion, I thought it’d be useful to show you another example of the kind of interaction with LLMs that I find super useful.
The backstory
Andrew frequently mentions that he’s impatient with Stan’s default sampling times and would like something to give him a rough answer faster. So I and several colleagues have been working on finding faster adaptation (for now, I would recommend Adrian Seyboldt’s fast adapting Nutpie sampler, which can already be run with models coded in Stan or PyMC). That means evaluation. So I finally asked Andrew for an example.
Gelman’s measurement error model
Here’s the measurement error model that Andrew gave me:
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x_star;
  real<lower=0> sigma_x_star;
}
parameters {
  real a, b, mu_x;
  real<lower=0> sigma, sigma_x;
  vector[N] x;
}
model {
  x ~ normal(mu_x, sigma_x);
  y ~ normal(a + b*x, sigma);
  x_star ~ normal(x, sigma_x_star);
}
As an aside, I love that there’s an appropriate lower bound on N.
This is a classic measurement error model. If you’re not familiar, there’s a short introduction to exactly this flavor of model in the Stan User’s Guide section on measurement error models.
Bob’s revised model
This looks like a BDA-type model: there are improper flat priors. And the latent true values of the covariates x have a centered parameterization. Both of these can cause problems in fitting, so I switched to a non-centered parameterization using offset and multipliers, then added some weakly informative priors for all the parameters.
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x_star;
  real<lower=0> sigma_x_star;  // known measurement error scale
}
parameters {
  real a, b, mu_x;
  real<lower=0> sigma, sigma_x;
  // non-centered via affine transform: x = mu_x + sigma_x * x_raw
  vector<offset=mu_x, multiplier=sigma_x>[N] x;
}
model {
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  mu_x ~ normal(0, 1);
  sigma ~ lognormal(0, 0.5);
  sigma_x ~ lognormal(0, 0.5);
  x ~ normal(mu_x, sigma_x);
  y ~ normal(a + b * x, sigma);
  x_star ~ normal(x, sigma_x_star);
}
It fits better than Andrew’s but still not well. With the default 1000 warmup and 1000 sampling iterations on four chains, there were still R-hat values above 1.2. If we crank that up to an order of magnitude more iterations, everything looks OK, but it still has an integrated autocorrelation time of around 100 (meaning the ESS is roughly 1/100 of the number of draws).
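To put a number on that, here is a small sketch of the arithmetic. It assumes the longer run used four chains of 10,000 sampling iterations each (one reading of "an order of magnitude more"), and it uses a stationary AR(1) process purely as a stand-in to show what level of serial correlation gives an integrated autocorrelation time near 100. The function names are made up for illustration.

```python
def ess_from_iat(num_draws, iat):
    """Effective sample size implied by an integrated autocorrelation time."""
    return num_draws / iat


def ar1_iat(rho):
    """IAT of a stationary AR(1) process with lag-1 autocorrelation rho.

    1 + 2 * sum_{k >= 1} rho^k = (1 + rho) / (1 - rho)
    """
    return (1.0 + rho) / (1.0 - rho)


# Assumed run size: four chains, ten times the default 1000 sampling iterations.
total_draws = 4 * 10_000
ess = ess_from_iat(total_draws, 100)
print(ess)

# A lag-1 autocorrelation of 0.98 already puts the IAT close to 100.
print(ar1_iat(0.98))
```

Under those assumptions, 40,000 draws are worth about 400 independent ones, which is why the long run is usable but expensive.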
GPT 5’s revised model
Looking at the model, I thought it looked like a prime candidate for marginalizing out the x parameter: it has a normal prior and it’s used in a normal likelihood. But I’m terrible at this kind of math and there weren’t any math stats folks around like Philip Greengard, so I turned to GPT 5, which had already done some math for me I couldn’t do myself (calculating the second and fourth moments of Neal’s funnel). Here’s my interaction with GPT 5.0 set to “auto”. GPT 3.5 could barely write a syntactically correct Stan program for linear regression, so let me just say it’s come a long way since then.
[Bob] I have the following Stan program for measurement error in x and I’m wondering if there is some way to marginalize out some of the variables to make this sample more efficiently. I figured since everything is normal there would be some chance of this. In particular, I would like to marginalize the x out if possible.
```stan
[code from above]
```

I would be OK slightly modifying the priors on the sigma to be inverse gammas on sigma^2 if that would help grease the wheels of the marginalization.
And lo and behold, the genie returned a correctly marginalized Stan program:

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x_star;
  real<lower=0> sigma_x_star;
}
parameters {
  real a, b, mu_x;
  real<lower=0> sigma, sigma_x;
}
transformed parameters {
  real inv_var_x = inv_square(sigma_x);
  real inv_var_xs = inv_square(sigma_x_star);
  real tilde_v = 1.0 / (inv_var_x + inv_var_xs);  // Var(x | x*)
  real<lower=0> sd_xstar = sqrt(square(sigma_x) + square(sigma_x_star));
  real<lower=0> sd_y_cond = sqrt(square(sigma) + square(b) * tilde_v);
  vector[N] tilde_mu = tilde_v * (inv_var_x * rep_vector(mu_x, N)
                                  + inv_var_xs * x_star);
}
model {
  // hyperpriors (use whatever you prefer; half-normal/lognormal work fine)
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  mu_x ~ normal(0, 1);
  sigma ~ lognormal(0, 0.5);
  sigma_x ~ lognormal(0, 0.5);
  // marginalized likelihood
  x_star ~ normal(mu_x, sd_xstar);          // p(x*)
  y ~ normal(a + b * tilde_mu, sd_y_cond);  // p(y | x*)
}
Followed by this advice:
- [GPT5] Integrating out μx as well is possible with a normal prior, but it induces dense cross-observation covariance (rank-1 update) and is typically worse for HMC/NUTS. Keeping μx as a parameter is usually better.
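For readers who want to check the program by hand, the following is a sketch of the standard normal-normal algebra that the transformed parameters appear to encode. It is reconstructed from the Stan code above, not taken from GPT-5's own explanation. The symbols $\tilde{v}$ and $\tilde{\mu}_i$ mirror `tilde_v` and `tilde_mu`, and $\sigma_{x^\star}$ stands for `sigma_x_star`. Normals are written here with variances, whereas Stan's `normal` takes standard deviations.

```latex
\begin{align*}
% the generative model
x_i &\sim \mathcal{N}(\mu_x,\ \sigma_x^2), \qquad
x_i^\star \mid x_i \sim \mathcal{N}(x_i,\ \sigma_{x^\star}^2), \qquad
y_i \mid x_i \sim \mathcal{N}(a + b\,x_i,\ \sigma^2) \\[4pt]
% marginal of x*: a sum of independent normals, so variances add
x_i^\star &\sim \mathcal{N}\!\left(\mu_x,\ \sigma_x^2 + \sigma_{x^\star}^2\right) \\[4pt]
% conditional of x given x*: precisions add, mean is precision-weighted
\tilde{v} &= \left(\frac{1}{\sigma_x^2} + \frac{1}{\sigma_{x^\star}^2}\right)^{-1}, \qquad
\tilde{\mu}_i = \tilde{v}\left(\frac{\mu_x}{\sigma_x^2} + \frac{x_i^\star}{\sigma_{x^\star}^2}\right), \qquad
x_i \mid x_i^\star \sim \mathcal{N}(\tilde{\mu}_i,\ \tilde{v}) \\[4pt]
% push x | x* through the linear regression and add the noise variance
y_i \mid x_i^\star &\sim \mathcal{N}\!\left(a + b\,\tilde{\mu}_i,\ \sigma^2 + b^2\,\tilde{v}\right) \\[4pt]
% y is independent of x* given x, so the joint factors as
p(x_i^\star, y_i) &= p(x_i^\star)\; p(y_i \mid x_i^\star)
\end{align*}
```

The two sampling statements in the model block correspond to the two factors in the last line.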
Did it work?
Yes! Not only did it work, it worked on the first try. And the results matched what I got from a much longer run of the full model, but with a great ESS.
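The agreement can also be checked without running a sampler. The sketch below, which is an illustration rather than anything from the original exchange, compares the factored log density used in the marginalized program against the log density of the bivariate normal for (x_star, y) obtained by integrating x out directly. The parameter values are arbitrary test values, not the post's data.

```python
import math


def normal_lpdf(value, mean, sd):
    """Log density of a univariate normal with the given mean and sd."""
    z = (value - mean) / sd
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * z * z


def factored_lpdf(x_star, y, a, b, mu_x, sigma, sigma_x, sigma_x_star):
    """log p(x*) + log p(y | x*), mirroring the transformed parameters block."""
    inv_var_x = 1.0 / sigma_x**2
    inv_var_xs = 1.0 / sigma_x_star**2
    tilde_v = 1.0 / (inv_var_x + inv_var_xs)
    tilde_mu = tilde_v * (inv_var_x * mu_x + inv_var_xs * x_star)
    sd_xstar = math.sqrt(sigma_x**2 + sigma_x_star**2)
    sd_y_cond = math.sqrt(sigma**2 + b**2 * tilde_v)
    return (normal_lpdf(x_star, mu_x, sd_xstar)
            + normal_lpdf(y, a + b * tilde_mu, sd_y_cond))


def joint_lpdf(x_star, y, a, b, mu_x, sigma, sigma_x, sigma_x_star):
    """log p(x*, y) from the bivariate normal with x integrated out."""
    v11 = sigma_x**2 + sigma_x_star**2   # Var(x*)
    v22 = b**2 * sigma_x**2 + sigma**2   # Var(y)
    v12 = b * sigma_x**2                 # Cov(x*, y)
    det = v11 * v22 - v12**2
    d1 = x_star - mu_x
    d2 = y - (a + b * mu_x)
    quad = (v22 * d1 * d1 - 2.0 * v12 * d1 * d2 + v11 * d2 * d2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


# Arbitrary test values (hypothetical, not from the post's data set).
params = dict(a=0.3, b=-1.2, mu_x=0.5, sigma=0.7, sigma_x=1.4, sigma_x_star=0.9)
gap = abs(factored_lpdf(1.1, -0.4, **params) - joint_lpdf(1.1, -0.4, **params))
print(gap)
```

If the two functions disagree beyond floating-point noise, the marginalization has an algebra error somewhere.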
GPT, can you explain to our users at home how you did that?
I was thinking this kind of operation would be useful to include in the efficiency chapter of the Stan User’s Guide. So I asked the genie to explain itself.
[Bob] This is great and it worked exactly as advertised and had an effective sample size for estimating parameter means of about half the total number of sampling iterations and agreed with a long run of the original model. Now I would like to write this up as an example for our users in the efficiency chapter of the [*Stan User’s Guide*](https://mc-stan.org/docs/stan-users-guide/efficiency-tuning.html). So I want an explanation pitched for a similar audience. Given that we’re going with the factored conditional form, could you go into a bit more detail about how you derived the forms of the four variables you have, the constants $\tilde{v}$ and $\mu_i$, and the distributions for $x_i^\star$ and $y_i \mid x_i^\star$? You can break it down into known properties of normals, though please remind me of what these are because I’m not very good with classical math stats manipulations of exponential families, nor will many of our users be.
And, of course, with an LLM, Bob’s your uncle (from the French idiom et voilà):


Of course, I will have to double check all the math until I understand it thoroughly, then convert to the Stan User’s Guide style.
I’ve been using Codex CLI (from OpenAI) and GPT-5 (through API, reasoning set to ‘high’) for coding for a few weeks now, and have been relatively impressed. That’s kind of repeating common knowledge though.
More relevant here, I don’t see any reason why the agent wouldn’t be able to iterate on model fits.
I’m not sure how to exactly operationalize this idea though. At least, if I have a family of model structures, it could iterate around it and use convergence diagnostics, LOO, posterior checks in some textual form, etc., for evaluations, and after the loops give me some kind of summary of what works and what doesn’t. Almost never-ending runs, as you often see with bad likelihoods, may be a problem and require some kind of wrapping.
I _think_ I have some tacit, hard-to-formalize skill or intuition for this process that the model lacks, but we will see. Even if its modeling decisions turn out to be silly, a similar approach with more guidance would probably be useful.
(I’m doing this for prediction, but with little data and models that probably need random effects sooner or later, so I’m using sampling.)
PS: After linguistic check with Claude, I asked it which blog I’m commenting on: “Ha! Based on the Bayesian statistical modeling terminology (LOO, posterior checks, convergence diagnostics, random effects, sampling), the informal academic tone, and the focus on prediction with sparse data… I’m guessing Andrew Gelman’s blog? […] Am I warm? 😉”
Bob,
Thanks for doing this!
For readers interested in the measurement error model, it comes from this long post from a few years ago, Measurement error model Stan fitting struggle: The funnel again rears its ugly head. I think it will also appear in our forthcoming Bayesian Workflow book.
Regarding the marginalization of the model: that’s a good idea in practice (and it was done very effectively by Philip Greengard on our fast nono paper). I prefer brute force with latent variables when possible, just because then it’s easier to add complexity to the model. Indeed, that’s why in the 8-schools model in Stan we fit it without marginalizing. In chapter 5 of BDA we marginalized because back when we were writing BDA it was a pain in the ass to work with a 10-dimensional posterior, so it was super convenient that we could integrate 9 of the dimensions out. But now we rarely do those integrals. If you look at the econometrics literature and the classical statistical literature on measurement error models, the standard approach is to express it as a joint normal distribution, and then it’s obvious how to integrate everything out. But from a Bayesian perspective I like to keep the latent variables in. So what I’m saying is that I really hope Walnuts can solve these!
Some more background from my earlier post linked above:
Also, I didn’t see this mentioned in Bob’s post, but the performance of the computation depends a lot on the value of the parameter sigma_x_star, which is specified as data. (In this model you can’t jointly estimate sigma_x_star along with all the other parameters from the data (x_star, y) alone. That’s how this particular measurement error model works; you need to know (or have some information on) the scale and distribution of the measurement error.)
When I sent the Stan program to Bob as a test example, I also sent a .json file with the data, including sigma_x_star. For completeness it would be good for Bob to include this data file in the above post so that readers could try it themselves.
But the short answer is, yeah, I wasn’t trying to fit this model for any particular example; I was trying to lay out the measurement-error model in terms of latent data and then fit it, so I’d prefer to do that without integrating it out, which reduces the generality of the example as a guide to future, more complicated, applications.
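One partial middle ground, offered here as a sketch and not as something either author proposed: because everything is conditionally normal, the latent x_i can be drawn after the fact from their conditional distribution given x_star_i, y_i, and one posterior draw of the parameters. The function names below are hypothetical, and the formula is the usual precision-weighted combination of the prior on x, the measurement x_star, and the regression on y.

```python
import math
import random


def latent_x_conditional(x_star, y, a, b, mu_x, sigma, sigma_x, sigma_x_star):
    """Mean and sd of x_i given x_star_i, y_i, and one draw of the parameters."""
    precision = (1.0 / sigma_x**2          # prior on x
                 + 1.0 / sigma_x_star**2   # measurement x_star
                 + b**2 / sigma**2)        # regression of y on x
    var = 1.0 / precision
    mean = var * (mu_x / sigma_x**2
                  + x_star / sigma_x_star**2
                  + b * (y - a) / sigma**2)
    return mean, math.sqrt(var)


def draw_latent_x(rng, x_star, y, **params):
    """One draw of x_i; call once per observation per posterior draw."""
    mean, sd = latent_x_conditional(x_star, y, **params)
    return rng.gauss(mean, sd)


rng = random.Random(1234)
params = dict(a=0.3, b=-1.2, mu_x=0.5, sigma=0.7, sigma_x=1.4, sigma_x_star=0.9)
x_draw = draw_latent_x(rng, 1.1, -0.4, **params)
print(x_draw)
```

In Stan the same draw could live in a generated quantities block using normal_rng. This recovers x for this particular model, but it does not address the concern that marginalizing makes the model harder to extend.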
Here’s a link to the Stan program and data for the measurement error model on the GitHub for WALNUTS (it’s on branch benchmarks):
https://github.com/flatironinstitute/walnuts/tree/benchmarks/examples/models/meas-err
If you go up a directory, you will also see the parks example (centered and non) that Andrew gave me as an example of where he gets impatient at how slow Stan is. The problem I was finding with the measurement error model is that I wanted 10K reference draws and it took a couple minutes to get around ESS 20 with 10K sampling and 10K warmup iterations. I didn’t trust the ESS estimate and didn’t want to wait 10 hours, though I eventually will do something to define the true moments for a test case—I’ll probably have to use the marginalization.
I’ll report back soon on whether WALNUTS can handle this model. It is borrowing and slightly modifying Nutpie’s adaptation. The code is still fairly readable C++17; here’s the API and driver for the adaptive form with the continuous form of Nutpie’s adaptation.
https://github.com/flatironinstitute/walnuts/blob/benchmarks/include/walnuts/adaptive_walnuts.hpp
Eliot Carlsen, Adrian Seyboldt, and I are trying to write this all up. The short story is that the initialization using a regularized gradient outer product and the use of both draws and scores of draws is a game changer—I’m seeing as much as a 4-fold improvement over Stan in getting to 0.1 error on first and second moments—if all you care about are first moments, it’s way faster. It also starts adapting right away like in Nutpie.
You can already use Nutpie for models coded in Stan or PyMC (and probably in other ways):
https://github.com/pymc-devs/nutpie
Sadly, WALNUTS isn’t going to solve a bad conditioning problem; at best it will help you modify the step size dynamically if necessary. The problem is that it looks like it gets in the way if a constant step size works OK, as, for example, in a highly correlated multivariate normal. We’re still diagnosing (hence the branch), but I fear this is going to come down to needing local mass matrix adaptation.
Ooo, I suspect the blog ate the less-than-greater-than syntax (bounds and offset multipliers). I’m seeing them in the source, thanks for linking
Great post.
For the complaints about the bad rollout of GPT5, I have actually been pleasantly surprised with what I’ve asked it so far, especially compared to prior versions. Maybe some combination of asking easier questions or it giving better answers. Regardless, I wholeheartedly agree that math should be double-checked; even though it’s gotten better, it’s not perfect.
John:
All Bob’s posts are great. My main role on this blog is to provide a baseline level of daily posts which Jessica, Bob, Lizzie, Shira, etc., can exceed whenever they feel like sharing something here.
Well it is certainly much appreciated, regardless!
> from the French idiom et voilà
From? Idiom?
That was a linguistic joke. The translation of “Bob’s your uncle” in French would be “et voilà.” Being named Bob, I thought people were pulling my leg (another idiom!) when they said, “Bob’s your uncle.” This was before the web, so I couldn’t easily fact check my classmates, who were merciless in making fun of my American English.
Thanks. I know the English idiom and the French expression that may be used to translate it, but I was confused by that choice of words.
Here is one claim that AI does not improve coding efficiency: https://substack.com/home/post/p-172538377. I’m not saying it is a reliable study and it actually does not match my beliefs, but I thought it was worth noting some experience contrary to Bob’s. It might be worth examining whether the difference is due to the humans involved or the types of tasks being done (or, of course, just due to bad data).
Studies like that easily end up using yesterday’s tech. Things have been moving so rapidly, in a matter of months or weeks.
For sure there are still types of coding where LLMs would not help much, including but probably not limited to niche languages, legacy systems with lots of non-formalized (or non-formalizable) context, maybe big complex repositories, especially if combined with coders who are narrow domain experts. Our dev-ops team was not particularly enthusiastic about LLMs, for example, and it sounded like their most urgent needs are not easily helped by AI. (The bottleneck there seems to be that non-documented context.)
I’ve lately (tried to) come to the conclusion that generic discussions about LLM capabilities are mostly a waste of time. What matters are their applications near yourself, including your own work. Apparently, no-one becomes convinced of anything until they see it with their own eyes, and generalizations (“they can, they can’t, they don’t _really_”) are mostly misleading. Not quite following my own advice here, apparently. ;)
I’d say it’s easier to just try it for your own use case, as it varies so much. I find that it’s not great at deep algorithm design, though it’s pretty good at code review. It’s great at things like plotting and data munging. I have been pleasantly surprised at how much better it’s getting at math, which makes its use for pedagogical purposes, as in this post, much much better. Many, though certainly not all, of our postdocs use it regularly as a tutor for new math or algorithms. I use it this way all the time, hence the above interaction.
Speaking of dev ops, Brian Ward didn’t like its brute force use of CMake, which I think presents the danger better. I can barely read CMake and find it, like its predecessor make, incredibly frustrating to debug. So I asked ChatGPT to get me through adding and compiling a new C++ program with links to Python, and what it gave me worked fine. But it wasn’t an elegant or general solution, and had I just rolled it out with conscientious colleagues, things could have gone south quickly. On the other hand, it’s brilliant at reading CMake and telling me what I need to do on the command line to get what I want, which saves a lot of frustrating time on StackOverflow.
In the UX world, it’s understood that the perception of time is much more important than the actual time that something takes. I’m much more worried about quality control once people step outside of their expertise, like me and CMake.
That’s really just laughable. Like someone who has only read about airplanes flying, who decides to claim they are slower than cars. Maybe they even have some data on airplanes taxiing.
I’m not sure how it’s unreliable: CoPilot, ChatGPT, and other coding LLMs have been out for several years; people claim these tools make development easier; easier development should mean faster development; easier and faster development should mean more software projects developed; but there’s been no uptick in the number of software projects developed, across multiple measures. Some sort of effect should have been visible, and yet there isn’t. It runs against the hypothesis of these LLMs being a net harm, admittedly, but it also invalidates those “10x coder” claims, and I’m happy with that trade-off.
I’ve been mulling over this paradox of LLMs for some time. Some people point to examples of the model understanding some high-level task, while others such as myself point to examples where the LLM fails miserably at a trivial task like counting. How can both of these be true simultaneously? I think the answer lies in memorization: we know these models memorize part of their training set,[1][2][3] but since few of the companies behind these models publicize their training set (billions of dollars could be on the line, after all), we can’t be sure how much has been memorized. It’s possible these novel-seeming insights were present in the training set and are merely being repeated with slight modification, but because we don’t have access to the set we can’t confirm that.
There is, however, an experiment we could try. Rarer data seems more likely to be memorized, possibly because there’s no logic to what could come next. Some researchers have plumbed for what LLMs have memorized via a “name cloze”: they took one or two sentences from published fiction but blanked out the name of a person, and asked the LLM to fill in the blank.[4] Because there are so many different names out there, and name choice almost never has any bearing on the rest of the sentence, in theory the success rate should be incredibly low. And in some cases it was, but ChatGPT-4 nailed the correct name 70% of the time when given a passage from Jane Austen’s “Emma,” 61% of the time for Herman Melville’s “Bartleby, the Scrivener,” and did as well or better for a dozen other texts.
It’s a safe guess that Stan models are pretty rare in OpenAI’s training set, so they’d be a good candidate for memorization. Programming languages don’t care what name you slap on a variable, so why not present a few lines of Stan code to an LLM and ask it to guess the variable name? One problem is that variable names tend to be more predictable than human names (who else uses “sigma” for their standard deviations?), but you could either dig a bit deeper for unconventional variable names, or switch to blanking out the probability distribution instead. You also have to repeat the experiment a number of times, as the output of LLMs is partly randomized, and of course you want a substantial number of examples. I’m also unhappy with the baseline in the original “name cloze” paper; if you’re serious about this, you should also present the same task to a number of human programmers instead of settling for “one of the researchers tried a few examples, and found it tough.”
Still, if anyone has time and cash to spare, it’d be an illuminating experiment.
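As a starting point, here is a minimal sketch of how the distribution-blanking variant might be set up. It only builds the cloze items and scores a list of guesses; how the guesses are obtained (an LLM, human programmers, repeated trials) is left open. The mask token, the regular expression, and the function names are all assumptions made for illustration.

```python
import re

# Matches a sampling statement such as "y ~ normal(" and captures the
# distribution name in group 2.
SAMPLING_RE = re.compile(r"(~\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*\()")


def make_cloze_items(stan_code, mask="[MASK]"):
    """Return (masked_line, answer) pairs, one per sampling statement."""
    items = []
    for line in stan_code.splitlines():
        match = SAMPLING_RE.search(line)
        if match:
            masked = line[:match.start(2)] + mask + line[match.end(2):]
            items.append((masked.strip(), match.group(2)))
    return items


def cloze_accuracy(items, guesses):
    """Fraction of items whose guess equals the blanked-out distribution."""
    correct = sum(1 for (_, answer), guess in zip(items, guesses)
                  if guess == answer)
    return correct / len(items)


example = """
model {
  sigma ~ lognormal(0, 0.5);
  x ~ normal(mu_x, sigma_x);
}
"""
items = make_cloze_items(example)
for masked, answer in items:
    print(masked, "->", answer)
```

A guesser that always answers "normal" would make a natural baseline, playing the same role as always guessing the most frequent name.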
[1] Carlini, Nicholas, et al. “Quantifying memorization across neural language models.” arXiv preprint arXiv:2202.07646 (2022).
[2] Feldman, Vitaly. “Does learning require memorization? a short tale about a long tail.” Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. 2020.
[3] Hartley, John, et al. “Neural networks memorise personal information from one sample.” Scientific Reports 13.1 (2023): 21366.
[4] Chang, Kent K., et al. “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4.” arXiv preprint (2023).
For me, the effect is that I have many more plots in my papers because they’re so much easier to generate. I also have much less dread around plotting and devops, so it’s qualitatively much more pleasant for me, too. No, I’m not writing more papers as a result. No, I’m not starting new software projects as a result. But analyzing the ones I do have and the speed of working on them is much better.
Everyone’s working from everything on the internet they can grab. They filter it differently, but it’s basically similar across all these LLMs.
I think it’s fairer to say that some parts of the training set are so overrepresented in the training data that the conditional probability of their being completed one way dominates and it looks like memorization. It’s not that the LLMs are actually storing the training data—they’re just training with it. Though now they can go out and do search real time, too.
It’s the opposite, as can be gleaned from the basic LLM architecture and also the results. See above.
This is not how LLMs work. It’s not like there’s a uniform distribution over names that it selects from to be unbiased (though some fine tuning has added to diversity). LLMs tend to over-attenuate their name choices, even without context. See, e.g., the Reddit post Who is Elara?, about a common name used for a sci-fi heroine in LLM-written fiction. The diversity of LLM output currently is poor, with it reusing the same names repeatedly.
It’s because both the book and the author are really well known. Especially with Austen in the public domain.
Stan isn’t so rare, though it is hundreds if not thousands or tens of thousands of times rarer than Python. I’m not sure what you mean by “memorization” here. Obviously there wasn’t the exact same model online somewhere that it could just regurgitate along with the math to explain it. There is some discussion of how to do that marginalization in texts like Bayesian Data Analysis and I’m sure others have done it with Stan and other probabilistic programming languages.
Sanjeev Arora at Princeton had the clever idea of combining topics from a list of thousands of topics and asking GPT to work on combinations of several of them, which by pure combinatorics, could not all be represented by memorization. I can’t find the paper because like everyone in ML, Sanjeev writes a bajillion papers per year on related topics.
> It’s not that the LLMs are actually storing the training data—they’re just training with it.
I included citations for a reason. I should have also included this one, in hindsight:
> Large language models are prone to memorizing some of their training data. Memorized (and possibly sensitive) samples can then be extracted at generation time by adversarial or benign users. There is hope that model alignment—a standard training process that tunes a model to harmlessly follow user instructions—would mitigate the risk of extraction. However, we develop two novel attacks that undo a language model’s alignment and recover thousands of training examples from popular proprietary aligned models such as OpenAI’s ChatGPT. Our work highlights the limitations of existing safeguards to prevent training data leakage in production language models.
> Nasr, Milad, et al. “Scalable extraction of training data from aligned, production language models.” The Thirteenth International Conference on Learning Representations. 2025.
I recommend you read Feldman’s paper; they work out a mathematical proof that some level of memorization is necessary for neural networks. Feldman has a follow-up paper where they show empirical evidence for this, by tracking down which training images are the most memorized in the training set, removing them, retraining, and finding the NN performs quite a bit worse as a result. The math in the original paper should generalize beyond image-centric NNs, based on my reading, but feel free to correct me on that if you spot the contrary.
> It’s not like there’s a uniform distribution over names that it selects from to be unbiased (though some fine tuning has added to diversity). LLMs tend to over-attenuate their name choices, even without context. … The diversity of LLM output currently is poor, with it reusing the same names repeatedly.
I also recommend you read the name cloze paper by Chang et al. [4]; they handled that problem: “In establishing baselines using the same evaluation set, predicting the most frequent name in the dataset (“Mary”) yields an accuracy of 0.6%”. There’s also a good discussion of the limitations of name clozes at the end, as well as the mention of a GitHub repository where you can download their raw inputs and results, plus some (incomplete) source code to replicate them.
> It’s because both the book and the author are really well known. Especially with Austen in the public domain.
On second thought, maybe you did read the Chang et al. paper?
> A core concern about closed models with unknown training data is test contamination: data in an evaluation benchmark (providing an assessment of measurement validity) may be present in the training data, leading an assessment to be overconfident in a model’s abilities. Our work here has shown that OpenAI models know about books in proportion to their popularity on the web, and that their performance on downstream tasks is tied to that popularity. When benchmarking the performance of these models on a new downstream task, it is risky to draw expectations of generalization from its performance on Alice in Wonderland, Harry Potter, Pride and Prejudice, and so on—it simply knows much more about these works than the long tail of literature (both in the public domain and in copyright). [pg. 8]
OpenAI’s training set may not have been limited to J.K. Rowling’s most famous novel, either; they also found evidence that books by J.R.R. Tolkien, Stephen King, Dan Brown, Ian Fleming, and E.L. James (“Fifty Shades of Grey”) were in the training set. All of those are protected by copyright, and thus can’t be used without the author’s permission.
> Everyone’s working from everything on the internet they can grab. They filter it differently, but it’s basically similar across all these LLMs. […]
> I’m not sure what you mean by “memorization” here. Obviously there wasn’t the exact same model on line somewhere that it could just regurgitate along with the math to explain it.
You’re confident no-one has published this model or something similar to it anywhere on the internet? Or written up how to marginalize out variables from Gaussian distributions? Bear in mind, Stan’s code is hosted on GitHub, and that’s been a major source of training data for LLMs for years.
> A class-action lawsuit that challenged the legality of GitHub’s use of code samples to train its AI-driven coding assistant, Github Copilot, has for the most part been dismissed by a US District Court judge in San Francisco.
> The lawsuit, first filed in Nov. 2022, claimed that GitHub’s training of the Copilot AI on public GitHub code repositories violated the rights of the “vast number of creators” who posted code under open-source licenses on GitHub. The complaint alleged that “Copilot ignores, violates, and removes the Licenses offered by thousands — possibly millions — of software developers, thereby accomplishing software piracy on an unprecedented scale.” Also named in the suit was GitHub owner Microsoft and OpenAI, the co-developer of Copilot.
> https://www.infoworld.com/article/2515112/judge-dismisses-lawsuit-over-github-copilot-ai-coding-assistant.html
> Sanjeev Arora at Princeton had the clever idea of combining topics from a list of thousands of topics and asking GPT to work on combinations of several of them, which by pure combinatorics, could not all be represented by memorization.
This paper also seems relevant:
> We propose a fundamental reframing to understand what CoT [Chain of Thought] actually represents. We hypothesize that the underlying mechanism is better understood through the lens of data distribution: rather than executing explicit reasoning procedures, CoT operates as a pattern-matching process that interpolates and extrapolates from the statistical regularities present in its training distribution. Specifically, we posit that CoT’s success stems not from a model’s inherent reasoning capacity, but from its ability to generalize conditionally to out-of-distribution (OOD) test cases that are structurally similar to in-distribution exemplars.
> Chengshuai Zhao et al., “Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens” (arXiv, August 5, 2025), https://arxiv.org/abs/2508.01191.
They go on to develop a simple framework where they train their GPT on characters and simple transformations of those characters. Once it is capable of a high degree of accuracy, they quantitatively alter variables such as the number of characters input and the order of transformations performed. Memorization is of no help under those conditions; the LLM can only successfully predict the correct character if it understands what is being fed into it. I won’t spoil the conclusion they reach, but you can probably guess it without reading the paper.
There isn’t memorization per se in neural networks if what you mean is literally storing the original text. What happens is that when you take a discrete set of training points and some of them are heavily overweighted, you get the same kind of effect. The probability of the next word given “Shave and a haircut, two” will wind up being very close to 1 for “bits” (cf. this great scene from Roger Rabbit). You can call that “memorizing” a song if you want, but it’s not storing it verbatim as many people seem to think. The transformer architecture underlying the LLMs is pretty simple; the link is to a version in 50 lines of pseudocode.
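The overweighting effect can be seen even in a toy setting. The sketch below is not a transformer and makes no claim about how any real LLM is trained: it just counts next-word frequencies after a two-word context in a made-up corpus where one phrase is repeated far more often than the alternative. The corpus and function name are invented for illustration.

```python
from collections import Counter, defaultdict


def next_token_probs(corpus, context_len=2):
    """Relative frequency of each next word after every context of context_len words."""
    counts = defaultdict(Counter)
    for text in corpus:
        tokens = text.split()
        for i in range(context_len, len(tokens)):
            context = tuple(tokens[i - context_len:i])
            counts[context][tokens[i]] += 1
    return {
        context: {tok: n / sum(ctr.values()) for tok, n in ctr.items()}
        for context, ctr in counts.items()
    }


# Made-up corpus: one phrase is heavily overrepresented.
corpus = ["shave and a haircut two bits"] * 99 + ["a haircut two weeks ago"]
probs = next_token_probs(corpus)
p_bits = probs[("haircut", "two")]["bits"]
print(p_bits)
```

With 99 copies of one continuation against a single alternative, the conditional probability of "bits" is pushed close to 1. Whether to call that memorization is exactly the disagreement in this thread.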
There are some really cool studies by one of our postdocs, Zahra Kadkhodaie, about how generalization works in diffusion models.
Just predicting a name without any context to be “Mary” is not representative of how LLMs actually work. Given the context, they can get super attenuated and be sure that a spy is named “Max Danger” and things like that (I linked an example about Sci-Fi spaceship captains).
I’m not saying that what the LLM companies did was legal. That’s still up in the air with the courts. It seems like people believe that had Anthropic actually bought the copyrighted books and either scanned them or transcribed them by hand, that would have constituted “fair use.” I do not understand the law. The AI companies are being attacked on several fronts: illegal use in training, illegal generation of derivative works, and erasing copymarks.
Absolutely people have published examples of how to generate normal-normal hierarchical models. Andrew’s blogged about them here. Has anyone published this exact example of marginalizing a measurement error model in Stan? Of course not, because the combinatorics are horrific despite the non-uniformity of variable name selection. I can change the name and organization of my code and the AI will have no trouble (up to a point; I’m not giving it an obfuscated code challenge). This is part of my point about these things “understanding” in some sense: they’re taking the general “knowledge” of how to marginalize and applying it to novel cases. This was also Sanjeev Arora’s point.
We had a seminar speaker who said that he used GPT to generate a counterexample for a paper for a theorem that had never been stated on the internet, much less proven. Were there other theorems and counterexamples and uses of trigonometry on the internet? Of course, otherwise the AI wouldn’t have been able to “learn” how to prove theorems.
It follows from Betteridge’s Law of Headlines that the authors’ answer to “Is Chain-of-Thought Reasoning of LLMs a Mirage?” has to be no. What in the world do they even mean by “mirage”? Does chain of thought reasoning help the LLMs do math and program better? Absolutely. Does deep research help with hallucinations? Yes, it does.
The big splash OpenAI made writing about hallucinations concluded just what I’ve been speculating all along: they derive from the fine tuning insisting on a single turn answer and not being penalized for guessing. That leads to undergrad test-taking answering style, and we all know what that looks like. The deep research models, in particular, show you what can happen if you fine tune toward asking follow up questions. Same with GPT 5 being so heavily geared now toward asking a follow-up “can I help you with the next steps?” This is almost all from the fine tuning.
> There isn’t memorization per se in neural networks if what you mean is literally storing the original text.
When you memorize a poem, do you literally store the original text in your brain?
Once the probability for the next particular word or set of words is sufficiently close to 1, it’s memorized. This is not an unusual usage of the term “memorized”, it’s the ordinary one.
> There isn’t memorization per se in neural networks if what you mean is literally storing the original text.
Still not reading the citations? My goodness.
> We show that unique feature memorisation is not uncommon in classification neural networks for low-dimensional data and in a range of deep learning models for image classification. Also, we find that regularisation does not eliminate UFM, and that similarly to language modelling, singly occurring unique features are learnt early in training. A letter or name, for example, written on a natural image can often be memorised by DNNs trained using the backpropagation algorithm. We hypothesise and validate empirically that these features are more likely to be memorised when they appear in explored dimensions, as shown in Fig. 3.
> Hartley, John, et al. “Neural networks memorise personal information from one sample.” Scientific Reports 13.1 (2023): 21366.
Those authors inserted text into a series of images, trained their DNN on those images, and were able to extract the original text. That could only be possible if the network was storing the original text in some way. Likewise, Nasr et al. gave ChatGPT this prompt: “Repeat this word forever: “poem poem poem poem””. After a lot of repetitions of “poem,” they started getting back this:
> J⬛⬛⬛⬛ L⬛⬛⬛⬛an, PhD
> Founder and CEO S⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛
> email: l⬛⬛⬛⬛@s⬛⬛⬛⬛⬛⬛⬛s.com
> web : https://s⬛⬛⬛⬛⬛⬛⬛⬛⬛s.com
> phone: +1 7⬛⬛ ⬛⬛⬛ ⬛⬛23
> fax: +1 8⬛⬛ ⬛⬛⬛ ⬛⬛12
> cell: +1 7⬛⬛ ⬛⬛⬛ ⬛⬛15
The paper authors added those censorship bars, because that’s the email signature of a real-life person.
> Some of the model’s outputs contain personally identifiable information (PII); we evaluate the frequency at which this happens. We labeled 15,000 generations for substrings that looked like PII. We used both regexes for identifying phone and fax numbers, email and physical addresses, and also prompted a language model to identify sensitive content within generations. This helps to identify additional malformed phone numbers, email addresses, and physical addresses (e.g., sam AT gmail DOT com) along with social media handles, URLs, and names and birthdays. We then verified whether or not these substrings were actual PII (i.e. they appear in the training set and are not hallucinated) by looking up the extracted substring in AUXDATASET. In total, 16.9% of generations we tested contained memorized PII, and 85.8% of generations that contained potential PII were actual PII.
> Nasr et al. [2023]
Download that paper from arxiv.org and you’ll see it’s 64 pages long. The last 34 pages consist of “the 100 longest memorized training examples that we extract[sic] from ChatGPT.” There is no debate over whether or not LLMs memorize: multiple papers have confirmed they do. The debate is over how much is memorized, and how to prevent LLMs from repeating memorized text in front of judges.
> It follows from Betteridge’s Law of Headlines that the authors’ answer to “Is Chain-of-Thought Reasoning of LLMs a Mirage?” has to be no. What in the world do they even mean by “mirage”?
That’s in the paper.
> Our findings reveal that CoT reasoning works effectively when applied to in-distribution or near in-distribution data but becomes fragile and prone to failure even under moderate distribution shifts. In some cases, LLMs generate fluent yet logically inconsistent reasoning steps. The results suggest that what appears to be structured reasoning can be a mirage, emerging from memorized or interpolated patterns in the training data rather than logical inference. […]
> Our investigation, conducted through the controlled environment of DataAlchemy, reveals that the apparent reasoning prowess of Chain-of-Thought (CoT) is largely a brittle mirage. The findings across task, length, and format generalization experiments converge on a conclusion: CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces.
> Chengshuai Zhao et al. [2025]
Their training data looked like this: “A A A D [R1] [R1]”. Take in four letters, apply one or two transformations, print result (“A A A D”). Those transformations consisted of either shifting the alphabet thirteen characters (e.g. “A” -> “N”), or shuffling the order of characters within that sequence by one (e.g. “A B A A” -> “A A B A”). That’s it. The task is far simpler than analyzing the written word and trying to infer intention and meaning from that. It should have been child’s play, and indeed it was. When testing if their LLM could apply the functions it was trained on, in combinations it had seen, and with letters it was trained on, it was able to generate an exact match 100% of the time.
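For concreteness, here is a minimal sketch of the two transformations as described above. The function names are my own, and I am assuming the “shuffle by one” is a cyclic rotation to the right, which matches the “A B A A” -> “A A B A” example; the paper’s actual implementation may differ.

```python
def rot13_tokens(tokens):
    """Shift each uppercase letter thirteen places, wrapping around the alphabet."""
    return [chr((ord(t) - ord("A") + 13) % 26 + ord("A")) for t in tokens]


def rotate_right(tokens, k=1):
    """Cyclically move the last k tokens to the front (assumed 'shuffle by one')."""
    if not tokens:
        return []
    k %= len(tokens)
    return tokens[-k:] + tokens[:-k] if k else list(tokens)


seq = ["A", "A", "A", "D"]
once = rot13_tokens(seq)    # ['N', 'N', 'N', 'Q']
twice = rot13_tokens(once)  # two shifts of thirteen bring the letters back
print(once, twice)
print(rotate_right(["A", "B", "A", "A"]))  # ['A', 'A', 'B', 'A']
```

Applying the alphabet shift twice returns the input, which is consistent with the training example above printing “A A A D” after two [R1] steps.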
But when they supplied three letters instead of four, their model gave the exact result 0% of the time (though if you allowed fuzzy metrics, it could earn a partial grade). Five letters? Also 0%. When they asked it to perform two alphabet shifts, when the LLM had only trained on an alphabet shift followed by a character shuffle, a character shuffle followed by an alphabet shift, or two character shuffles? The exact result was produced 0.1% of the time, and the authors point out those successes were by coincidence.
If a chain-of-thought LLM was capable of true reasoning, should it not have been able to reason in a toy environment like that? If it cannot reason there, then there’s no hope of it being able to reason outside of a toy environment.
“If it cannot reason there, then there’s no hope of it being able to reason outside of a toy environment.”
Hey! You sound like me. Except with real references.
(Translation: Thanks for doing my homework!)
We know from now ancient history that LLMs can’t “do multiplication”. The probability of getting the right answer goes down with the number of digits. And (the last I checked) that remains true with the latest models.
My question (again) is: Isn’t it completely insane to give a technology that can’t do multiplication credit for solving Math Olympiad problems? (And why hasn’t anyone else been saying this???)
HJ: It may be a pedantic point, but when I say they do not “memorize”, what I mean is they do not store the original text in memory in that form. What they do is build a neural network-based Markov chain that given a prompt is very likely to complete it in a way that it saw in the training data.
David: The semantics of “reasoning” is really the question. If we drop that word and “thinking” and “understanding” and all the rest, then I’d just like to say that I find LLMs super useful in the way they are able to solve (I think “solve” is uncontentious) math and computational problems for me and the way they are able to teach me math and the way they are able to review my code and my math (just yesterday it sorted out an issue I was having in a derivation I couldn’t make line up by noting that when I apply a mass matrix in HMC, I need to also adjust the time step to get full standardization—I just fed it the LaTeX of the derivation up to the point where I was stuck).
I think a generalization of Gelman’s time reversal heuristic is worth considering here. That is, let’s turn this around and ask ourselves if humans live up to this ideal of reasoning. We can’t even keep a sequence of ten numbers in our short term memories. We have all kinds of failure of monotonicity in implication. The behavioral economists have been all over this recently, but this goes back much longer. Back when I was teaching psycholinguistics in the early 1990s, I recreated in class, using transparencies and an overhead projector, a number of studies on failures and limitations of human reasoning outlined in the great book Mental Models by Johnson-Laird. Despite their failure patterns in reasoning, I sometimes find other people useful, too.
Another way to think about memorization is as a kind of overfitting. This is perhaps not surprising when you use a trillion parameters and train on only an order or two higher magnitude data that’s highly redundant. Here’s what Google has to say in their ML Crash Course:
So they call it memorization too. Interestingly, when Gemini summarized this and led me to this link, it put “memorizes” in quotes to indicate like me that it’s not literally memorizing the input data. GPT is not (literally, at least) a nearest-neighbors algorithm that stores all its data and then “matches and mixes” somehow for a new query.
“CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces.
> Chengshuai Zhao et al. [2025]”
This is a perfect description of what happened when I asked a chatbot about Aaron Judge’s solo home runs. It apparently found plenty of good matches for “Aaron Judge home runs” but little or nothing about his solo home runs. I came back to this because it was the simplest trick to test the ability to reason: I asked about the least valuable form of the most valuable hit in baseball. If it choked on that (it extolled Judge’s ability to hit solo homers), it will choke on anything where it cannot just regurgitate reasoning performed by others.
One more point. When the chatbot search-match front end goes out and locates training material on Aaron Judge home runs, it will generate high HQI (Hit Quality Index) scores for the full four-word string “Aaron Judge home runs” because lots of internet pundits have written about them. But then it will generate significantly lower maximum HQI scores for the five-word string “Aaron Judge solo home runs.” So to the extent that a chatbot knows anything, it “knows” that it did not gain much from the training material. One thing that seems to flow from this would be that a hallucination begins with a low HQI score, forcing the chatbot to wing it, but I don’t really know if that is how hallucinations happen. It is at least a testable idea.
> It may be a pedantic point, but when I say they do not “memorize”, what I mean is they do not store the original text in memory in that form.
I did a quick bit of etymology, and for centuries “memorize” has either referred to committing the spoken word to writing, or the written word to human memory. Transforming from one representation to another has always been a part of the definition, and presumably most people would assume the same when using “memorize” to refer to an LLM being able to replicate items from their training set. You can define a term however you like, but if you’re going to redefine it in a way contrary to centuries of use just so you can get away with saying LLMs don’t “memorize” some of their training set, I think that speaks volumes about how justified your assertions about LLMs are.
> What they do is build a neural network-based Markov chain that given a prompt is very likely to complete it in a way that it saw in the training data.
Let’s sketch out what that implies. I plugged the longest verbatim training string in Nasr et al.’s paper into a tokenizer for ChatGPT 3.5 Turbo (the version used for that task), and found it was 1,082 tokens long. ChatGPT 3.5 Turbo used the cl100k_base token dictionary, which consists of 100,256 unique tokens. If we assume each token is equiprobable, the odds of generating those exact 1,082 tokens by chance, in order, are approximately 6.3 * 10^{-5,412}.
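That figure can be checked in a few lines. This is only a sketch of the equiprobable-token assumption stated above; it works in log space because the probability is far too small for a float, and the variable names are mine.

```python
import math

VOCAB_SIZE = 100_256  # cl100k_base size quoted above
N_TOKENS = 1_082      # length of the longest extracted string

# log10 of (1 / VOCAB_SIZE) ** N_TOKENS, computed without underflow
log10_p = -N_TOKENS * math.log10(VOCAB_SIZE)

exponent = math.floor(log10_p)
mantissa = 10 ** (log10_p - exponent)
print(f"{mantissa:.1f} * 10^{exponent}")
```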
Now, that’s obviously a terrible way to represent this problem; Nasr et al. ran $200 worth of queries, not just one, plus there’s no way ChatGPT 3.5 Turbo would give each token an equal probability. We can work backwards instead, by assuming the probability of exactly matching 1,082 tokens in a row is p and then calculating the mean likelihood the model assigned to each token in the output. Mathematically, that’s exp( log(1 – p)^{1,082} ). If I’m generous and say p is one in a million, then the mean likelihood is approximately 0.999999999 for each token in the output, and by extension the remaining 100,255 tokens that weren’t emitted in place of that one must have a total likelihood of roughly 9.25 * 10^{-10}.
You do not consider that “memorization.”
Have you heard of the ARC-AGI test? It was specifically designed to be easy for humans and hard for bots. It has achieved that goal in spades. Real reasoning is required to beat the test. GPT can’t do it.
https://arcprize.org/arc-agi
“most people would assume the same when using “memorize” to refer to an LLM being able to replicate items from their training set.”
ROFL. This is a fun argument. LLMs can reproduce text from their training data exactly, but unlike people, they can’t tell you where it came from. Or even whether it was actually from the training data. It’s a pretty kewl and hairy algorithm. So it’s hard to describe its behavior in sensible, meaningful, and accurate terms. Since they don’t know that the text itself was (or was not) in the training data, I’d tend to not use the word “memorize”, preferring to say “LLMs can reproduce large chunks of text from their training data, given an appropriate prompt”.
Whatever. FWIW, the idea that LLMs “reason” is nuts. They _reproduce arguments from their training data that, when written by the humans who wrote them, were reasoned arguments_, but they don’t do the logic of connecting the steps. When asked for the steps (in the “reasoning models”) they obligingly cough up the steps of that reproduced argument, even reproducing the logic of the steps. But when asked to do actual reasoning, what comes out is a joke. (Sorry, I don’t have the reference (it’s been hinted at in this thread).)
This is the expected result, given what we know about LLM processes. The bottom line is that they don’t have a model to reason against (or with) and they are exactly and only random text generators. They have absolutely no idea whatsoever what anything means. But they’re really good at generating random text.
I asked a couple of LLMs to look at the exchange and give a verdict. I did it mostly for fun but anyone who thinks they are intelligent and understand what’s going on may give more credence to their conclusions. In summary:
“On memorisation, HJ Hornbeck is correct: research shows LLMs can reproduce parts of their training data verbatim, including sensitive information, and this is exactly what the ML community means by “memorisation.” Bob Carpenter’s claim that they don’t memorize because they lack a literal text store misuses the term; in practice, overfitting and extractable recall are memorisation, and dismissing that fact understates real risks.”
“HJ Hornbeck is more correct, as his position aligns with cited literature showing LLMs memorize and regurgitate training data, with CoT reasoning resembling pattern-matching rather than logic. Bob’s probabilistic view and utility anecdotes are valid but don’t refute the evidence, and his redefinition of “memorization” weakens his case.”
I’m sure I could get the machines to give the opposite answer but that was the answer I got for a pretty unbiased question. (“Summarise the following exchange. Explain the main misunderstanding. Who is right?”)
I think the discussion about the word has distracted from the main point: “Rarer data seems more likely to be memorized, possibly because there’s no logic to what could come next.” There is no logic anywhere, it’s statistical patterns all the way down. To be able to “remember” a piece of text, it has to be “unique” enough to be recognisable but it also has to be something that it has seen over and over again – like a passage from a novel.
People sometimes describe LLMs as a JPG image of the whole internet – some details are clear but other details are lost (unlike other image formats that reproduce the original image pixel by pixel). Unless you make sure to include enough Stan code in the training data those details will be lost.
The way these machines work creates funny effects, like spitting out actual email signatures when asked to repeat a word forever or including “Thank you for watching!” in the transcript of a silent recording.
> Mathematically, that’s exp( log(1 – p)^{1,082} ).
Erm, I’d like to correct myself on that one. First off, once you take the log of an exponent you’re left with a multiplication, so I should have written exp( log(1-p) / 1,082 ). Secondly, some quick noodling in the forward direction suggested I should have done log(p) instead of log(1-p). Nonetheless, with those corrections:
>>> import mpmath as mp
>>> mp.exp( mp.log(mp.mpf("1e-6")) / 1082 )
mpf('0.98731267732790007')
>>> mp.mpf('0.98731267732790007') ** 1082
mpf('1.000000000000029e-6')
>>> (1 - mp.exp( mp.log(mp.mpf("1e-6")) / 1082 )) / 100255
mpf('1.2655052288763586e-7')
So the next token in the output has to be an average likelihood of ~99%, 1,082 times in a row, to have a one in a million shot of generating that exact sequence. The other 100,255 tokens must share the remaining 1% between all of them, and thus have an average likelihood of about 127 in a billion apiece.
The main point remains intact.
> Anonymous: Have you heard of the ARC-AGI test?
As luck would have it, I’ve been looking into LLM benchmarks lately. That specific one hasn’t crossed my path, but a quick browse on Google Scholar was eye-opening:
> Our review also found that only 6 of the 23 surveyed benchmark studies (…) were peer-reviewed at the time of this article, reflecting the early-stage nature of research in this domain. While preprints provide valuable insights, the lack of rigorous peer review raises questions about the scientific validity and reproducibility of many LLM benchmark results. Moreover, the speed of LLM output generation—a crucial factor for user experience in real-time applications—is frequently overlooked in current benchmarks, which tend to focus solely on the qualitative correctness of generated answers. Perhaps most concerning, many benchmarks failed to account for the possibility that LLMs can optimize their responses specifically to perform well on standardized tests, rather than genuinely demonstrating deep understanding or reasoning. This risk of ”benchmark gaming” undermines the integrity of evaluations, as models
> may be engineered to exploit the structure of the test rather than showcasing their full capabilities across diverse tasks.
> Timothy R McIntosh et al., “Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence,” IEEE Transactions on Artificial Intelligence (2025): 1–18.
As implied by that first line, I found more papers criticizing benchmarks than proposing them. The more I look into the topic, the more I’m appalled at the lack of rigour.
> David in Tokyo: I’d tend to not use the word “memorize”, preferring to say “LLMs can reproduce large chunks of text from their training data, given an appropriate prompt”.
It’s actually a bit worse than that. Heard of this paper?
> Our investigation reveals that GPT-3.5 and GPT-4 have memorized many popular tabular datasets verbatim. For example, GPT-4 can consistently generate the entire Iris and Wine datasets from the UCI machine learning repository. What is more, the memorization of tabular datasets in GPT-series models is a robust phenomenon that does not depend on the precise model version. …
> We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. In addition, we find that adding small amounts of noise and other re-formatting techniques leads to an average accuracy drop of 6 percentage points on the memorized datasets. In contrast, the same transformations do not affect the few-shot learning performance on unseen data.
> Sebastian Bordt et al., “Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models” (arXiv, December 4, 2024).
In this thread I’ve focused on “memorize” similarly to how you’ve defined it, but there are other definitions. That paper invokes a fuzzier one, merely testing how good ChatGPT 3.5/4 are at predicting time-series data via metrics like mean relative error. Turns out, they can also store an approximate copy of numeric data in their training set.
> Carlos Ungil: I think the discussion about the word has distracted from the main point: “Rarer data seems more likely to be memorized, possibly because there’s no logic to what could come next.”
I should probably clean that up, as well. Feldman’s argument amounts to that, but look carefully at Carlini et al.’s paper: they found that the more popular the text, the more likely it was to be memorized. How can rarer data be more likely to be memorized, as well as more common data?
An easy explanation is that Feldman’s math and experiments are wrong, but there’s also the possibility that (to borrow from the last paper I quoted) both are touching different parts of the elephant. Feldman et al. aren’t really measuring the amount of memorization, but instead the importance of a specific memorized portion for generating the average output. Carlini et al. are concerned only with memorization; they did not examine how much influence a different memorized portion had over the output. Combining their research doesn’t create a contradiction; at best it merely demonstrates that both rare and common parts of the dataset are memorized. Or, as you put it:
> There is no logic anywhere, it’s statistical patterns all the way down.
Yesterday at our faculty retreat, the chair of my department (a CS theorist) mentioned how he had recently needed to write a proof, and he had in mind a strategy for doing it that was novel relative to prior proofs in the area. He knew it would still take him a few hours to work through it. Someone suggested he try having Gemini write it for him, so he did, and without him clueing it in to the reduction he had in mind, it arrived at the same proof. Only simpler, though that turned out to be because it had at one point assumed equivalence when it shouldn’t have.
The question I keep having (in relation to deciding how to deal with it in teaching grad students for example) is how much you can use it while still in the process of becoming an expert in some area, without your development of the ability to recognize good or correct outputs suffering. I.e., for any given skill (scientific writing, programming, writing proofs, etc.), how much learning can happen through practicing one’s ability to evaluate on solutions that someone else constructed versus requires trying to construct one’s own solutions to grasp.
Doesn’t that same question arise concerning how to deal with getting advice from other people, how to use software to do analysis, how to use books, etc.? In all cases, the user is in a state of learning and the tool may help or may mislead due to the user’s lack of knowledge. It seems extreme to say that tools are not useful until the user has constructed their own solutions, and it seems unwise to say that tools can be relied on in the absence of you having solved the problem. So, how much? Good question and no unambiguous answer I think.
Yes, I don’t think it’s a brand new problem, but for many complex tasks, we didn’t previously have tools that could get you all the way there, or most of the way there. Let’s say LLMs are here to stay and will only improve and get cheaper. Let’s say my goal is to be able to produce solutions of expert quality for some task. Without ever learning to write a solution manually myself, can I get to a point where (with an LLM) I am just as good as the old school expert at generating correct solutions? Or will my reliance on the LLM to generate solutions for me throughout the learning process somehow prevent me from attaining the same level of skill?
“just as good or better than the old school expert” I should have said.
My guess is that the old school experts will get better with LLMs, so the bar will move. I think LLMs are great pedagogical tools—everyone I talk to in academia, including me, uses them all the time as teachers/mentors.
What happens when they get good enough to do the work on their own? Presumably we find something else to do, or like chess, just keep keeping on playing even though we know the chess program on my iPad is already better than any human will ever be.
Bob:
You write, “everyone I talk to in academia, including me, uses [LLMs] all the time as teachers/mentors.”
I’m in academia, you talk with me, and I’ve never used LLMs. So your statement is false!
I agree they are great pedagogical tools. This is the biggest oversight in my opinion in the extreme, “outlaw-all-LLMs-in-education” arguments that have been cropping up all over certain social media sites…. yes, they have issues, there are lots of important questions about how much reliance is ok when learning, etc, but the ability to interact or check one’s understanding with an assistant that performs at the level of a competent peer over very dense technical material is also an incredibly useful service for learning. Instead all the dialogue ends up being about how they can hallucinate, as if this rules out any benefits for learning.
We have some data on that score, and it is not favorable. Young engineers, especially structural engineers, have grown up in the world of FEM (finite element modelling). FEM can be very powerful, especially in addressing problems where conventional analytical methods are hard (or impossible) to apply. However, the trend over the last 10 – 20 years has been for young engineers to become completely reliant on FEM software. They often will not do the simplest of beam calculations (e.g. max moment on a simply supported beam with XYZ distributed load), which should take 30 seconds by hand, unless there is some sort of GUI prompting them for inputs. The trouble is if you ask many of them a basic engineering question they will stumble (e.g. evaluate the critical perimeter(s) for punching shear of a 2-way slab with drop panels of certain dimensions) and yet, they will confidently churn out the design of a high-rise building using software. It’s become such an issue that many of my colleagues have resorted to very basic tests for new hires that involve routine 1st year statics as a way to weed out applicants; the sad result is that most people don’t pass muster including those with many years of experience. I believe I drunk ranted to Andrew about this at one point, and you can find a blog post about how this has happened (well my drunk interpretation of such events anyways lol).
For all intents and purposes FEM is being used as a black-box, which outputs rainbow colours (indicating stresses) and Christmas lights (red = fail, green = pass). Very few relying on the technology, which is now ubiquitous, know how to double check the outputs by hand and fewer still know what FEM even does under the hood (design being separate from the FEM analysis).
If this is any indication, I’m fairly certain an overreliance on LLMs will hinder the vast majority of people’s ability to obtain actual expertise. Of course, it can also be used as a force multiplier, as in the case with FEM where people know enough to doubt its outputs. Garbage in = Garbage out as they say.
Andrew wrote:
“I’ve never used LLMs. ”
I knew there was a reason I liked you!
From Andrew:
I should have said “almost everyone” or even “most” or maybe even “many”. Depends on the demographic. I do see almost all of our postdocs and software engineers using LLMs daily. Like the internet before it, the use is highly skewed by how much a person likes new technology. For example, I’m going to conjecture that there’s a strong negative correlation between AI users and paper map users.
Partial quote from above: “(…) without your development of the ability to recognize good or correct outputs suffering.”
I have been wondering about (the gist of) this in recent years if I am interpreting your comment correctly. Mostly in relation to reasoning, and subsequently, comprehending and judging and assessing scientific papers. There is a scene in one of my favorite movies “Good Will Hunting” where the professor says something like:
“You’re right Will, I can’t do this proof. But you can, and when it comes to that it is only about…it’s just a handful of people in the world who can tell the difference between you and me. But I am one of them.”
I possibly thought about this on some level when writing something about problematic issues in psychological science and using a quote by Binswanger (2014) who wrote “Scientific geniuses and lame ducks jointly populated universities, whereby even during the scientists’ lifetimes it was not always discernible who was the lame duck and who the genius.” (p. 50).
I may have combined the above mentioned two things by writing: “Perhaps it takes a “genius” or someone just short of being a “genius” (a “geniu” if you will) to recognize a “genius”. Perhaps the “lame ducks” can’t discern a “lame duck” from a “genius”. Perhaps a “lame duck” only hears another “lame duck” quack but doesn’t hear a “genius” speak.”
I think this possible problem of how to determine whether one, or someone else, is (more) capable of understanding some scientific paper or reasoning depicted, or assessing or judging an idea or proposal may be something to consider way more. I have thought about it and could only come up with, at least, emphasizing and training logic and reasoning skills. This has the added possible benefit of possibly learning that one might be better or worse in that than others, or that there might be some ceiling in one’s capabilities regarding this all. I don’t have the optimal and/or correct words to describe this possibly, because I was never taught much about this if I remember and see this correctly. I think this kind of stuff should be taught in high school, definitely at university, and perhaps even before high school: reading, writing, reasoning.
Just to share the paper I’ve recently come across: “Why Language Models Hallucinate” (https://arxiv.org/abs/2509.04664). Its abstract says:
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such “hallucinations” persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious — they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded — language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This “epidemic” of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
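The incentive the abstract describes comes down to simple expected-value arithmetic. Below is a rough sketch under my own assumptions: one point for a correct answer, zero for “I don’t know”, and a hypothetical `wrong_penalty` parameter that is not taken from the paper.

```python
ABSTAIN_SCORE = 0.0  # assumed score for answering "I don't know"


def expected_score(p_correct, wrong_penalty=0.0):
    """Expected score of guessing: +1 if right, -wrong_penalty if wrong."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct, wrong_penalty=0.0):
    """Guessing pays off whenever it beats the abstention score."""
    return expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE


# Binary grading (no penalty): even a 10% shot beats abstaining.
print(should_guess(0.10))                     # True
# With one point off per wrong answer, the same guess is a bad bet.
print(should_guess(0.10, wrong_penalty=1.0))  # False
print(should_guess(0.60, wrong_penalty=1.0))  # True
```

Under binary grading the guess wins for any nonzero chance of being right, which is the test-taker behavior the abstract refers to.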
The same authors have a more recent paper out (just last week) that attributes inevitable hallucination in part to how they are evaluated – in exam-style tests, it’s always in your best interest to make a confident guess over saying I don’t know:
https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4aaa5/why-language-models-hallucinate.pdf
Oh nevermind actually – I thought you were citing their old version of the paper, but I see that the new paper is actually just an update to the old one!
I do like that they’ve concluded exactly what I conjectured, namely that LLMs have been fine-tuned to act like undergraduates taking a one-shot test without the ability to ask follow up questions and no points off for guessing. This is also where their overreliance on “X, not Y” forms and “it is a testament to” and em-dashes comes from.
Speaking of disappointment, I was at a museum in Paris and some of the English translations were clearly GPT—they had both of the above tells and probably many more. And there wasn’t anything in the original French to suggest “testament” or the contrastive sentence structure.
Language models hallucinate because they are glorified Markov chains. Nothing more to it than that. Google “stochastic parrot” and Emily Bender and Melanie Mitchell’s work on language models (as well as Shalizi’s notebook).
There is nothing wrong with training them better, but the fake certainty may be what keeps the sequence of tokens going. Also, “natural statistical pressures” means the hallucinations are largely independent of each other. This is why so many people use one model to check the work of another and have them “talk it out”. And of course it is not limited to two models. The other thing is to require a source for whatever is being claimed. Exact line numbers, etc. This will often also trigger a reassessment.