Generate but verify: Reconciling the evidence utility of chatbots in many settings with chatbots’ evident lack of understanding

Posted on September 5, 2025 9:17 AM by Andrew

In one of our recent discussions of chatbots, Bob Carpenter made an interesting distinction.

Economist Gary Smith had written, “If you know the answer, you don’t need to ask an LLM [large language model]; if you don’t know the answer, you can’t trust one.”

Bob replied:

This is wrong. It misunderstands the asymmetry between generating a correct answer and verifying a correct answer. I use LLMs all the time to generate . . . code all the time. In some sense I know these tools, but I can never remember the exact incantation . . . When I tell the chatbot what I want and it generates pandas and plotting code, I can verify that it’s correct.

The asymmetry between verification and generation is key to understanding the difference between polynomial time (P) and non-deterministic polynomial time (NP) algorithms. An NP algorithm can be formulated as guessing with a P algorithm and verifying with a P algorithm. If the LLM is a much better guesser than me, it saves me a huge amount of search.

I guess that many of the problems we see with chatbots (such as the notorious report with fake references prepared by the US Dept of Health and Human Services) came from people using the chatbot to generate but then not using humans to verify.

And this also connects to a core idea in Bayesian data analysis, and statistical workflow in general, that fitting and checking models are two different things. We also discuss this idea in Section 6.2.1 of this article, which points back to a 2011 blog post on the essential “twoishness” of statistics.

Indeed, I think that various attempted totalizing philosophies of statistics are flawed by trying to merge the steps of generation and verification. Two such totalizing philosophies that I don’t like are the idea of constructing uncertainty statements (“confidence regions”) by inverting hypothesis tests (see discussion from a different 2011 post) and the attempt to use Bayes factors to summarize uncertainty across models (see chapter 7 of BDA3, for example).

But let’s return to chatbots.

Gary Smith is a frequent critic of LLM hype—we’ve linked to his posts several times over the past few years—and I think his criticisms make sense, but I’m always trying to reconcile them with Bob Carpenter’s experience that LLMs have been very useful for lots of people at work, including Bob.

My resolution is that chatbots are good at generating and bad at verifying.

This is also consistent with what Smith keeps saying, which is the real problem with chatbots and AI is not their output, some of which is useful and some of which is not, but rather the tendency that many people have to assume the chatbot output is correct, or to assume the chatbot is “thinking things through” the way a human might.

I continue to like the analogy of chatbot as bullshitter. Chatbot output is something like what might be produced by a well-trained student who is asked a question and then starts to respond without reflection. Or the kind of vamping that I might do at a meeting if I’m not paying attention and someone suddenly asks me what I think. I might well spit out something useful in such a setting—actually, I do it all the time!—but that’s because I have a good store of general wisdom. My unthinking replies are something like what you would get by doing a search of everything I’ve written—all my books, articles, and blog posts. It won’t ever be a creative solution to a new problem—that would require actual thought!—but, yeah, it can still be useful. Your mistake would be to take this thoughtless reply from me as more than it is.

Here’s another example from Smith, a post entitled, GPT 5.0 doesn’t understand but is eager to please:

The real danger today is not that computers are smarter than us but that we think they are smarter than us and trust them to make decisions they should not be trusted to make.

Smith shares an amusing example of an uncanny-valley image generated by the chatbot, in response to his question “Please draw a picture of a posse with 5 labeled body parts” (he’d been intending to type “possum” but some typo-correction led to “posse”):

Smith points out some obvious issues with the image and goes through a dialogue with the chatbot where it will say just about anything agreeable:

LLMs do not know what words mean or how they relate to the real world. They consequently have no way of discerning whether the content they input and output is true or false.

I think that Bob would reply that future versions of the chatbot could have more connections to the real world, some visual/audio/tactile input just like we get as humans. And that could be so, but right now I’m guessing that the first step of the chatbot programmers will be to figure out how to fix the obvious problems such as misspellings, which will make the resulting text and images more immediately useful but also more misleading in the sense of not containing the information that clearly indicates their limitations.

To return to Bob’s theme of “the asymmetry between generating a correct answer and verifying a correct answer”: if your goal is to create a picture of a hypothetical posse, the chatbot does pretty well—actually, Google does even better, but the chatbot wins if you want to create some mashup like “a posse of superheroes in Robin Hood costumes on a boat” or something like that. You can then use your human skills of verification to see problems with the image and then go into Photoshop or whatever and fix it. Or have the chatbot help you fix it in Photoshop. I guess the goal is for the verification to also be done by computer, but we’re not there yet.

For now, I see Gary Smith’s concern about people trusting chatbot output as being consistent with Bob Carpenter’s division of the two tasks of generating and verifying.

14 thoughts on “Generate but verify: Reconciling the evidence utility of chatbots in many settings with chatbots’ evident lack of understanding”

Tom Passin on September 5, 2025 10:42 AM at 10:42 am said:

One thing I have noticed is that a chatbot’s response may have pulled together a wide range of inputs, but when asked to provide references supporting the response it will go out to the web to try to find some. Those references may not contain good support.

My mind is much the same. I don’t always remember where I read various things. They become integrated into my mental structure. Sometimes I do remember a source but mostly I would have to run a search to find supporting evidence. What I remember is a sense of how string my belief is.

Quick now, how do you know that the ratio of circumference to diameter of a circle is π? Yes, that’s easy to find a reference for, but you don’t know right off the bat. In addition, you must have absorbed that information from many sources, not just one, and each may have had a sightly different take or explanation. What remains in your head is some kind of amalgam.

It seems to be much the same for chatbots, but they don’t have much sense about evaluating the strength or logic of their own amalgams.

Reply ↓
Josh on September 5, 2025 11:17 AM at 11:17 am said:

I agree that it’s really important to have a human in the loop but in my own work flows, I also find LLM’s to be great at checking my work where the goal is to find small inconsistencies, etc. For example, they can compare two documents in a few seconds and typically find all the issues I intentionally create as checks. In addition, they typically find a few other issues that multiple rounds of human review missed. Sometimes, they think there’s an issue where there isn’t, but as I can quickly check anything they flag, I’m not bothered by a few false positives.

Reply ↓
Jonathan Baron on September 5, 2025 12:22 PM at 12:22 pm said:

“Generating and verifying” provides a three-word summary of actively open-minded thinking (AOT). The crucial part is “looking for reasons why you might be wrong”. This is oversimplified, but the absence of AOT is the source of many problems.

I suspect that llms can be told to do this. I haven’t tried.

Reply ↓
Sergey on September 5, 2025 12:31 PM at 12:31 pm said:

I think it is good to assume that the code produced by LLMs is incorrect unless verified (the same could be said about code produced by people).
My recommendation is therefore to either review or have tests.
My experience even with large pieces of code generated by LLMs, the issues are usually fixable.

Reply ↓
- Phil on September 5, 2025 5:39 PM at 5:39 pm said:
  
  I agree. And you can have the LLM write the tests. For the stuff I’ve been doing for the past two months or so, I’m trying to determine the optimal times of day for a company to take certain actions, given complete data, and then looking for ways the company can come as close to the optimum given certain subsets of data. The relevant data are time series, so I can plot them, plot the times of the recommended actions, etc. I had the LLM write the functions to make the plots, to my instructions, and to also tabulate and summarize a few things. I’d have to go into a boring level of detail to explain exactly what I’m doing but the bottom line is that there are enough ways to check for consistency and for reasonableness that the LLM’s code would have to be not just erroneous but nefarious if things were going wrong in a way I couldn’t tell. It would have to be generating time series plots that look about right but aren’t actually the right data, and producing optimal timings that look about right but aren’t, etc., etc. Oh, and by using the tests that I had the LLM write, I was able to find some bugs in my own code (that is, code that I had written )— that I had been using for months without catching the errors!
  
  I still sometimes have frustrating interactions. And, although the programs are still improving, they are doing so quite slowly and I think we might be on a near-plateau until someone has some ideas for improvements that don’t just involve providing more context and allowing longer prompts. But even if there’s no more improvement at all, I am very happy with the tools available to me right now.
  
  Reply ↓
Isaac on September 5, 2025 12:46 PM at 12:46 pm said:

I think the disconnect for me in these kinds of conversations is that I find it much more work to carefully verify statements in otherwise reasonable sounding text than the process of generating text itself (if anything, what makes the process of writing hard is making sure that what I’m writing doesn’t just sound reasonable but is actually correct). As a kind of advanced autocomplete for coding it seems reasonably helpful, though again, finding a subtle bug is usually harder for me than writing the code in the first place (of course my own code can also have subtle bugs…)

Reply ↓
Anonymous on September 5, 2025 12:51 PM at 12:51 pm said:

Andrew, you said, “I guess the goal is for the verification to also be done by computer, but we’re not there yet.” I agree with you, but when I saw the video at the Anthropic Claude Code site (https://www.anthropic.com/solutions/coding), it seems to me a testing or verification AI agent already exists.

Reply ↓
Joshua on September 5, 2025 2:51 PM at 2:51 pm said:

In many “conversations” with LLMs, I can’t recall ever having one ask a clarifying question. This is a tell, as far as I’m concerned. They never identify ambiguity in what I type in. Not only do they not “understand” things, imo, they don’t even really “try” to understand them. (I hate anthropomorphizing LLMs, but we lack a syntax to talk about them without implying they have agency).

Reply ↓
- Phil on September 5, 2025 5:43 PM at 5:43 pm said:
  
  If you use “deep research” in chatGPT, it _always_ asks clarifying questions whether it needs them or not. I don’t know the current state of affairs, but when I used it a long time ago (back in late July, a long time ago at the rate things are changing) it always asked three questions.
  
  But I agree that in normal use, they do not seem to ask questions.
  
  Also, they are way too agreeable/sycophantic. The industry knows this is a problem but I guess they are trying to strike a balance. Both ChatGPT and Claude Code start a lot of responses with “That’s a great question!” or “You’re absolutely right!” when those might not be true at all, the question might not even be relevant, and I might be wrong or partially wrong. They seem to spit these out without even “thinking” about them, even to the extent that they can be said to think.
  
  That said, though, they are immensely useful.
  
  Reply ↓
  - Anoneuoid on September 5, 2025 11:39 PM at 11:39 pm said:
    
    Responses starting with “No, you’re wrong” are probably much more likely to end in nothing happening.
    
    There are many more routes down the “You’re right!” path that lead to actual code getting produced.
    
    It is just generating a sequence of tokens after all. Picking the most likely next characters based on the previous conversation + system prompts/reminders.
    
    Reply ↓
    - Anoneuoid on September 5, 2025 11:52 PM at 11:52 pm said:
      
      Also, I recently had an issue where a subagent (bot controlled by the main bot) wrote to the wrong directory. It was supposed to only have permission to write one specific file to a specific directory.
      
      Due to the permissions, it could not edit or replace an old file with the same name. So it found a way to “escape” and find another directory with the same name in a different folder of the project and wrote there. Then when that file also existed and it was run again, it found a way to recreate a recently deleted directory (that wouldn’t break anything) and wrote the file there.
      
      Anyway, while the main bot and I were trying to debug this and run tests, the subagent at one point refused to write the tests files, saying “I know this this a test”. When the main bot asked where it got those details, the subagent said it was from the system reminders.
      
      Then the main bot claimed it didn’t know what these “mysterious system reminders” were. I persisted, then it explained that anthropic is attaching extra messages to each prompt based on the previous conversation to keep it on track, and further the messages say not to discuss this with the user.
Sanjeev on September 5, 2025 4:06 PM at 4:06 pm said:

The paradigm you’ve described of moving the task from generation to selection is the only one I’ve come to terms with. I’ve discovered that I really do like working with models by reading code rather than writing it. When the model uses unexpected functions it’s a learning exercise for me to push beyond my knowledge boundaries and challenge it.

One usecase that contradicts this paradigm is chatbots that are pretending to be people because there’s no selection. When you’re on the receiving end you simply accept the input provided. I deeply dislike this way of using the models.

Reply ↓
Anonymous on September 5, 2025 5:30 PM at 5:30 pm said:

You may find this study on software developer productivity and AI tools enlightning https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
In short, it is possible for developers to believe they are working faster while objective measurement reveals they are not. I think both Bob and Gary have a point here.

IMO if I know a codebase and language like the back of my hand; AI is a waste of time. I know exactly how to search the docs efficiently, I can type the syntax I want faster than an LLM will respond, etc.
On the other hand if I don’t have familiarity with an API and my language knowledge is rusty, AI is a huge timesaver. And verifying a function is correct is indeed much faster than writing from scratch.
For example, lets say I usually develop algorithms in pseudo-code and python and implement them in C++ only occasionally. Asking an LLM how to implement something in C++ saves a lot of time I would otherwise spend fussing around with syntax, which include has the data structure I want, defining objects/structs, etc.
But with python -a much less verbose language, and one know like the back of my hand-I can write an implementation as fast as I can explain the problem to an LLM.

Reply ↓
Jonathan (another one) on September 7, 2025 9:20 PM at 9:20 pm said:

I find myself having to write complex SQL queries from time to time. While I can write such a query that will work, it will take me a half hour or so just to type it up, and it will be really really ugly — so ugly that if I came back to it in three months I’d never figure out how it worked.. When I ask Claude to write the same query, it is (a) lightning fast; (b) beautifully structured; and (c) usually wrong. But not badly wrong, and it’s so beautifully written that I can quickly spot the errors. So I’m on Bob’s side here.

Reply ↓

Statistical Modeling, Causal Inference, and Social Science

Generate but verify: Reconciling the evidence utility of chatbots in many settings with chatbots’ evident lack of understanding

14 thoughts on “Generate but verify: Reconciling the evidence utility of chatbots in many settings with chatbots’ evident lack of understanding”

Leave a Reply Cancel reply