Steering a middle ground between two extreme takes on the role of statistics in the development of language models

The other day Jessica had a post on interpretable statistics for large language models in which she discussed an article by a statistician, Weijie Su, and a post by a computer scientist, Ben Recht, presenting two opposing views regarding the role of statistics in computer science.

In reaction I wrote a long comment, but not everyone reads the comments so I’ll reproduce it here:

I wonder if some of the debate, such as it is, depends on framing more than anything else.

In the title of his paper, Su asks whether language models “need” statistical foundations, but in the abstract he argues that they would “benefit” from statistical contributions.

In his post, Recht objects to checklists as a form of “bureaucracy” and refers to statistics as “a bunch of arbitrary rules.”

I wonder if the implications of human language and rhetoric are pushing the two sides apart.

On one side, Su makes very reasonable arguments for the value of statistics in the development and assessment of computer language models. But then in the title he uses the word “need,” which is some form of rhetorical overkill, in the sense that large language models are computer programs (more precisely, interactions of computer programs and human operators) that already exist and serve many functions, so clearly they don’t have any absolute need for anything more. One could argue that large language models have been developed in light of statistics (including Bayesian methods!), so in that sense they retrospectively have “needed” statistical foundations already, but I take Su’s point to be that these models would benefit from additional statistical insight; he’s not just talking about the statistical foundations that were already there.

On the other side, Recht presents a very reasonable engineering perspective that is anti-bureaucratic: we’ve already made a lot of progress and continue to do so, so don’t tell us what to do. Or, to put it more carefully, you can tell us what to do for safety or public policy reasons, but it seems like a mistake to try to restrict researchers’ freedom in the belief that this will improve research progress. This general position makes sense to me, and it is similar to many things I’ve said and written regarding science reform: I don’t want to tell people what to do, and I also don’t want criticism to be suppressed. That doesn’t mean that science-reform proposals are necessarily bad. For example, I find preregistration to be valuable (for the science, not the p-values), but I wouldn’t want it to be a requirement.

Anyway, my point is that, just as Su seems to be making a logical leap from the reasonable statement that language models should “benefit” from statistical foundations to the possible claim that these statistical foundations are “needed,” it seems to me that Recht is making a leap in the other direction from the reasonable statement that language models don’t “need” new statistical foundations to the claim that statistics is “a bunch of arbitrary rules.”

I’m sitting here saying, “Can’t we all just get along?” Here are some quick points:

– Language models have developed in light of decades of developments in the foundations of computer science, statistics, linguistics, psychology, etc.

– With that in mind, it seems reasonable that future developments in language models will benefit from new ideas in statistics.

– To flip that last point around: language models are successful enough now that I’m sure they’ll continue to be successful even in the absence of any systematic research at all, let alone new statistical foundations. I think new statistical perspectives on language models can be useful and even important; I don’t think they’re needed.

– Checklists can be useful. It depends on what’s in the checklist. Checklists of required steps can be useful (and should be held to a higher standard than optional checklists); again, it depends what’s there.

– The Neurips checklist is ridiculously detailed–and not just in its requirements about statistics. It’s absolutely nuts, a nightmare of bureaucratic red tape–and it’s required for all submissions! Ultimately, this is Neurips’s choice. If they want to make it a pain in the ass to submit papers there, and if they want to impose a bunch of stupid requirements–including all that crap about “statistical significance”–then, hey, it’s their call. People can still publish their papers on Arxiv, ICML, JMLR, etc. It just seems like a bad idea to me.

– The existence of a stupid and epically bureaucratic Neurips policy should not be taken as a sign that future developments in language models will not benefit from new ideas in statistics. A stupid checklist is a sign of a stupid checklist, and it’s an indication that there’s some sloppy thinking going around, along with some unfortunate committee dynamics. Obsessing over statistical significance and error bars (“It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of normality of errors is not verified”) is a bad idea; there’s a lot more to statistics than hypothesis testing error bars! Indeed, I think there are very solid statistical reasons for being uninterested in the endpoints of a confidence interval. And that “96%” thing, that’s just preciously stupid–why not go whole hog and say “95.4%”?
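To put numbers on the “96%” versus “95.4%” remark: if errors are normally distributed, a 2-sigma interval covers about 95.4% of the probability, and a 96% interval is slightly wider than 2 sigma. Here is a minimal Python sketch of that arithmetic. It assumes a normal error model, and the function names are made up for illustration.

```python
import math
from statistics import NormalDist


def normal_coverage(k: float) -> float:
    """Probability that a normal draw lands within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


def z_for_coverage(level: float) -> float:
    """Half-width, in standard deviations, of the central normal interval with the given coverage."""
    return NormalDist().inv_cdf(0.5 + level / 2)


for k in (1, 2):
    print(f"{k}-sigma coverage: {normal_coverage(k):.4f}")  # 0.6827, then 0.9545
print(f"z for a 96% interval: {z_for_coverage(0.96):.3f}")  # 2.054
```

None of this says the endpoints are worth reporting; it only shows where the two numbers come from.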

OK, I’m drifting here. The point is that we should not let our thinking be defined merely by opposition to foolish things. Much of the Nudge industry is stupid; that doesn’t mean that all of social psychology is a scam. The much-hyped beauty-and-sex-ratio research is complete crap; that doesn’t mean that evolutionary biology is useless. A Neurips committee gave bad statistical recommendations; that doesn’t mean that statistics itself is a bunch of arbitrary rules.

– Although statistics is not “a bunch of arbitrary rules,” it can be perceived that way by non-statisticians, and statistics textbooks can sometimes encourage this attitude with their rigidity. In Regression and Other Stories we try to do better; also econometrics books by Angrist, Pischke, Cunningham, etc., present pragmatic takes, but traditional statistical education still often promotes an ideology that alternates uncomfortably between very strong assumptions (simple random sampling with perfect response, ideal randomized experimentation, Poisson and binomial distributions, etc.) and equally strong statements that nothing can be believed in the absence of whatever particular assumption is being focused on at the moment.

– One of the great things about computer science is its openness to different theoretical approaches. This is something that computer science shares with political science–research in theoretical foundations is itself part of the discipline–and it differs from, say, statistics and econometrics, two fields that are often (but not always) hampered by conceptual rigidity. The openness of computer science to different perspectives is a good thing, and it should not be taken to imply that different theoretical frameworks (including those of Bayesian and frequentist statistics) will not be helpful in current and future research in the field.

I think it would help if people could be more precise when writing about science. It can be frustrating to even bring this up, because there’s this idea that insisting on precision is some sort of “autistic” engineery-thing to be doing. But in science, I think it should be possible to avoid saying things that we know not to be true, especially when speaking with an air of authority. The examples from Su and Recht above are mildly annoying to me, but perhaps they are reasonable hyperbole in the sense that no reasonable reader would believe that language models truly “need” statistical foundations or that statistics is literally “a bunch of arbitrary rules.” I was more disturbed by the examples discussed here.

P.S. I originally titled this post, “Toward more precision in meta-scientific discourse,” but then I remembered Phil’s “Even camels get thirsty” principle. Not that my new title is so great–it’s no “Even camels get thirsty,” that’s for sure–but at least it’s direct.

13 thoughts on “Steering a middle ground between two extreme takes on the role of statistics in the development of language models”

  1. > I think it would help if people could be more precise when writing about science. It can be frustrating to even bring this up, because there’s this idea that insisting on precision is some sort of “autistic” engineery-thing to be doing. But in science, I think it should be possible to avoid saying things that we know not to be true, especially when speaking with an air of authority.

    Amen. I agree both authors were exaggerating in this case, though Su’s stretch seemed less egregious. Maybe I’m still irked because after I wrote that post pointing to Su’s paper, Recht wrote a follow-up post in which he implied my post was advocating for more stats in ML of the ridiculous NeurIPS checklist kind. I assume he did it to make his point seem stronger, but I do not appreciate being set up as a strawman for the convenience of someone’s arguments!

  2. Thanks for your insights! I’ll push back on this premise:

    “Language models have developed in light of decades of developments in the foundations of computer science, statistics, linguistics, psychology, etc.”

    Although it pains me to say this as someone who loves statistics and linguistics, I’ll say that this is simply not true.

    While classic language models had (weak) connections to statistics (e.g., connection between “smoothed” n-gram language models and Bayesian priors), modern neural network-based LLMs do not really rely on any statistical foundations. (Other than, perhaps, the maximum-likelihood training objective, but this is really a shallow connection in my opinion.)

    In fact, one could argue that LLMs have developed and succeeded *despite* the fact that statistical/linguistic foundations would suggest that they shouldn’t work (e.g., overparameterization, no language-specific biases, etc.)
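For readers who have not seen the connection between “smoothed” n-gram models and Bayesian priors mentioned in this comment, here is one standard instance, sketched in Python: add-alpha smoothing of bigram counts gives the same number as the posterior mean of the next-word probabilities under a symmetric Dirichlet(alpha) prior. The toy corpus and the function name are hypothetical, chosen only for illustration.

```python
from collections import Counter


def smoothed_bigram_prob(tokens, prev, word, vocab, alpha=1.0):
    """Add-alpha estimate of P(word | prev).

    Equivalent to the posterior mean of a multinomial next-word
    distribution under a symmetric Dirichlet(alpha) prior over vocab.
    """
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Total number of times `prev` was followed by any word.
    context_total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    return (bigram_counts[(prev, word)] + alpha) / (context_total + alpha * len(vocab))


tokens = "the cat sat on the mat".split()  # toy corpus, illustration only
vocab = sorted(set(tokens))
print(smoothed_bigram_prob(tokens, "the", "cat", vocab))  # seen bigram: 2/7
print(smoothed_bigram_prob(tokens, "the", "sat", vocab))  # unseen bigram, still nonzero: 1/7
```

The prior is what keeps unseen bigrams from getting probability zero; larger values of alpha pull the estimates further toward uniform.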

    • Anon:

      I’ll leave Bob Carpenter to respond regarding your statement that the foundations of linguistics were not relevant to language models. It’s my impression that lots of work in computational linguistics led up to the models being used today.

      Regarding statistics, yeah, lots of statistical ideas are important to language modeling, including regularization, Bayesian computing, logistic regression, all sorts of things.

      It is a misunderstanding of decades of developments in the foundations of statistics to say that statistical foundations would suggest that “overparameterization” would cause language models to not work. There’s been decades of developments in the foundations of statistics on overparameterized models and regularization; see section 1.3 of this paper, for example.

      • Thanks for engaging! I think the confusion may come from the difference between “classic language models” and modern “large language models”. While both obviously deal with language models, they are entirely different beasts. I agree that statistical ideas have indeed been relevant to classic language models. Modern LLMs, not at all.

        If you disagree, I’d be curious to hear how you think “regularization, Bayesian computing, logistic regression” are relevant for modern LLMs. (Sure, we use L2 decay for modern LLMs, but this does not require a statistical motivation).

        Finally, regarding connections to linguistics, I’d also be curious to hear from folks like Bob Carpenter. In my opinion there is nothing from computational linguistics that led up to the development of modern LLMs. Again, in the realm of “classic language models” I could see a connection (e.g., grammar-based language models), but this is simply not relevant for modern LLMs.

        • I had my answer but I had to ask an LLM because … rules …

          Modern LLMs learn P(token|context) using stochastic gradient descent to minimize cross-entropy loss and generate text by sampling from learned probability distributions. Temperature sampling, nucleus sampling, and other generation techniques are all applications of probability theory. The attention mechanism is essentially learned matrix factorization …
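As a rough illustration of the generation step described in this comment, here is a minimal Python sketch of temperature scaling followed by nucleus (top-p) sampling over a vector of logits. It is not any particular library's implementation; the function names and the four-token example are assumptions made for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token index from the temperature-scaled, nucleus-filtered distribution."""
    probs = nucleus_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for a 4-token vocabulary
print(sample_token(logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

Both knobs are operations on a probability distribution: temperature rescales the log-probabilities, and top-p truncates the tail before sampling.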

        • Intellectually, LLMs are Chomsky on steroids. Chomsky won’t like my saying that, but the major claim of Chomskian linguistics is that it’s possible to have a theory of language without a theory of meaning*. It’s in that sense that LLMs are Chomskian. Of course, not only do LLMs not do meaning, they don’t even do grammar.

          *: The generative semanticists (and the AI types of the 1970s/1980s) objected to this vociferously, but our collective flakiness resulted in the Chomskyites winning the linguistic wars. Sigh.

          Of course, this idea is completely ridiculous, and getting more so as neuroscience advances. Back in the day, modularity of mind (and/or brain) seemed a good idea, and lots of people thought/argued about it. (And Wernicke’s aphasia patients sure make it seem that you can have language without meaning.) But recent work has shown that even the lowest-level functions in the brain make use of high level information. For example “find something off/strange in this scene” is way harder than “find a cat in this scene”, and the cat recognition stuff (when you know it’s a cat) happens at the almost lowest level of image processing. So, sure, there are low level modules. But they communicate with thought at the intellectual (symbolic) level just fine, thank you.

        • I agree. Classic language models and contemporary LLMs are different beasts. I find it useful to think of LLMs as associative memories having a statistical character. As far as I know, the people who created them know little to nothing about computational linguistics, whether in its oldest form, symbolic systems from the 1950s up through the 1970s, or statistical models, starting, I believe, in the 1980s.

          FWIW back in the 1970s I was trained in computational semantics by David Hays, who’d been a first generation researcher in machine translation. He led the effort at RAND. When the funding disappeared in the mid-1960s Hays led the rebranding of the discipline as computational linguistics. He wrote the first textbook on CL and was the first editor of the American Journal of Computational Linguistics (now just Computational Linguistics). I was bibliographer of the journal for three years, which meant that I had to prepare abstracts of the current literature.

      • I’d say that traditional “theoretical linguistics” as practiced in the U.S., wasn’t much of an influence on LLMs. The effect isn’t zero, but it’s relatively small and indirect. There was some influence of more traditional (pre 1990s) natural language processing. But there were three much bigger influences.

        First, the language modeling work that originated in Shannon’s work on information theory. Shannon estimated and simulated LMs, including posterior predictive checking, in 1948. This is the framework on which LLMs were built: reducing entropy to enable better prediction (and thus, as motivated Shannon, tighter compression).

        Second, the work on neural networks in machine learning. Without that, you couldn’t have built the infrastructure. That needed the GPU revolution at scale.

        Third, LLMs were influenced by all the work on language modeling using ML leading up to LLMs. For example, word embeddings were super popular for classification and information extraction. Word embeddings go back to at least Salton in the 1960s; he used vectors of document counts to embed words and documents in a vector space. We also needed language models to do speech recognition, classification, spell checking, etc., so there was a lot of work there. This was all influenced to some extent by traditional NLP and so to some extent by traditional linguistic theory. But it’s not a direct line. During the ML for NLP boom, there was a ton of work on language modeling in that world that crossed over into signal processing and applications at places like Google.

        You could argue traditional understanding of language at the dialogue level is what let OpenAI fine tune LLMs. But I don’t think that was driven by a lot of traditional linguistic theory. But it was driven by an understanding of language.

        • Yes, I believe Salton’s work on document retrieval is where word embedding got started.

          In the mid 1970s ARPA (as it was called then) sponsored a major project on speech understanding, with independent projects at BBN in Massachusetts, SRI International in California, and Carnegie-Mellon. At that time the phonetics/phonology front-end consisted of hand-coded rules. Carnegie-Mellon’s system used a “blackboard” architecture which allowed the system to bring syntactic and semantic considerations to bear on speech recognition. This was just before statistical methods took over speech recognition.

  3. What strikes me, particularly in light of Ben Recht’s recent posts, is how much we’ve had to fundamentally rethink established theory. The discovery of “double descent” has essentially overturned decades of overfitting literature that predicted monotonic increases in test error beyond the interpolation threshold. We’re now grappling with phenomena that our classical bias-variance intuitions didn’t anticipate (as far as I know).

    This reflects the dynamic interplay between engineering and science, where empirical breakthroughs often precede theoretical understanding. https://www.gojiberries.io/the-nonscience-of-machine-learning/

    There are also many places where ML struggles. Firmer theoretical understanding, for instance, of the optimality of changing the objective function midway from next token prediction to RL/SFT etc. (common …; some connection to ‘hierarchical modeling’), could unlock better training recipes and help address many areas where ML currently struggles—sample efficiency, robust generalization, and training stability, to name a few.

  4. Similarity-Distance-Magnitude (SDM) networks (and estimators) already address these issues.

    This replaces the previous underlying statistical model for language models, which was introduced in I. J. Good, 1953 (“The Population Frequencies of Species and the Estimation of Population Parameters”).

  5. Dear Professor Gelman,

    Thank you for the insightful comments on my paper. It is a privilege to receive your feedback. Please accept my apologies for the delayed response.

    On the point you raised about necessity, I used “need” to describe a prerequisite rather than a simple feature. My view is that LLMs are valid for these applications only when they meet specific criteria, such as trustworthiness. In this sense, the requirement is not merely a benefit. It is part of the model’s essence. Without it, the model fails its primary purpose.

    This is where statistics becomes indispensable. Since LLMs are inherently stochastic, their trustworthiness cannot be assured or verified by deterministic checks. We have to rely on statistical frameworks to provide those guarantees and rigorously evaluate performance bounds.
