In ML, everyone’s Humpty Dumpty

This post is from Bob.

I used to work in natural language semantics, and the following dialogue from Lewis Carroll’s Through the Looking Glass, and What Alice Found There was the most common pull-quote to see at the beginning of a thesis.

“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean — neither more nor less.”

“The question is,” said Alice, “whether you can make words mean so many different things.”

“The question is,” said Humpty Dumpty, “which is to be master — that’s all.”

Humpty Dumpty came to mind recently after a spate of discussions with ML folks about inference (i.e., what they call “learning”).

What in the world does “empirical Bayes” mean?

Empirical Bayes came up on Wednesday with some ML folks I was talking to, then I ran into Dave Blei this morning, who told me he’s giving a sequence of talks on empirical Bayes over the next few weeks (at Columbia and at University of Chicago). I asked him what “empirical Bayes” meant to him, because it seems to be used very fluidly in ML. He gave me a new usage, saying he used it for any model that uses data to fit parameters of a prior, including just plain old hierarchical modeling fit with sampling. Dave gave the example of ARD in Gaussian processes (aka, hierarchical models).

I would only use the term in the way described in the Wikipedia entry for “Empirical Bayes”, namely

… empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out.

I pinged Mark Goldstein, one of our top-notch ML postdocs, and he pretty much reeled off the Wikipedia definition. So there seems to be a lot of variation in how this is used.
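
To make the Wikipedia-style definition concrete, here is a minimal sketch in Python. It assumes a toy normal-normal hierarchical model that is not from the conversation above: theta_i ~ Normal(mu, tau^2) and y_i ~ Normal(theta_i, sigma^2) with sigma known. The function name and the data are made up for illustration. The hyperparameters mu and tau^2 are set to their marginal maximum likelihood values and then plugged in, rather than being given a prior and integrated out as a fully Bayesian fit would do.

```python
from statistics import fmean, pvariance

def empirical_bayes_normal_means(y, sigma):
    """Plug-in (empirical Bayes) posterior means for each theta_i.

    Assumed toy model (hypothetical, for illustration only):
        theta_i ~ Normal(mu, tau^2)
        y_i     ~ Normal(theta_i, sigma^2), with sigma known and positive
    """
    # Marginally y_i ~ Normal(mu, sigma^2 + tau^2), so the marginal
    # maximum likelihood estimates of mu and tau^2 have closed forms.
    mu_hat = fmean(y)
    tau2_hat = max(0.0, pvariance(y, mu_hat) - sigma**2)
    # Conditional on (mu_hat, tau2_hat), each posterior mean shrinks y_i
    # toward mu_hat by this weight.
    weight = tau2_hat / (tau2_hat + sigma**2)
    theta_hat = [mu_hat + weight * (y_i - mu_hat) for y_i in y]
    return mu_hat, tau2_hat, theta_hat

y = [0.0, 2.0, 4.0]  # made-up observations
mu_hat, tau2_hat, theta_hat = empirical_bayes_normal_means(y, sigma=1.0)
print(mu_hat, tau2_hat, theta_hat)
```

Under the broader usage, the same model fit by sampling mu and tau along with the theta_i would also count as empirical Bayes, which is exactly where the definitions part ways.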

Robbins on Empirical Bayes

Dave also pointed me to the following video by Herbert Robbins (yes, that Robbins, who was so far ahead of the computational statistics curve that he introduced stochastic gradient descent and multi-armed bandits in the early 1950s).

Terminology drift in ML

I’ve been a bit shocked at how many technical terms have drifted in meaning in ML. I’m not talking about people making honest mistakes or clueless mistakes, I’m talking about true drift in meaning where the ML folks will stand by their definitions.

Likelihood: I’ve seen “likelihood” used for what I’d call the data generating distribution and sometimes just as a synonym for density. Aki has this one covered in his recent post, A data model is not just a likelihood. In stats, the likelihood is defined as the function L(theta) = p(y_obs | theta)—that is, it’s a function of theta for some fixed observed data.
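
A small sketch may help separate the two readings. It assumes a Bernoulli data model with four made-up observations (the names data_density and likelihood are hypothetical). With theta fixed, the density sums to one over all possible data sets; with the data fixed at y_obs, the same expression read as a function of theta is the likelihood, and it has no reason to integrate to one.

```python
from itertools import product

def data_density(y, theta):
    """p(y | theta) for a sequence y of independent 0/1 outcomes."""
    successes = sum(y)
    return theta**successes * (1.0 - theta) ** (len(y) - successes)

y_obs = (1, 0, 1, 1)  # made-up observed data

def likelihood(theta):
    """L(theta) = p(y_obs | theta): a function of theta, data held fixed."""
    return data_density(y_obs, theta)

# Fixed theta, varying y: a proper distribution over data sets (sums to 1).
total_over_data = sum(
    data_density(y, 0.3) for y in product((0, 1), repeat=len(y_obs))
)

# Fixed y_obs, varying theta: midpoint-rule integral over (0, 1).
n_grid = 10_000
area_over_theta = sum(
    likelihood((i + 0.5) / n_grid) for i in range(n_grid)
) / n_grid

print(total_over_data, area_over_theta)
```

Here the integral over theta is the Beta function B(4, 2) = 1/20, not 1, which is one reason "likelihood" and "density" are not interchangeable.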

Causal: With the advent of LLMs, anything with an autoregressive structure is now being described as “causal” in a “past causes the future” sense. This is even being extended to arbitrary directed graphical models, which are now being called “causal” even when there’s no explicit causality being modeled. That is, you can now describe a simple regression from an observational study as “causal” with no extra work.

Estimation: Estimation is almost always called “learning” in ML.

Parameters: These are usually called “weights” for neural networks, which I think now make up 99.9% of all work in ML. But if you tell ML folks that neural networks are parametric models, they’ll most often deny it. A statistician would confusingly call a neural network a “non-parametric model” and then tell you that means it has a lot of parameters.

Inference: In our diffusion model reading group, the ML postdocs tell me that “inference” means what I would call “posterior predictive sampling”. For example, generating output to a query from an LLM would be called “inference.”

Bias: In statistics, this usually means expected error. In ML, it’s heavily overloaded. It can be used to name just about any kind of error measure (e.g., errors in Matt Hoffman’s sampling papers). ML folks also use the term “bias” to mean the intercept in a regression (no, I’m not kidding).
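
The two most common senses can be put side by side in a short sketch. Everything here is an illustrative assumption: the estimator (the divide-by-n variance estimate), the sample size, the seed, and the function name linear_unit. The first part approximates statistical bias, E[estimate] minus the true value, by simulation; the second shows the ML “bias” as nothing more than an intercept.

```python
import random
from statistics import fmean, pvariance

random.seed(0)
true_variance = 1.0
n = 5  # small sample size, chosen to make the bias visible

# Statistical bias: E[estimate] - true value, approximated by simulation.
# pvariance divides by n, so theory gives a bias of -true_variance / n.
estimates = [
    pvariance([random.gauss(0.0, 1.0) for _ in range(n)])
    for _ in range(20_000)
]
statistical_bias = fmean(estimates) - true_variance

# ML "bias": the intercept b in a linear unit w * x + b.
def linear_unit(x, w, b):
    return w * x + b

print(statistical_bias, linear_unit(2.0, w=3.0, b=1.0))
```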

Prior: This is often called an “inductive bias” in ML circles, which can include aspects of data generating distributions as well as priors.

In Bayesian statistics, a prior is the marginal distribution over parameters. In ML and informal presentations of “Bayesian statistics,” it’s just any marginal that gets plugged into Bayes’s rule. For instance, the prevalence of a disease p(disease=+) is called the “prior” when I evaluate positive predictive accuracy p(disease=+ | test=+) given a testing sensitivity distribution p(test=+ | disease=+). In a k-way classification, “prior” means the marginal distribution over categories.
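
As a worked instance of that disease-testing calculation, here is a minimal sketch. The numbers are invented, and the specificity p(test=- | disease=-) is an extra assumption that the calculation needs but the paragraph above does not mention. Note that nothing here is a distribution over parameters: the “prior” is just the marginal p(disease=+).

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """p(disease=+ | test=+) by Bayes's rule.

    prevalence  = p(disease=+)            (the informal "prior")
    sensitivity = p(test=+ | disease=+)
    specificity = p(test=- | disease=-)   (assumed; not given in the text)
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical values for illustration only.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.95, specificity=0.90)
print(round(ppv, 4))
```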

Bayesian: I think in ML this term is used very broadly for any situation in which there’s a prior not strictly stated as a penalty function for penalized learning (i.e., regression). I think Andrew’s down with this definition as he also uses Bayesian for anything that looks vaguely Bayesian no matter how inference is performed. For example, Empirical Bayes is just Bayes to Andrew, as is using a Laplace approximation or even a simple maximum likelihood estimate (just think of it as a one-point posterior summary!).

Regression: Not quite on topic, but I think of neural networks as just a GPU-friendly form of non-linear regression.

Uncertainty quantification: This is the primary subject of statistics, though I think the term “uncertainty quantification” is much more prevalent in engineering/signal processing than in ML. There are even journals of that title that look sort of like statistics journals.

I’d find reading ML papers easier if there was less meaning drift from well-established terminology. I’m not saying the ML folks should be up to date with Gelman’s idiomatic statistical lexicon (the concepts are fun, and it can be useful for talking to Andrew and people in his circle like the blog readers, but I wouldn’t recommend using these terms in papers without explanation).

I’m sure there are many more terms that have been coined or have drifted one way or another that I’m forgetting about.

26 thoughts on “In ML, everyone’s Humpty Dumpty”

  1. Lovely post. A big “Thank you!!!” for doing my homework for me.

    You wrote: “I think of neural networks as just a GPU-friendly form of non-linear regression.”

    ROFL: Essentially one of the things I’ve been screaming for years. But more accurately stated. Thanks.

    You wrote: “Estimation: Estimation is almost always called “learning” in ML.”

    ROFL. I had wondered what the “learning” in ML actually meant.

    Now I know. And it’s worse than I thought.

    You could also add two definitions of neuron.
    1. Neuron (mammalian/cortical): A distantly-connected device with hundreds of inputs and thousands of outputs.
    2. Neuron (AI/ML): A locally-connected device with a small number of inputs and a small number of outputs.

    • The key insight was that, rather than avoiding overfitting, you want to overfit as much as possible: https://en.wikipedia.org/wiki/Double_descent

      This should be a big hint that regression coefficients and neural network weights are the same thing. And that those coefficients are arbitrary, only having meaning within the context of the model. Thinking otherwise was an even bigger 20th century mistake than NHST.

      • The key insight was that, rather than avoiding overfitting, you want to overfit as much as possible

        This is completely wrong. If you take a neural network and run full batch gradient descent with no early stopping you will overfit your training set and get terrible test set performance.

        • Classical wisdom suggests that complex models should overfit training data significantly. However, surprisingly, this isn’t the case. In fact, these models often generalize better than smaller ones.

          This unexpected finding led researchers like Belkin et al. (2019) to explore older theories. Their fascinating discovery revealed that once a model becomes complex enough to perfectly fit the training data, test error actually begins to improve again as the model increases in size. Deep learning didn’t just uncover this phenomenon known as ‘double descent’—it transformed it into a fundamental concept in our understanding of generalization today.

          https://www.datacamp.com/blog/what-is-double-descent-in-machine-learning

          There are many sources available on this, that’s just my top search result.

        • You are misunderstanding the sources. The parameter space should be able to overfit, but you use techniques like early stopping, minibatches, priors, weight decay/L2 regularization, etc to encourage simpler representations. The idea is that a complex parameter space with soft guides to simplicity is better than manually requiring simplicity with arbitrary hard constraints. You do not “want to overfit”, you still must use regularization.

        • I think you are conflating overfitting with generalization performance. These are no longer the same thing in the “overparameterized” regime. I.e., you can have perfect fit to the train data but still benefit from regularization. It’d be clearer to say there essentially is no more concept of overfitting.

          You can find more discussion here:

          https://www.argmin.net/p/thou-shalt-not-overfit/

        • Sure, if you retreat into a pointless semantic argument, you can debate the merits of any stupid statement. The point is that if you run a hugely overparameterized neural network with no regularization techniques at all, you are essentially guaranteed to get a tiny training set error but fail spectacularly on new data. Hence, I think the advice “you want to overfit as much as possible” is likely to mislead any reader who doesn’t already understand the concept you’re referencing.

        • The point is that if you run a hugely overparameterized neural network with no regularization techniques at all, you are essentially guaranteed to get a tiny training set error but fail spectacularly on new data.

          This is exactly what is NOT correct. It’s already explained in the above links. Instead you may get some efficiency improvements during training, etc., same as tuning any other hyperparameters. It’s not due to overfitting the training data.

        • Lol, you really cannot read. Look, go ahead and do it then. Go around fitting things with full batch gradient descent until training loss plateaus and try to sell your work

    • Thanks for pointing that out. I’m sure that’s part of the issue.

      And now that you mention it, I forgot to include all the terms getting introduced by physicists, at least the corner of ML for science that I’m exposed to at work (we have astro and quantum computational physics centers and our biology center is largely biophysicists).

      For example, I hear people talk about “energy-based models,” which just means “models.” By “energy based,” they mean that you write your density in the form U(x) = -log p(x) so that p(x) = exp(-U(x)). They’re also often called “Boltzmann distributions” or “Gibbs distributions” where there are also units in the denominator, so that they’ll write a density as p(x) = exp( -U(x) / (k_B T) ), where k_B is the Boltzmann constant for working in Kelvin and T is temperature. Then we almost always work unitless just with U(x) unless we’re doing some kind of simulated annealing and play with T. So it’s an influence from statistical physics, which has a lot of contact with statistics and information theory.

      Information theory is another source of confusion because they prove many of the same theorems as in regular statistics, but in very different notation and with a very different focus. For instance, the asymptotic equipartition property (AEP) from information theory is like the law of large numbers from statistics. It’s where Shannon introduced the idea of a “typical set.”

  2. > With the advent of LLMs, anything with an autoregressive structure is now being described as “causal” in a “past causes the future” sense.

    That has been done for decades in econometrics: “There’s an important statistical notion of causality that’s intimately related to forecasting and naturally introduced in the context of VARs. It is based on two key principles: first, cause should occur before effect, and second, a causal series should contain information useful for forecasting that is not available in the other series (including the past history of the variable being forecast). In the unrestricted VARs that we’ve studied thus far, everything causes everything else, because lags of every variable appear on the right of every equation. Cause precedes effect because the right-hand-side variables are lagged, and each variable is useful in forecasting every other variable.” [Elements of Forecasting, Francis X. Diebold]

    • Thanks, that’s really useful. Wasserman includes “supervised learning,” which I forgot to include. This one can be confusing, because training something like an LLM is often considered “unsupervised” even though you’re giving it what a statistician would consider data. A clustering algorithm is typically thought of as “unsupervised,” though you can cast something like K-means clustering as an application of the EM algorithm to optimize a standard normal mixture model.

      “Semi-supervised” just means missing data from a statistical perspective. This is especially easy to handle with a full joint (generative!) model in Bayesian inference. This is behind Rubin’s advice that I picked up through Andrew to think about what you’d do if you had all the data observed, write that model, then just treat what’s missing as unknowns and turn the Bayesian crank. As an aside, Abner Heredia Bustos is writing a new chapter for the Stan User’s Guide on the multiple imputation approach to missing data.

      • Really good points about (semi-)supervised and unsupervised learning!

        I hadn’t heard about Rubin’s advice, that’s a very useful way of thinking about model specification. Will certainly keep it in mind.

        Thank you for the good article and the great reply! Your writing is always illuminating and a pleasure to read.

  3. I don’t have a horse in this race, but I have always been bothered by the way ML people use “inference.” I often hear people use it to simply mean “deployment” of a model.

    • Exactly—it’s used consistently within the field of generative modeling to mean generating from context, for example, a chatbot response or an image generation.

      I still can’t quite tell exactly what the AI/ML folks mean by “generative.” It was one of the first discussions I had with Andrew about statistics, back when I worked in NLP and speech recognition. At that point, ML researchers distinguished generative models like language models or Naive Bayes classifiers or HMMs from what they called “discriminative” models like logistic regression or conditional random fields. Andrew’s point, which I had a hard time understanding at the time before I learned more statistics, was that you could model the covariates in a regression, but the inference for the parameters won’t change if the data’s fully observed. There’s a section of BDA that discusses this. Also, a regression model is already generative for outcomes even without a model of the covariates. This often comes up practically when you want to simulate data for a regression.

      Edit: I should clarify that most of my confusion is around things like normalizing flows or diffusions. With one technique, you can learn (i.e., estimate) the weights from a sample of possible outputs. This is what’s done to train image generation models. Alternatively, you can learn the weights of a normalizing flow or diffusion using variational inference. Even though you’re fitting the same underlying model structure, I think only the training from samples is called “generative”. And as an aside, there’s a really great paper on capacity of normalizing flows that goes over the two approaches concretely by Agrawal and Domke (2024).

  4. This is unrelated to the main content of this post, but is empirical Bayes a good “middle ground” for frequentists and Bayesians to use each other’s methods while still leveraging the strengths of a prior/posterior regime, as Robbins suggests in his video? I’m a student at Berkeley and have only heard about EB through our theoretical stats class, but there wasn’t an emphasis on its “specialness” in relation to, say, hierarchical Bayes, which was suggested as another way to deal with the issue of selecting a prior.

    • Rishab:

      I’d say no, that so-called empirical Bayes is not a middle ground. I see it as approximate Bayes, where inference for some parameters is given by a point estimate rather than a full posterior distribution. Also I don’t like the term “empirical Bayes” as it would seem to imply that usual Bayesian methods are not empirical. There’s no reason that usual Bayesian methods should be considered non-empirical, beyond the general point that any statistical model can be applied non-empirically if it is chosen without regard to the applied problem and if its fit to data is not checked. Recall the saying of the camel and the gnat.

  5. Excellent article. I worry about the next level of interpretation when these terms (such as ‘causal’) move beyond the ML world into the world of managers who are being sold the technology. I just have to think about how the interpretation of words such as ‘significant’ and ‘confidence’ is abused to be really concerned about ‘causal’.

  6. I once reviewed a paper by an eminent cosmologist collaborating with a younger colleague using “non-parametric statistics” for some particular fitting problem (whose details I do not recall). Unfortunately, they took the name at its word, hence assuming that their method really had no degrees of freedom, and evaluated the quality of their fits based on that assumption. (I rejected the paper…)

  7. hi bob & all

    thanks for mentioning our interesting conversation!

    in case anyone is interested, i’m giving a talk about empirical bayes (or “empirical bayes”) in the student seminar tomorrow.

    the info is below. regardless of the label, i think the math & algorithms are interesting :)

    –d

    Student Seminar
    Date: Wednesday, April 22, 2026
    Time: 12:00 p.m.
    Location: Room 1025 SSW

    Speaker: Prof. David Blei (Columbia University)
    Title: A Fresh Look at Empirical Bayes

    Abstract: Empirical Bayes improves simultaneous inference by learning from related data. In this talk, I will present three recent directions in empirical Bayes. First, I will discuss a general method based on probabilistic symmetries, which extends empirical Bayes beyond exchangeable settings to structured problems such as arrays, graphs, conditional data, and spatial models. Second, I will discuss empirical Bayes for implicit likelihoods, where the model is available only through a simulator, and show how simulation-based inference can be used to produce empirical Bayes estimates without evaluating a density. Third, I will discuss an empirical Bayes approach to combining randomized experiments and observational studies, where calibration studies make it possible to learn the distribution of observational bias and use observational data in a principled way. These three ideas illustrate new roles for empirical Bayes in modern statistics and machine learning.

    This is joint work with Sebastian Salazar, Diana Cai, Don Green, Xinwei Shen, Sebastian Wagner-Carena, Bohan Wu, and Cheng Zhang.
