Bayesian statistics and machine learning: How do they differ?

A researcher who would like to remain anonymous writes:

My colleagues and I are disagreeing on the differentiation between machine learning and Bayesian statistical approaches. I find them philosophically distinct, but there are some in our group who would like to lump them together as both examples of machine learning. I have been favoring a definition for Bayesian statistics as those in which one can write the analytical solution to an inference problem (i.e. Bayes rule). The method of solution proceeds as dictated by the problem difficulty (e.g. analytics, MCMC, etc.). Machine learning, rather, constructs an algorithmic approach to a problem or physical system and generates a model solution; while the algorithm can be described, the “internal solution”, if you will, is not necessarily known.

I suspect you have encountered this debate or nomenclature issue yourself and I am wondering if you have any references, resources or guidance that might help us navigate the nomenclature involved so that we can have accuracy in our language and communications.

My response: The term “machine learning” is not precisely defined, and it can be considered to include Bayesian inference as a special case. Indeed, many machine learning methods are fit using Bayesian or approximately Bayesian methods.

From my perspective, I associate machine learning with big models fit to big data and minimal assumptions: instead of assigning a structure to a model or using strong priors, you just use tons of training data. The result is not free of assumptions—it makes the big assumption that the training data are like the test data, or that any differences in the two datasets can themselves be captured by the model—but they are different sorts of assumptions than are traditionally made in statistical models.

In contrast, I associate Bayesian inference with strong structures and strong priors. That is, the model is doing a lot of the work. It’s possible to do Bayesian inference with flat or weak priors, but the big benefits come with stronger models. It might seem unappealing to let the model do a lot of the work, but you don’t have much choice if you don’t have a lot of data—for example, in political science you won’t have lots of national elections, and in economics you won’t have lots of historical business cycles in your datasets.

The bottom line

Machine learning excels when you have lots of training data that can be reasonably modeled as exchangeable with your test data; Bayesian inference excels when your data are sparse and your model is dense.

52 thoughts on “Bayesian statistics and machine learning: How do they differ?”

  1. I’d say “Bayesian inference” is parameter estimation or comparing how well various models explain the observations.

    “Machine learning” is automated prediction/classification. The simplest method of machine learning is just fitting a line and extrapolating to new data.

    • I’d say the simplest method is to fit a constant. In fact, some of the most successful algorithms are generalizations of this idea, but not of fitting a line.
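
      A minimal sketch of that progression on made-up numbers (the 5.0 split threshold is arbitrary): fitting a constant just predicts the mean everywhere, and a one-split regression stump generalizes it to two constants, the building block of tree and boosting methods.

      import numpy as np

      x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])  # made-up inputs
      y = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.9])     # made-up outputs

      # "Fit a constant": predict the mean everywhere.
      const_pred = y.mean()

      # Generalization: a one-split regression stump, i.e. two constants.
      split = 5.0
      left = y[x < split].mean()    # constant fit on the left region
      right = y[x >= split].mean()  # constant fit on the right region

      print(const_pred, left, right)  # ~3.03, ~1.03, ~5.03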

  2. Leo Breiman’s “Statistical Modeling: The Two Cultures” paper is probably a canonical reading on this topic (and he basically lumps both as different camps under “statistics”). I do like Tom Mitchell’s definition of machine learning, which is relatively precise, but broad and would in fact include most of statistics: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” I think most concise, philosophically coherent definitions of ML would end up including most of statistics.

    But philosophical definitions don’t always describe how people actually use words; people aren’t always coherent. To your point, when most people use the term, they are usually using it to _distinguish_ the “newer” techniques of the “Algorithmic Modeling Culture”, as Breiman calls it, from the rest of “statistics”. And that’s where things get fuzzy. Personally, I consider “statistics” and this “colloquial ML” to be regions on either end of a large continuous landscape of “learning methods” rather than truly distinct categories. I think the things you point out are the right axes of variation that describe this landscape, but there isn’t a real boundary line. The more work you are getting out of an abundance of data, the more you are on the ML side of that landscape. Although I am a Splitter by nature, I think it has been valuable to me to recognize the continuity between the two camps rather than treating them as philosophically distinct things. I think there is a lot of fertile ground in the intermediate zone.

      • Yeah, by “canonical”, I didn’t really mean that it was correct or the final word so much as an important text that has generated and contextualized a lot of the discussion on this topic. Following up on all of the articles that respond to it would be one pretty good way to approach how people think about this topic over the past couple of decades. It should surely be on the syllabus.

  3. Kari Lake still believes she was elected governor of Arizona

    https://www.newsweek.com/kari-lake-shares-election-update-court-appeals-1773625

    and Couy Griffin, the insurrectionist and now former election judge in New Mexico, maintains

    “My vote to remain a no isn’t based on any evidence, it’s not based on any facts, it’s only based on my gut feeling and my own intuition, and that’s all I need.”

    Needless to add, Trump repeatedly says, “I won in a landslide.”

    The underlying assumptions would appear to be far from minimal so, according to what Andrew just wrote, machine learning is not taking place. The priors would seem to be exceedingly strong, so it appears that some sort of Bayesian inference is being done. The sincerity of Lake, Griffin and Trump is beyond question. It is difficult for people with an ax to grind, myself included, to avoid confirmation bias and belief perseverance.

    • I agree with your last sentence, but it’s a stretch to say that these election deniers are doing “some sort of Bayesian inference”–they’re just going with their prior and ignoring the data. (I guess you could say that the likelihood function is a uniform distribution, but it’s hard to see that as “inference.”)

      • > I guess you could say that the likelihood function is a uniform distribution, but it’s hard to see that as “inference.”

        I think, actually, that is precisely what is going on. What, after all, does it mean to distrust someone?

        When I am given “information” from a source I distrust, I explicitly reason that “well, they would say that either way.” In Bayesian terms that means that likelihood ratio is 1, so my posterior probability equals my prior. (This is for the case of estimating a single probability of some event–the math can get a bit more complicated but the concept remains the same for other situations.)
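
        A minimal numeric sketch of that reasoning (the 70% prior and the matched 0.9 likelihoods are made-up numbers):

        # Posterior odds = prior odds * likelihood ratio. If the source
        # "would say that either way," the statement is equally likely
        # under both hypotheses, the ratio is 1, and the posterior
        # equals the prior.
        prior = 0.70                 # hypothetical P(H)
        p_say_given_h = 0.9          # P(source says it | H), assumed
        p_say_given_not_h = 0.9      # P(source says it | not H), assumed equal

        prior_odds = prior / (1 - prior)
        likelihood_ratio = p_say_given_h / p_say_given_not_h   # = 1
        posterior_odds = prior_odds * likelihood_ratio
        posterior = posterior_odds / (1 + posterior_odds)

        print(posterior)             # 0.7, unchanged from the prior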

  4. The really interesting connection between machine learning and Bayesian techniques is how regularization often turns out to be equivalent to a Bayesian prior.

    L2 regularization in particular.
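
    As a minimal sketch of the equivalence on made-up data (sigma and tau are arbitrary choices; the correspondence is lambda = sigma^2 / tau^2), the ridge solution coincides with the MAP estimate under independent normal(0, tau^2) priors:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                 # made-up design matrix
    beta_true = np.array([1.0, -2.0, 0.5])
    sigma = 1.0                                   # known noise sd (assumed)
    y = X @ beta_true + rng.normal(scale=sigma, size=100)

    tau = 0.5                                     # prior sd on each coefficient
    lam = sigma**2 / tau**2                       # equivalent ridge penalty

    # Ridge estimate: argmin ||y - X b||^2 + lam * ||b||^2
    ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

    # MAP under y ~ N(X b, sigma^2 I), b ~ N(0, tau^2 I): maximizing the
    # log posterior gives the same normal equations.
    map_est = np.linalg.solve(X.T @ X / sigma**2 + np.eye(3) / tau**2,
                              X.T @ y / sigma**2)

    print(np.allclose(ridge, map_est))            # True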

    • An important caveat: frequentist or ML regularization is equivalent to a prior plus “maximum a posteriori” (MAP) estimation. It is not equivalent to Bayesian inference in the sense of integrating over a posterior distribution. It can be argued that MAP estimation is not a Bayesian technique at all.

      • According to all known laws of aviation, there is no way a bee should be able to fly. Its wings are too small to get its fat little body off the ground.

        MAP estimation in a Bayesian interpretation produces arbitrary results. You can always reparameterize your problem to produce an equivalent probability space where the MAP estimate is completely different. This is why Stan does not have a “map” method, only a “penalized maximum likelihood,” which does not apply Jacobians to reparameterizations.

        The bee, of course, flies anyway, because bees don’t care what humans think is impossible.

        The MAP estimate can still outperform full Bayes by many common metrics of performance:

        https://statmodeling.stat.columbia.edu/2020/02/13/how-good-is-the-bayes-posterior-for-prediction-really/

        • Somebody:

          If you really think that “according to all known laws of aviation, there is no way a bee should be able to fly,” then you’re a couple known laws of aviation short of a load.

        • I wouldn’t say MAP estimation is entirely arbitrary. Is it possible to reparameterize a model and shift the MAP estimate? Yes. But I’m not sure it’s possible to reparameterize the model and shift the pre-image of the new MAP estimate **arbitrarily far** from the estimate in the original space. Both are going to be within the high probability region, right? At least intuitively it seems like that to me.

        • If “arbitrarily far” means “outside of the support of the distribution” that’s indeed out of reach. Estimates will remain in the “non-zero probability region”.

          If that’s not what you meant where do you think that the limit would be? How do you define that “high probability region”?

        • So suppose you have a parameter q whose posterior distribution is approximately normal(q*,1). Now you define an invertible transformation of that parameter, Q = f(q), with g(Q) being the inverse transformation.

          Now you define a prior on Q, run your Bayesian machinery, and get a MAP value Q*.

          Now g(Q*) is not going to be q*, the original MAP value. But if the priors on q and on Q are not crazy different, is it the case that you can make g(Q*) be “far away” from the high probability region of normal(q*,1)?

          Let’s say high probability region is the smallest region containing 99% of the probability in each space, and “far away” in this sense is a value for example that is hundreds of standard deviations away from q*.
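
          Here’s a minimal numeric sketch of the basic non-invariance in play (a simplified setup, not the extreme transforms discussed below: q has a standard normal posterior and Q = exp(q)). The MAP moves under the reparameterization, but its pre-image stays well inside the normal(0,1) bulk:

          import numpy as np

          # Posterior density of q: standard normal, mode at q* = 0.
          # Reparameterize as Q = exp(q); the density of Q picks up a
          # Jacobian factor |dq/dQ| = 1/Q, i.e. -q on the log scale.
          q_grid = np.linspace(-5, 5, 200001)
          log_p_q = -0.5 * q_grid**2          # log density up to a constant

          log_p_Q = log_p_q - q_grid          # log density of Q = exp(q)

          q_map = q_grid[np.argmax(log_p_q)]            # 0.0
          Q_map_preimage = q_grid[np.argmax(log_p_Q)]   # -1.0 (lognormal mode)

          # The MAP moved (exp(0) vs exp(-1)), but its pre-image, q = -1,
          # is still comfortably inside the high probability region of q.
          print(q_map, Q_map_preimage)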

        • You can do a cheating surgery: take a point (a set of measure zero) and set the density there to any finite value; it’ll be the maximum, but all integrals will be preserved.

          But in practice we need a diffeomorphic reparameterization.

        • > I’m not sure it’s possible to reparameterize the model and shift the pre-image of the new MAP estimate **arbitrarily far** from the estimate in the original space… Both are going to be within the high probability region right?

          > Let’s say high probability region is the smallest region containing 99% of the probability in each space, and “far away” in this sense is a value for example that is hundreds of standard deviations away from q*.

          I don’t get it. Is that 100 sigma bound somehow related to the 99% or is it an absolute bound?

          Do you mean that it may be possible that if the 99% is raised to 99.99% the bound needs to be extended to say 102 sigma?

          Wouldn’t in that case a MAP at 101 sigma obtained by some reparametrization be a) far away using your definition and b) not within the 99% HDR in the original space?

        • (I misread your “hundreds of standard deviations” as “a hundred standard deviations” but I think the point remains valid.)

        • somebody, yes, I was only considering transformations that are at least continuous. It’s not possible to numerically find a discontinuous point anyway.

          Carlos, I’m focusing on situations where the effective model is the same or extremely similar; all that’s changed is the parameterization. That’s enough to change the MAP, but not enough to sort of “separate” the high probability distributions. Of course, if you place a prior on the new Q space which is not in general similar to the prior equivalent to the original one on q space, then you can move the MAP wherever.

          For example, q with prior normal(0,1) and Q = q+1 with prior normal(1000000,1)…

          In this case I would say only priors strongly overlapping the HDR of normal(1,1) would be “allowed”. Let’s say that’s the region [-3,4] in Q space for this example.

          You can move the MAP around without a reparameterization if you change the prior on the original space. That’s neither controversial nor interesting.

        • I might be doing this wrong, but consider the coordinate transformation

          theta' = (1/sqrt(sigma)) erf((theta - c) / sigma)

          The Jacobian would be a Gaussian centered at c that increases without bound as sigma shrinks, so for any bounded density you can make c the MAP.

        • > I’m focusing on situations where the effective model is the same or extremely similar all that’s changed is the parameterization. That’s enough to change the MAP, but not enough to sort of “separate” the high probability distributions.

          I’m not sure if that’s what you mean, but I’m thinking of applying the same reparametrization everywhere. The only thing that matters is the posterior probability.

          For a single variable, the probability that q is between q1 and q2 is the same as the probability that Q is between Q1=f(q1) and Q2=f(q2). It seems to me that one can find a well-behaved transformation f(x) that squeezes some part of the parameter line and spreads another as much as we want.

          I don’t see why the maxima of the densities should be close – in the sense of Qmax being close to f(qmax) – and we know that the high-probability regions will overlap – they have to if their coverage is over 50% – but the overlap region doesn’t necessarily include the maxima.

        • Wait, I did that wrong. I think

          (theta - c) = Erf[theta'] / sigma + theta'

          will work, though I can’t express the inverse in elementary functions. The previous construction was backwards, and also had a bounded codomain.

        • somebody: I haven’t looked specifically at your transform yet.

          Carlos: Yes, you can map different parts of the real line of q space into arbitrarily far away parts of the real line in Q space… but **when you map them back to q space** they won’t be arbitrarily far from the mode in q space. All of the high probability region in the posterior is represented by places in q space that are “close.”

          Basically if in q space your posterior is normal(0,1) and the model is a pure reparameterization to Q space where the region -10 to 10 in q space is mapped onto say 0,inf in Q space… sure in Q space the mode could be at 1million or something…

          but when you bring it back to q space, it doesn’t change the fact that the “high probability region” in q space is sort of -4,4 and whatever your mode at Q=1M was is going to be somewhere in that -4,4 region…

          unless of course you fiddle with regions of essentially zero measure… make just the point Q = 1M have density 1e10 for example as suggested by “somebody”… You could do a similar thing, put .0001 of the probability into a region of width 1e-100 in Q space, it’ll be a super high spike…

          So I guess the requirement is really that you have a Lipschitz transformation with a reasonable bound on the absolute value of the derivative, so that you can’t stuff probability into an arbitrarily thin region.

        • Daniel,

          The key point here is that it collapses a region of space in the neighborhood of c with intensity inversely proportional to sigma, but is otherwise linear. Precisely, the Jacobian of this transformation is (gaussian(theta') / sigma + 1), so the density blows up as you make sigma small. So you can make any point the computable MAP using a Lipschitz-continuous transformation (Lipschitz with respect to any fixed sigma).

          Yeah, if you impose reasonable upper bounds on the absolute derivative, that won’t be the case. But in doing so, you’re privileging one coordinate system. In other words, who’s to say theta' isn’t the good coordinate system and f^-1(theta') the forbidden transformation with the poorly behaved Jacobian? This is fine if the parameters are physically meaningful, as you describe elsewhere. But if they’re coefficients on the QR-decomposed data matrix or weights of a neural network? The philosophy typically touted here and on the Stan discourse is that the parameters are a device for computing posterior predictive distributions. Only tyrants and cowards choose a basis.

          I guess the interesting question is whether the Jacobian correction ever actually improves things in a maximization: is there a class of problems where the MAP performs better than the penalized MLE?

        • Well, yes, to be fair I hadn’t considered the possibility of some enormously high spike of extremely narrow width. If you do that, you can put the high spike well outside the high probability region in q by compressing a tiny amount of probability in the tail of q into an extremely tiny region of Q space.

          But it requires an extremely weird transform, like an inverse transform with a derivative of 10^1000 or something.

          I suppose that sort of thing might happen on the boundary if you’re compressing the whole real line into a compact subset. I’m guessing maybe that’s what somebody was talking about.

        • Also note that you can’t, in practice, calculate with those sorts of transforms using floating point numbers. So from the perspective of “are MAP estimates arbitrary when carried out by a non-adversarial analyst using floating point numbers on a computer,” they’re not really.

        • somebody:

          Suppose you’ve got a model with a parameter q whose posterior is normal(0,1) ish…

          now you create a transform Q = {f(q) = q for q < 100; f(q) = 100 + (q - 100)/exp(100000) for q from 100 to exp(1000); etc.}

          it’s basically just y=x until you hit 100, then it increases at a ridiculously slow rate for “all of the real line well beyond what’s representable in a float” then it’s a y=x transform with an offset again. In other words it compresses the ridiculously small probability out beyond 100 in a normal(0,1) RV into an even more ridiculously small space so that in fact the density near that point Q=100 is locally extremely high.

          To say that this is “bad” is preferentially treating coordinates, yes, but… it’s objectively bad. This kind of transformation can’t be computed with on a computer, and therefore it’s bad, and this transformation puts a very high density peak in a region where total probability is 0 to hundreds or thousands of decimal places. Those things objectively make it bad.

          The kinds of transformations we want to work with in ALL applied work are those that don’t go anywhere near the boundaries of the finite floating point number system, and don’t put arbitrarily large probability density in regions where for “large” distances around there is arbitrarily small total probability. I think those are reasonable criteria.

          But, mathematically, yeah, you can in fact do it. So thanks for pointing it out!

          Going back to your original statement, though, I’d say I really don’t like MAP estimates when I can do Bayes, but if I have to go with a single number, I’d take a MAP estimate based on a quality data set on a parameter space that is reasonably well constructed over an a priori guess by some random “expert” most days of the week.

        • I agree that some coordinate systems are better than others. I don’t have any theoretical language to justify it though.

          But I don’t think the practical problems are as meager as you say they are. The peak density of the standard normal is about 0.399, so its peak log density is only about -0.92. The log Jacobian determinant doesn’t have to be that big to dominate over a couple hundred data points. And people logit-map compact sets onto R^n all the time; it’s not that hard to end up in the high-distortion boundaries, especially with a poorly thought out prior.

          All this to say, in cases where I can only use a maximizing solution, I’m struggling to find a reason not to abandon the reverend, drop the Jacobian, and just MLE it.

    • I’ve heard this so many times now, and it never made sense to me.
      To take your example, L2 regularization is equivalent to Bayesian estimation with independent normal priors if and ONLY IF the hyperparameters are chosen independently of the data. But the hyperparameters are essentially always chosen through some data-dependent procedure, at least for the most common use case of ridge regression (e.g., minimizing mean squared prediction error through cross-validation). The hyperparameters then do not correspond to a Bayesian prior anymore, since a prior can, by definition, not depend on the data.
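
      A small sketch of that point using scikit-learn’s RidgeCV (made-up data): two datasets drawn from the same process yield different cross-validated penalties, so the implied “prior” is a function of the data, which a genuine prior cannot be.

      import numpy as np
      from sklearn.linear_model import RidgeCV

      alphas = np.logspace(-3, 3, 13)
      beta = np.array([1.0, 0.0, -1.0, 0.0, 2.0])  # made-up true coefficients

      # Two independent draws from the same data-generating process: the
      # cross-validated penalty (and hence the "prior" it would imply)
      # changes with the data, so it is not a prior in the Bayesian sense.
      for seed in (0, 1):
          rng = np.random.default_rng(seed)
          X = rng.normal(size=(50, 5))
          y = X @ beta + rng.normal(size=50)
          fit = RidgeCV(alphas=alphas).fit(X, y)
          print(fit.alpha_)   # typically differs between the two draws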

  5. Bayesian inference is when you decide which parameter values to pay attention to based on both information that existed when you built your model and new data, using the mathematics of probability.

    It could be estimation of a mean, simple… Or finding the values for 100M weights in a neural network… Complex…

  6. In principle, one might be able to usefully define Bayesian machine learning as a set of procedures to “directly” obtain the posterior predictive distribution of the outcomes in the (testing) data without first obtaining the posterior distribution of the parameters given the (training) data. Such procedures do not currently exist, except sort of in the rare case where the posterior distribution is naturally conjugate with the prior distribution and you can derive the posterior predictive distribution by analytically marginalizing over the parameters. In the absence of such a procedure to directly obtain the posterior predictive distribution, you see a fork where Bayesians forgo the directness and machine learners forgo the distribution. By that I mean Bayesians get at the posterior predictive distribution in two steps; first draw from the posterior distribution of the parameters and then use each of those parameter realizations to draw from the predictive distribution of the outcome. In contrast, machine learners usually do not draw from any conditional probability distribution, especially a posterior distribution of the parameters.
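
    To make the fork concrete, here is a minimal sketch of the rare conjugate case (a beta-binomial toy model with made-up numbers), comparing the Bayesian two-step route against the direct analytic posterior predictive:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Conjugate toy model: y successes in n trials, theta ~ Beta(a, b)
    # prior, so the posterior is Beta(a + y, b + n - y).
    a, b = 1.0, 1.0          # prior hyperparameters (assumed)
    n, y = 20, 14            # made-up training data
    m = 10                   # size of the future (testing) dataset

    # Two-step Bayesian route to the posterior predictive:
    # (1) draw theta from the posterior, (2) draw y_new given each theta.
    theta = rng.beta(a + y, b + n - y, size=100_000)
    y_new = rng.binomial(m, theta)

    # Direct route, available here only because of conjugacy: the posterior
    # predictive is beta-binomial, from marginalizing theta analytically.
    analytic = stats.betabinom.pmf(np.arange(m + 1), m, a + y, b + n - y)

    print(np.bincount(y_new, minlength=m + 1) / len(y_new))
    print(analytic)          # the two agree up to Monte Carlo error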

    • To reply to myself regarding things that do not exist, there was a paper

      https://rss.org.uk/RSS/media/File-library/Events/Discussion%20meetings/Preprint_Fong-et-al_12-Dec-2022.pdf

      that was recently read at the Royal Statistical Society, which re-derives Bayesian inference using an updating rule for the predictive distribution of an infinite amount future data given a finite amount of past data. In other words, rather than taking the prior distribution over the parameters as a starting point, multiplying by the likelihood, and integrating over the joint distribution of parameters and data to get the denominator of Bayes Rule, the authors take the updating rule for the predictive distribution as primitive and infer the rest of Bayes Rule.

      So, if there were to be a synthesis between Bayesian methods that focus on the posterior distribution and machine learning methods that focus on the process of prediction, I think this paper could be a starting point, but I have a hard time seeing how the updating rule would work in more complicated examples.

  7. My personal view on the difference is this: if you’re doing statistics (including Bayesian statistics), you usually care about your model parameters and want to know what they are. Whereas if you’re doing machine learning, all parameters are nuisance parameters — you usually don’t really care what they are, you just want a good prediction.

    • Cs:

      Sure, but that just pushes it back one step. If I’m doing statistics, for example modeling elections or pollutants or the dose-response function of a drug or whatever, my ultimate goal almost always is predictions—indeed, causal inferences are just predictions under counterfactual assumptions—and the way I get to this point is by estimating parameters. As de Finetti might have put it, parameters are just a way to get to those predictive inferences, indeed I could express all my Bayesian modeling in terms of observables.

      To put it another way: when we “do statistics,” we also care about prediction. We do it through parameters because that works. And when we “do machine learning,” we also care about parameters. We can just express these in terms of predictive quantities.

      • Sometimes the parameters are fundamental aspects of reality you can’t observe but care about (for example, the viscosity of a fluid, or the temperature of an object you are observing with a telescope); other times they’re a representational convenience (basis coefficients for the representation of a force vs. time curve). Both are “doing statistics.” If you’re doing Bayesian statistics, you are putting a probability distribution over them. That’s what makes it Bayesian.

      • I guess it’s true that ultimately everyone does care about downstream usefulness. But I do think there’s a real distinction in the paths you can take to get from data+assumptions to downstream usefulness, and this captures something of what people mean when talking about “statistics” vs. “ML”.

        I actually don’t understand what your last sentence means, and suspect I’m missing something interesting there.

        I don’t think “big data/minimal assumptions” can account for all of what distinguishes machine learning, especially pre-2012 style ML. For example boosting can be used on small datasets, and can bake a decent amount of structure into the model if you want. But it came out of the ML community, motivated by theoretical questions around PAC learning, and it’s somehow very ML in spirit.

      • > To put it another way: when we “do statistics,” we also care about prediction.

        The vast majority of statistical models (Bayesian or not) are published once, some inference is drawn about the parameter values, then that model is never used again. The next study will use a new model, at least a new set of coefficients/parameters derived from new data.

        This is very different from ML where the goal is to plug new data into the same model repeatedly in order to predict new outcomes.

        Seems to me like a clearcut distinction between “inference” and “prediction”.

    • I think that is a very limited view of ML – that “you usually don’t really care what they are, you just want a good prediction.” Most ML models do provide measures of the most important variables as well as how they affect the predictions and these are (or at least should be) important aspects of the model.

  8. @Andrew You are way more qualified than I am to comment on Bayesian inference, but I am still puzzled by your bottom line on Bayesian inference vs. ML. To me, it represents more how things currently are in practice than how they are fundamentally.

    Whether we are talking about a small GLM or a >>1M-parameter neural network, Bayesian inference is the thorough way of defining how to learn what parameter values explain the data. In the case of common statistical models, one can often evaluate the result of Bayesian inference applied to the parameter estimation problem; in the case of >>1M-parameter neural networks, most give up and resort to MLE estimates obtained using gradient descent. But by doing so:
    1) they are unable to account for the epistemic uncertainty associated with the limited amount of data. Even if you have lots of training data, complex problems require large models (e.g., >>1M parameters), so the amount of data available remains small for a large number of edge cases. In applications such as health or autonomous driving, you cannot afford to be wrong 1% of the time while being unaware of it.
    2) they are unable to learn online as data comes in, or to adapt to changing conditions.
    3) they have a limited capacity to couple statistical and ML models to have the best of both worlds.

    On the one hand, as you pointed out, ML and statistical modelling are two different paradigms for building models. On the other hand, Bayesian inference is the probabilistic framework common to both paradigms for learning the parameter values. The field of statistical modelling is mature enough to rely on Bayesian inference for practical applications; ML is not. In my opinion, ML will have to move away from gradient descent and rely on Bayesian inference for practical applications, even when dealing with large datasets. We are not there yet.

    • The problem here in my view is that “accounting for epistemic uncertainty” the Bayesian way will require an epistemically meaningful prior. As long as you don’t know how to specify this for the >1M parameters of a deep network, say, Bayesian inference can’t properly account for epistemic uncertainty either.

      Note also that many Bayesian priors that are good at regularisation, and therefore really useful for ML tasks, in fact are not epistemically meaningful (or at least those who use them don’t usually explain whether or why they are).

      • Christian

        I wonder what the criterion for “epistemically meaningful” is, because I do not see the issue.

        When defining the prior for a Bayesian neural network (irrespective of the number of parameters and layers), you start with the assumption that your inputs and outputs will be standardized, i.e., N(0,1). This constrains your prior for the weights and biases on each layer as a function of the specific architecture and the number of hidden units on the subsequent layer. The result is that, before training, you obtain N(0,1) outputs for N(0,1) inputs.

        Although the posterior for a specific weight is not interpretable like the one for the slope “a” of a model y = ax + b, it remains representative of how much parameter uncertainty remains after training. Just like for any other model, the epistemic uncertainty about the weights and biases will be included in your predictions. In addition, if you later add new data, the posterior you obtained for the first set of data becomes your new prior, so you do not have to restart from scratch.

        With that in mind, my opinion is that such a prior is both “epistemically meaningful” and good at regularization.
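
        A minimal sketch of that prior-scaling argument (one hidden tanh layer, made-up sizes, biases omitted; nothing here is specific to any library): with standardized inputs and weight priors of sd 1/sqrt(fan-in), a draw from the prior gives outputs on roughly unit scale.

        import numpy as np

        rng = np.random.default_rng(0)

        # One hidden layer, made-up sizes. Each pre-activation is a sum of
        # fan_in terms of variance 1/fan_in, so it comes out roughly N(0,1).
        n_in, n_hidden, n_out = 10, 500, 1
        x = rng.normal(size=(10_000, n_in))          # standardized inputs

        W1 = rng.normal(scale=1 / np.sqrt(n_in), size=(n_in, n_hidden))
        h = np.tanh(x @ W1)                          # hidden layer
        W2 = rng.normal(scale=1 / np.sqrt(n_hidden), size=(n_hidden, n_out))
        out = h @ W2

        print(x.std(), h.std(), out.std())           # all of order 1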

        • James: Sorry, I can’t really follow. As far as I’m concerned (which is in line with what Wikipedia says on that matter;-), epistemic uncertainty refers to knowledge (or lack of it) that we have about the real system that is modelled. Nothing of your text above refers to this, so I can’t bring it together with what I think is epistemic uncertainty (and I’m not alone).

  9. I recently read a book that might be relevant to this thread. It’s called Thinking About Statistics: The Philosophical Foundations, by Jun Otsuka. The focus is on the semantics, ontology, and epistemology of Bayesian stats, Frequentist stats, causal inference, and machine learning. Also, the book is short, just the way I like them.
