Anything you can do with Bayesian inference you can do in other ways. Bayesian inference is a bit like calculus: You can do derivatives and integrals without calculus (indeed, mathematicians in pre-Newtonian times were able to compute limits, with care), but calculus makes it a lot easier. Similarly, I find that Bayesian inference makes it a lot easier to combine information.

This came up in comments a few years ago:

Anything you can do with Bayesian inference you can do in other ways. Bayesian inference is a bit like calculus: You can do derivatives and integrals without calculus (indeed, mathematicians in pre-Newtonian times were able to compute limits, with care), but calculus makes it a lot easier. Similarly, I find that Bayesian inference makes it a lot easier to combine information. For example, I’m sure that someone could do MRP non-Bayesianly–and indeed there is a non-Bayesian tradition of partial pooling for small-area estimation in sample surveys–but I think it’s no coincidence that the widespread use of MRP has come along with the Bayesian approach.

If you look at my applied research papers, you’ll see a lot of analyses that maybe could’ve been done in non-Bayesian ways but which my colleagues and I in fact did Bayesianly, and which I suspect would never have been solved had we not had Bayesian tools.

There are also a lot of non-Bayesian success stories in statistics, and that’s fine too.

Bayesian inference is many things. It’s a set of tools for solving problems and also a framework for understanding statistical methods. Other statistical approaches similarly serve this dual duty: for example, classical hypothesis testing is a set of methods and also a framework in which statistical inference is viewed as a set of testing problems. I don’t find that particular framework very helpful–indeed, I think it often gets in the way–but I do recognize that there are many problems for which methods developed in that tradition can be useful. Recall our discussion of lasso.

6 thoughts on “Anything you can do with Bayesian inference you can do in other ways. Bayesian inference is a bit like calculus: You can do derivatives and integrals without calculus (indeed, mathematicians in pre-Newtonian times were able to compute limits, with care), but calculus makes it a lot easier. Similarly, I find that Bayesian inference makes it a lot easier to combine information.”

  1. Dear Professor Gelman,

    Thank you for the interesting topic. I would like to add an opinion.
    If a statistical model has latent variables or a hierarchical structure, Bayesian statistics provides a more natural and accurate inference compared to other methods. Of course, other methods can be used as well, but additional careful design is necessary to prevent parameter divergence. With Bayesian statistics, you can freely experiment with any prior distribution you like.

    • Sumio:

      I agree. Another way of putting it is that when non-Bayesian methods deal with latent variables, they will often use something like conditional Bayesian inference. For example, if you use lme4 to fit a multilevel model, it gives a problematic point estimate of the variance parameters (see here and here for discussion of the problems, along with possible solutions), but the inference for the intermediate parameters–the so-called random effects–is Bayesian, conditional on the variance parameters.

  2. [edit: submitted before I was done]

    Given a target density p(a, b), I thought lme4 did this:

    a* = ARGMAX_a p(a)
    b* = ARGMAX_b p(a*, b)

    Is that not right? If it is right, what’s Bayesian about this?

    I would say that if you’re calculating derivatives and integrals, you are doing calculus. That’s just the definition. Andrew could lobby for a different definition of “calculus” than is in English dictionaries, but that’s a different blog post.

    I think I understand what Andrew means by this post, but find its statement confusing. It’s not that you can literally do anything you can do in Bayes in another framework, either numerically or in terms of interpretation; it’s that you can do computations that can lead to similar conclusions. Chapter 3 of BDA has more detail on lining up frequentist confidence intervals and Bayesian posterior intervals in a more technical way, but that’s only possible in some simple cases.

    Looking it up, I’m curious what Andrew thinks today about this from the first paragraph of Chapter 3:

    The extent to which a noninformative prior distribution can be justified as an objective assumption depends on the amount of information available in the data: …

    I’m curious if Andrew would write that if he was rewriting BDA from scratch in 2025. I’ve also always been confused by what people mean by “noninformative”. In this case, I think Andrew’s just comparing the information in prior vs. likelihood asymptotically as the data grows without bound.

    I characterize the Rubin/Gelman philosophy as pragmatic Bayes to contrast it with both objective Bayes (reference priors only) and subjective Bayes (my one true prior only). To bring this full circle, one of the primary aspects of pragmatic Bayes is its focus on frequentist calibration of posterior predictive inference.

    • Bob:

      I’m not sure what you mean by a and b in your notation. I think that what lme4 does is optimize the marginal likelihood of the variance parameters, integrating out the linear parameters (doing an approximate integral if the model is not normal and linear). Inference for the hyperparameters is marginal maximum likelihood, so not particularly Bayesian (except that it can be interpreted as an approximation to the posterior distribution with a flat prior). Inference for the linear parameters is Bayesian conditional on the point estimates of the hyperparameters.

      P.S. When I say computing derivatives and integrals without calculus, I’m referring to the solution of problems such as computing the volume of a pyramid or sphere: these are problems that can be solved by computing integrals, or they can be solved without reference to calculus, as the ancient Greeks did it.
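As a rough illustration of the P.S. above, here is a minimal Python sketch that brackets the volume of a pyramid using only finite sums of stacked slabs, with no explicit integral. The function name `slab_bounds` and the example dimensions are hypothetical, and this is not a reconstruction of any historical proof; it only shows that the bounds squeeze toward base area times height divided by 3.

```python
def slab_bounds(base_area, height, n):
    """Bracket a pyramid's volume with n stacked slabs of equal thickness.

    At a fraction t of the way down from the apex, the cross-section is
    similar to the base, so its area is base_area * t**2.
    """
    dz = height / n
    # Slabs that fit inside the pyramid (use the smaller face of each layer).
    inner = sum(base_area * (k / n) ** 2 * dz for k in range(0, n))
    # Slabs that contain the pyramid (use the larger face of each layer).
    outer = sum(base_area * (k / n) ** 2 * dz for k in range(1, n + 1))
    return inner, outer


# Hypothetical example: base area 9, height 4, so base_area * height / 3 = 12.
for n in (10, 100, 1000):
    inner, outer = slab_bounds(base_area=9.0, height=4.0, n=n)
    print(n, inner, outer)
```

The gap between the two bounds is exactly base_area * height / n, so it can be made as small as desired by taking more slabs.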

      • Sorry, Andrew—I wasn’t finished writing that comment and submitted too early. I asked ChatGPT and what it suggested is the following, which doesn’t match what either of us said (unless I was misunderstanding you). Here a is the vector of high-level hierarchical variance parameters, b is the vector of lower-level regression coefficients, and y is the observed data.

        a* = ARGMAX_a p(y | a)
           = ARGMAX_a INTEGRAL p(y, b | a) db
           = ARGMAX_a INTEGRAL p(y | b, a) p(b | a) db
        
        b* = ARGMAX_b p(y | a*, b)
        

        I think the p(b | a) is the Laplace approximation, but I really can’t make the notation line up. Here’s Doug Bates’s (lme4’s author’s) 100+ page paper describing lme4, which is presented in a frequentist and linear algebra language I find very challenging: https://people.math.ethz.ch/~maechler/MEMo-pages/lMMwR_2018-03-05.pdf

        • Bob:

          I don’t know what a, b, a*, and b* are in your notation. I think the description in my above comment is correct, although I’m not 100% sure whether the unmodeled coefficients (the so-called fixed effects, i.e. the linear parameters that are given flat priors in the Bayesian interpretation) are integrated out to obtain the point estimate of the variance parameters. Conditional on the variance parameters, lme4 is giving Bayesian inferences, but I don’t think it produces posterior draws; I think it just spits out posterior means and standard deviations (again, conditional on the point estimate of the variance parameters). When writing my book with Jennifer, I wrote programs to produce posterior draws (again, conditional on the point estimate of the variance parameters), but now I’ll just fit these models using rstanarm. I’m hoping the new nested Laplace in Stan will allow us to fit these models fast, thus providing a functional replacement for lme4. (When lme4 works, it’s fine, but (a) it doesn’t account for uncertainty in the variance parameters, and (b) it’s not so easy to include priors on the variance parameters. We do some of this with our blme package, which alters lme4 to include priors, but it’s still kind of awkward. Also, (c) lme4, like other point estimation methods, sometimes has computation problems.)
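The two-step idea discussed in this thread (a point estimate of the variance parameters from the marginal likelihood, then normal inference for the group-level parameters conditional on that estimate) can be sketched on a toy normal model. This is a minimal sketch under stated assumptions, not lme4's actual algorithm: the within-group standard errors `se` are treated as known, the overall mean is profiled out rather than integrated out, the maximization is a crude grid search, and the data are simulated. All function and variable names are hypothetical.

```python
import math
import random


def profile_loglik(tau, ybar, se):
    """Marginal log-likelihood of tau with the group effects integrated out.

    Marginally, ybar_j ~ Normal(mu, tau^2 + se_j^2). The unmodeled mean mu
    is profiled out (set to its weighted-mean maximizer for this tau).
    """
    var = [tau ** 2 + s ** 2 for s in se]
    weights = [1.0 / v for v in var]
    mu = sum(w * y for w, y in zip(weights, ybar)) / sum(weights)
    ll = sum(-0.5 * math.log(2 * math.pi * v) - 0.5 * (y - mu) ** 2 / v
             for v, y in zip(var, ybar))
    return ll, mu


def two_step_fit(ybar, se, tau_grid):
    # Step 1: point estimate of the hyperparameters by marginal maximum likelihood.
    tau_hat = max(tau_grid, key=lambda t: profile_loglik(t, ybar, se)[0])
    mu_hat = profile_loglik(tau_hat, ybar, se)[1]
    # Step 2: normal posterior for each group mean, conditional on (mu_hat, tau_hat).
    post = []
    for y, s in zip(ybar, se):
        precision = 1.0 / s ** 2 + 1.0 / tau_hat ** 2
        mean = (y / s ** 2 + mu_hat / tau_hat ** 2) / precision
        post.append((mean, precision ** -0.5))
    return mu_hat, tau_hat, post


# Simulated (hypothetical) data: 8 groups, true mu = 5, true tau = 3.
rng = random.Random(1)
se = [2.0] * 8
theta = [rng.gauss(5.0, 3.0) for _ in se]
ybar = [rng.gauss(t, s) for t, s in zip(theta, se)]

tau_grid = [0.05 * k for k in range(1, 401)]  # 0.05 to 20; avoids tau = 0 exactly
mu_hat, tau_hat, post = two_step_fit(ybar, se, tau_grid)
for y, (m, sd) in zip(ybar, post):
    print(f"ybar={y:6.2f}  cond. post. mean={m:6.2f}  sd={sd:4.2f}")
```

The conditional standard deviations in step 2 ignore uncertainty in `tau_hat`, which is point (a) in the comment above; a full Bayesian fit would average over the posterior distribution of the variance parameter instead of plugging in a single value.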
