Last week’s summer school on probabilistic AI

(this post is by Charles)

Last week, the Nordic Summer School on Probabilistic AI took place in Copenhagen. I was fortunate to attend some of it (3 out of 5 days) and to teach half a day on Monte Carlo methods. All the course material is available online, including slides, extensive code demos, and exercises. I believe recordings of the lectures will be released.

I’d like to share some thoughts/ideas that came up in class and in the hallway, particularly on the topics of variational inference (VI) and Markov chain Monte Carlo (MCMC). This by no means covers all the subjects taught during the summer school; these are simply two topics close to home for me. (The blogger’s bias)

VI: the main protagonist

VI played the more prominent role: not only was there a full day dedicated to the topic, but many of the methods presented during the deep learning session also used VI for model training.

Not all of VI is black box

I enjoyed revisiting VI from a more classical perspective, notably coordinate ascent VI (CAVI), which requires tailoring the approximating family to the model being fitted. Contrast this with black box VI, where we try to come up with a single variational family that (hopefully) works well across a range of models. Of course, in the context of the Bayesian workflow, black box algorithms are desirable. But where they fall short, more bespoke solutions can be worth pursuing.
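To make the contrast concrete, here is the textbook CAVI update (my notation, not necessarily the course’s). Under a mean-field factorization q(\theta) = \prod_j q_j(\theta_j), the coordinate update that maximizes the ELBO with respect to a single factor is

q_j(\theta_j) \propto \exp\big\{ \mathbb{E}_{q_{-j}}[\log p(y, \theta)] \big\},

where the expectation is taken over all the other factors. This expectation must be tractable, typically thanks to conditional conjugacy, which is exactly why the approximating family has to be tailored to the model.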

A small note on estimating uncertainty with VI

The discussion did not shy away from VI’s potential shortcomings when estimating the uncertainty of a target distribution. Early exercises examined how well VI estimates precision. I found this choice interesting, since I’m more used to thinking about variance (or the even more intuitive standard deviation). In the context of VI, the distinction matters: as my colleagues Lawrence Saul and Loucas Pillaud-Vivien and I recently showed, VI can simultaneously estimate precision very well and variance very poorly! (See here; see also the classic by Turner and Sahani (2011).)
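A classic worked example (my own illustration, not one from the course) makes the distinction concrete. Take a bivariate Gaussian target with unit marginal variances and correlation \rho, and fit a fully factorized Gaussian by minimizing KL(q || p). The precision matrix of the target is

\Lambda = \Sigma^{-1} = \frac{1}{1-\rho^2} \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix},

and the optimal mean-field factors match its diagonal, so that \mathrm{Var}_q(\theta_1) = 1/\Lambda_{11} = 1 - \rho^2. The per-coordinate precision is recovered exactly, but the implied marginal variance is 1 - \rho^2 rather than the true value of 1: for \rho = 0.95, the variance is off by roughly a factor of 10.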

VI for full Bayes and for maximum likelihood estimation

I also noted two distinct applications of VI. First, we approximated the posterior distribution of a Bayesian regression, and our goal was to estimate the posterior mean and precision of interpretable parameters.

Second, we did maximum likelihood estimation (MLE), with the usual trick of maximizing the evidence lower bound (ELBO). In this latter case, VI was used to marginalize out latent variables and approximate a marginal likelihood, in the manner of an expectation-maximization (EM) algorithm. Here, there was no immediate requirement for VI to quantify uncertainty — could this explain why VI works well in this type of application? (Something to dig into.)
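To spell out the objective in this second case (my notation): for data y, latent variables z, and model parameters \theta,

\log p(y \mid \theta) \;\geq\; \mathrm{ELBO}(q, \theta) = \mathbb{E}_{q(z)}[\log p(y, z \mid \theta)] - \mathbb{E}_{q(z)}[\log q(z)],

with equality when q(z) = p(z \mid y, \theta). Maximizing the ELBO over both q and \theta is a variational EM: the exact E-step posterior is replaced by a variational approximation, and the only slack is the gap \mathrm{KL}(q \,\|\, p(z \mid y, \theta)).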

The variational autoencoder (VAE) served as a canonical example. The goal was to point-estimate the weights of the “decoding” neural network by maximizing the likelihood after marginalizing out the latent variables. Jes Frellsen, the instructor, mentioned an application in which the authors use a Bayesian VAE (Daxberger et al., 2019), i.e., train a Bayesian neural network. This brings us back to the full Bayesian case, and I added the paper to my reading list.
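For readers who like to see the objective as code, here is a minimal sketch of the VAE’s ELBO in PyTorch. This is my own illustration, not the summer school’s demo code, and the architecture and Bernoulli likelihood are arbitrary choices; the point is that the decoder plays the role of the model and the encoder plays the role of the (amortized) variational approximation.

import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=8, hidden=128):
        super().__init__()
        # Encoder: amortized variational approximation q(z | x).
        self.enc = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, latent_dim)
        self.enc_logvar = nn.Linear(hidden, latent_dim)
        # Decoder: the generative model p(x | z), whose weights we
        # point-estimate by (approximately) maximizing the marginal likelihood.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, data_dim)
        )

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: one Monte Carlo draw from q(z | x).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Bernoulli likelihood for binarized data (an illustrative choice).
        log_lik = -nn.functional.binary_cross_entropy_with_logits(
            self.dec(z), x, reduction="none"
        ).sum(-1)
        # KL(q(z | x) || N(0, I)), available in closed form for Gaussians.
        kl = 0.5 * (mu**2 + logvar.exp() - logvar - 1).sum(-1)
        return (log_lik - kl).mean()

# Training maximizes the ELBO (minimizes -elbo) jointly over encoder and
# decoder weights, e.g. with torch.optim.Adam.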

ML methods need to be broken down into a model and a training procedure

During the deep learning module, Jes Frellsen carefully distinguished between the model and the training procedure (“inference” in statistics jargon). Many ML methods (e.g., the VAE) include both a model (e.g., the “decoder”) and a learning method (e.g., the “encoder,” i.e., amortized VI). It is enlightening to separate the two: we might, for example, try the same deep learning model with a different inference algorithm. Conversely, we might realize that the inference algorithm in the VAE, i.e., amortized VI, can actually be applied to a broader range of models (see for example Agrawal and Domke (2021); also my paper with Dave Blei).

And now: Monte Carlo methods

In the midst of all this came the time to learn about Monte Carlo methods :) I had a lively lunch conversation with students about whether Monte Carlo methods scale in high dimensions. (It would be worth writing a dedicated blog post on the subject; for now, I’ll simply state the short answer: no, there is no intrinsic curse of dimensionality. The longer answer is… longer.) The module I taught covered standard Monte Carlo (for example, to estimate the ELBO), Markov chain Monte Carlo (MCMC), and importance sampling (to estimate leave-one-out predictions based on posterior samples… from MCMC or from VI…).
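As a taste of that last item, here is a bare-bones numpy sketch of the importance-sampling idea for leave-one-out (LOO) prediction. It is only the raw estimator; in practice one would stabilize the weights, e.g. with Pareto-smoothed importance sampling as implemented in the loo and ArviZ packages, rather than use this directly.

import numpy as np

def is_loo(log_lik):
    """log_lik: array of shape (S, N), the log likelihood log p(y_i | theta_s)
    for S posterior draws and N observations."""
    # Importance weights for the leave-one-out posterior are proportional to
    # 1 / p(y_i | theta_s); the self-normalized estimator then reduces to a
    # harmonic mean of the pointwise likelihoods over the posterior draws:
    # p(y_i | y_{-i}) ≈ S / sum_s exp(-log_lik[s, i]).
    S = log_lik.shape[0]
    neg = -log_lik
    m = neg.max(axis=0)
    log_sum = m + np.log(np.exp(neg - m).sum(axis=0))  # logsumexp, for stability
    return np.log(S) - log_sum  # log p(y_i | y_{-i}), shape (N,)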

In principle, MCMC and VI solve the same problem, and so we ended the module with a high-level comparison between the two approaches. There is a fairly general consensus that, subject to a strict computational budget (i.e., limited computation relative to the complexity/size of the model), VI can achieve better results. However, as computation increases, MCMC eventually achieves a smaller error. This corresponds to the sketch in our Bayesian Workflow paper (Figure 5). Two students presenting at the poster session (Devina Mohan and Benhard, whose last name I didn’t write in my notes… :’( ) observed this trade-off when applying VI and MCMC to their problem.

I believe this trade-off needs to be formalized and better understood. I’m not aware of any theoretical or even conceptual justification for why we might expect this trade-off between VI and MCMC, even if empirical evidence exists…

The other thing I highlighted, building on previous lectures, is that with MCMC we can separately check the inference (using diagnostics such as R-hat, ESS estimates, etc.) and check the trained model (with posterior predictive checks, cross-validation, etc.). In VI, we are typically confined to doing only the second. If the checks for the trained model are good, that’s great; and, one might argue, who cares whether the posterior inference is accurate? On the other hand, if the checks fail, we will not know whether the fault lies in the model, in the inference, or in both. To go back to the VAE: do we have a problem with the decoder or with the encoder?
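For concreteness, the first kind of check is cheap to carry out from the raw draws. Below is a hand-rolled split-R-hat in numpy, following the usual between/within-chain variance comparison; this is a sketch of the idea rather than the implementation used in class (Stan and ArviZ add refinements such as rank normalization).

import numpy as np

def split_rhat(draws):
    """draws: array of shape (M, N), M chains of N draws for one parameter."""
    M, N = draws.shape
    # Split each chain in half so that within-chain non-stationarity also
    # shows up as disagreement between (half-)chains.
    halves = draws[:, : (N // 2) * 2].reshape(2 * M, N // 2)
    n = halves.shape[1]
    chain_means = halves.mean(axis=1)
    W = halves.var(axis=1, ddof=1).mean()   # within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)             # close to 1 when chains mix well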

That said, there are promising approaches to diagnosing the inference of VI: see Yao et al. (2018), Huggins et al. (2020), Biswas and Mackey (2024), and of course, good old (expensive) simulation-based calibration checking (Talts et al., 2018).

Some other notes

The book Deep Generative Modeling by Jakub Tomczak was recommended by several speakers. I myself have this book on my desk. I’ve only read it selectively, but every section I worked through was excellent, and I definitely have a lot to learn from this textbook.

Ole Winther taught a one-hour crash course on stochastic calculus (!!), explaining how SDEs arise as a limiting case of ladder VAEs (wow!). I really appreciated the derivation of Ito’s rule.

Antonio Vergari introduced probabilistic circuits as a “love letter to mixture models”. This demystified the subject and left me wanting to learn more. Several students I spoke with after class expressed a similar sentiment.

Copenhagen was really beautiful and a good deal of fun. I hope to visit again.

6 thoughts on “Last week’s summer school on probabilistic AI”

  1. Hi Charles, about the computational-statistical trade-off aspect of VI, there has been a lot of recent work. The earliest paper that empirically explored this is the following paper by Matt Hoffman and Yi-An Ma:

    https://proceedings.mlr.press/v119/hoffman20a.html

    More formal theoretical results appeared recently:

    https://arxiv.org/abs/2207.11208
    https://arxiv.org/abs/2404.09113
    https://arxiv.org/abs/2305.15349
    https://arxiv.org/abs/2401.10989

    Each of these papers shows that restricting the variational family (or regularizing the objective, in the entropic-regularization case) results in stronger convergence guarantees, which implies that VI exhibits a computational-statistical trade-off. Although it is not theoretically clear how much one pays “statistically” when making that trade, I think your recent works are partially answering this part.

    • Charles was an intern with Matt, and discussion with Matt before that motivated a lot of both Charles’s and my work since then (as well as Matt’s work on parallelizable samplers). Matt’s contention was that short-chain Hamiltonian Monte Carlo (HMC) in the form of the no-U-turn sampler (NUTS) is going to be about the same cost as black-box VI (BBVI) and get similar results. We confirmed that in our Pathfinder paper, where we compare running autodiff variational inference (ADVI) to convergence and running HMC for 75 iterations with a unit metric over 50 or so models in posteriordb (we cut the number we reported in the paper down a bit).

      On the other hand, we showed that Pathfinder was faster than both NUTS and ADVI even without the massive parallelization it admits.

      See: https://jmlr.org/papers/v23/21-0889.html

    • Hi Kyurae, hi Bob,
      Thank you for the comments!

      I really like the Hoffman and Ma paper. From this paper, my rough intuition is that (i) for MCMC, we need a long chain to reduce both the bias and the variance; (ii) for VI, our approximation is deterministic and we’re only reducing bias, so the error is smaller at first (I’m disregarding variance due to stochastic optimization). If you run a lot of chains, you kill off the variance, and now both methods solely focus on reducing bias, with MCMC being asymptotically unbiased.
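      In symbols, and at the risk of oversimplifying: with M independent chains, each of length N, the error of the MCMC estimate behaves roughly like

      \mathrm{MSE} \approx \mathrm{bias}(N)^2 + \frac{\mathrm{var}(N)}{M},

      so massive parallelism removes the second term, and the comparison with a deterministic VI approximation becomes a comparison of biases, where MCMC has the asymptotic (in N) advantage.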

      I think we need to further investigate the use of short, biased Markov chains: can they help us in problems where the results of BBVI are satisfactory…? And indeed, the study in the Pathfinder paper is already quite extensive.

      Regarding the other references: Bohan Wu’s paper with Dave Blei is on my reading list (luckily Bohan and Dave are both regulars at the Flatiron Institute).

      Thank you for the other references, I’ll take a look. (I was familiar with the proof of convergence by Domke and colleagues (https://arxiv.org/abs/2306.03638), though I understand several groups arrived at this result around the same time.)

  2. I’m sorry I missed this. Charles is a really great instructor at all levels (beginner to advanced) and he came back raving about how good the other classes were and how much discussion came out of it (one sign of a good class!).

    We know simulation-based calibration (SBC) will reject if we use a VI method that doesn’t achieve KL-divergence of zero. But it can still be useful for doing an error analysis and seeing where the model is biased or under- or over-dispersed. There are examples in the SBC paper.

    • Bob:

      Just one update: we’re now using the phrase “simulation-based calibration checking,” because the SBC procedure checks calibration; it does not actually calibrate. And then we’re using the phrase “recalibration” for procedures that attempt to calibrate, as in this paper.

  3. > Of course, in the context of the Bayesian workflow, black box algorithms are desirable.

    In many parts of the Bayesian workflow, black box algorithms are convenient, but it’s also part of the workflow to assess whether model-specific and possibly more approximate inference is needed for sufficient speed.
