## Dan’s Paper Corner: Yes! It does work!

Only share my research
With sick lab rats like me
Trapped behind the beakers
Cut off from the world, I may not ever get free
But I may
One day
Trying to find
An antidote for strychnine — The Mountain Goats

Hi everyone! Hope you’re enjoying Peak Libra Season! I’m bringing my Air Sign goodness to another edition of Dan’s Paper Corner, which is a corner that I have covered in papers I really like.

And honestly, this one is mostly cheating. Two reasons really. First, it says nice things about the work Yuling, Aki, Andrew, and I did and then proceeds to do something much better. And second because one of the authors is Tamara Broderick, who I really admire and who’s been on an absolute tear recently.

Tamara—often working with the fabulous Trevor Campbell (who has the good grace to be Canadian), the stunning Jonathan Huggins (who also might be Canadian? What am I? The national register of people who are Canadian?), and the unimpeachable Ryan Giordano (again. Canadian? Who could know?)—has written a pile of my absolute favourite recent papers on Bayesian modelling and Bayesian computation.

Here are some of my favourite topics:

As I say, Tamara and her team of grad students, postdocs, and co-authors have been on one hell of a run!

Which brings me to today’s paper: Practical Posterior Error Bounds from Variational Objectives by Jonathan Huggins, Mikołaj Kasprzak, Trevor Campbell, and Tamara Broderick.

In the grand tradition of Dan’s Paper Corner, I’m not going to say much about this paper except that it’s really nice and well worth reading if you care about asking “Yes, but did it work?” for variational inference.

I will say that this paper is amazing and covers a tonne of ground. It’s fully possible that someone reading this paper for the first time won’t recognize how unbelievably practical it is. It is not trying to convince you that its new melon baller will ball melons faster and smoother than your old melon baller. Instead it stakes out much bolder ground: this paper provides a rigorous and justified and practical workflow for using variational inference to solve a real statistical problem.

I have some approximately sequential comments below, but I cannot stress this enough: this is the best type of paper. I really like it. And while it may be of less general interest than last time’s general theory of scientific discovery, it is of enormous practical value. Hold this paper close to your hearts!

• On a personal note, they demonstrate that the idea in the paper Yuling, Aki, Andrew, and I wrote is good for telling when variational posteriors are bad, but the k-hat diagnostic being small does not necessarily mean that the variational posterior will be good. (And, tbh, that’s why we recommended polishing it with importance sampling)
• But that puts us in good company, because they show that neither the KL divergence that’s used in deriving the ELBO nor the Rényi divergence is a particularly good measure of the quality of the solution.
• The first of these is not all that surprising. I think it’s been long acknowledged that the KL divergence used to derive variational posteriors is the wrong way around!
• I do love the Wasserstein distance (or as an extremely pissy footnote in my copy of Bogachev’s glorious two-volume treatise on measure theory insists: the Kantorovich–Rubinstein metric). It’s so strong. I think it does CrossFit. (Side note: I saw a fabulous version of A Streetcar Named Desire in Toronto [Runs til Oct 27] last week and really it must be so much easier to find decent Stanleys since CrossFit became a thing.)
• The Hellinger distance is strong too and will also control the moments (under some conditions. See Lemma 6.3.7 of Andrew Stuart’s encyclopedia)
• Reading the paper sequentially, I get to Lemma 4.2 and think “ooh. that could be very loose”. And then I get excited about minimizing over $\eta$ in Theorem 4.3 because I contain multitudes.
• Maybe my one point of slight disagreement with this paper is where they agree with our paper. Because, as I said, I contain multitudes. They point out that it’s useful to polish VI estimates with importance sampling, but argue that they can compute their estimate of VI error instead of k-hat. I’d argue that you need to compute both because just like we didn’t show that small k-hat guarantees a good variational posterior, they don’t show that a good approximate upper bound on the Wasserstein distance guarantees that importance sampling will work. So ha! (In particular, Chatterjee and Diaconis argue very strongly, as does MacKay in his book, that the variance of an importance sampler being finite is somewhere near meaningless as a practical guarantee that an importance sampler actually works in moderate to high dimensions.)
• But that is nought but a minor quibble, because I completely and absolutely agree with the workflow for Variational Inference that they propose in Section 4.3.
• Let’s not kid ourselves here. The technical tools in this paper are really nice.
• There is not a single example I hate more than the 8 schools problem. It is the MNIST of hierarchical modelling. Here’s hoping it doesn’t have any special features that make it a bad generic example of how things work!
• That said, it definitely shows that k-hat isn’t enough to guarantee good posterior behaviour.
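
To make the importance sampling quibble above a bit more concrete, here is a minimal sketch of how a perfectly well-behaved importance sampler can quietly fall apart as the dimension grows. To be clear, this is not the k-hat diagnostic and it is not the error bound from the paper. It is a toy I made up for illustration: the target is a standard Gaussian, the proposal is a slightly wider Gaussian (so the weights are bounded and the variance is finite), and the function name `ess_fraction`, the proposal scale, and the number of draws are all my own assumptions.

```python
import numpy as np


def ess_fraction(dim, proposal_scale=1.2, n_draws=10_000, seed=0):
    """Effective sample size divided by the number of draws.

    Toy setting (an assumption for illustration, not from the paper):
    target N(0, I), proposal N(0, proposal_scale^2 * I).
    """
    rng = np.random.default_rng(seed)
    # Draw from the proposal.
    x = proposal_scale * rng.standard_normal((n_draws, dim))
    sq_norm = np.sum(x**2, axis=1)
    # log target density minus log proposal density (2*pi terms cancel).
    log_w = (
        -0.5 * sq_norm
        + 0.5 * sq_norm / proposal_scale**2
        + dim * np.log(proposal_scale)
    )
    # Subtract the max before exponentiating for numerical stability.
    w = np.exp(log_w - log_w.max())
    # Standard effective sample size estimate, as a fraction of n_draws.
    return w.sum() ** 2 / (n_draws * np.sum(w**2))


if __name__ == "__main__":
    for dim in (1, 8, 100):
        print(dim, ess_fraction(dim))
```

In this Gaussian toy the expected squared weight is a product over coordinates, so the effective sample size fraction shrinks roughly geometrically with the dimension even though the variance is finite at every dimension. That is the sort of thing an 8-dimensional example will struggle to show you.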

Anyway. Here’s to more papers like this and to fewer examples of what the late, great David Berman referred to as “ceaseless feasts of schadenfreude”.

1. Thanks for all your very kind words Dan! To clarify one point regarding your “slight disagreement”: we actually agree Wasserstein isn’t the right way to check if IS will work well. However, we do think 2-divergence *is* a good way. That’s why we advocate making sure both are small! (In particular, see step 9 of our proposed workflow.) We should actually add a reference to Diaconis and Chatterjee. In fact, their IS error bound is in terms of KL, which is upper-bounded by the 2-divergence we are bounding.

• Aki Vehtari says:

I agree it’s an excellent paper!

A quick comment, in a hurry: you should cite the latest version of the PSIS paper, with Dan and Yuling included. You do mention the C & D connection above, which is also in the new version, but the new version also has a discussion of how, even when k<0.5 by construction, a finite-sample khat>0.5 is a useful diagnostic for pre-asymptotic behavior.

2. Andrew says:

Dan:

I like how you give out such nice compliments to everybody! It’s a good counterpoint to my typically acerbic posts.

• Dan Simpson says:

The benefit of not blogging frequently is that I can just talk about stuff I like :p

• Some people are grateful for what they have. Irving Janis, I believe said that ‘magnanimity’ is an underrated disposition in decision making.

3. Phil says:

Dan, what do you think is wrong with 8 schools? I’m on record as saying that it contains just about everything you, or at least I, need to understand in order to know what is going on with Bayesian multilevel models. Actually I think I said everything, not ‘just about’ everything, but that’s a bit of hyperbole. In any case, any time someone asks the point of all this Bayesian stuff, that’s the example I go to. It is really easy to understand, and it contains all of the essential elements. What’s not to like? Or, maybe this is a better question: what is a _better_ example?

• Aki Vehtari says:

One thing is that 8 schools is so low dimensional that Gaussian and Student’s t are good enough as proposal distributions for importance sampling. We should have more standard examples with more than 100 dimensions.

• Phil says:

Importance sampling seems to me to be a computational detail rather than a concept one needs in order to understand what Bayesian statistics is about. Computational details are important in practice, of course, but they’re about the ‘how’, and the beauty of the 8 schools example is that it is easy to understand the ‘why’.

• Andrew says:

I agree with Phil on this. If Dan wants to hate the 8 schools example, that’s his right to hate whatever he wants.

But it seems like the consensus here is that the 8 schools example is great, but there are certain aspects of hierarchical modeling computation that it doesn’t encompass.

That’s fine but then it seems the most sensible reaction is not to hate the 8 schools but to want to supplement it with other benchmark problems. No reason to think that one example can do it all. As Aki says, we just need more examples.

• Dan Simpson says:

I’m gonna write a blog post :p

4. Z says:

I always thought it was a bit out of character for Brando to be so perfectly ripped as Stanley. Like, Stanley should be a guy who’s strong from doing his blue collar job but doesn’t actually go to the gym.

5. Corey says:

I’ve skimmed the start of the first infinitesimal jackknife paper and read the abstract of the coresets papers and it seems to meet there’s

• Corey says:

ugh, stupid phone

will finish this later

• Ryan Giordano says:

Thank you so much for the supportive post! I am consistently proud to be part of Team Broderick.

For what it’s worth, I’m not a Canadian myself, but I am married to one.

6. Keith O’Rourke says:

> There is not a single example I hate more than the 8 schools problem.
Me too – OK as a warm up example, but not much more.