How good is the Bayes posterior for prediction really?

It might not be common courtesy on this blog to comment on a very recently arXived paper. But I have seen two copies of the paper entitled “How good is the Bayes posterior in deep neural networks really?” left on the tray of the department printer during the past weekend, so I can hardly ignore the popularity of the work.

So, how good is the Bayes posterior in deep neural networks really, especially when it is inaccurate or bogus?

The paper argues that in a deep neural network, for prediction purposes, the full posterior yields worse accuracy/cross-extropy than point-estimation procedures such as stochastic gradient descent (SGD). Even more strikingly, it claims that a quick remedy, no matter how ad hoc it may look, is to reshape the posterior via a power transformation, which they call a “cold posterior” $latex \propto (p(\theta|y))^{1/T}$ for temperature T<1. Effectively the cold temperature concentrates the posterior density more around the MAP. According to their empirical evaluation, the authors claim the cold posterior is superior to both the exact posterior and the point estimate in terms of predictive performance.
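
(To spell out the tempering in my own notation, not quoting the paper's equations: taking the log of the cold posterior gives $latex \log p_T(\theta|y) = \frac{1}{T}\left[\log p(y|\theta) + \log p(\theta)\right] + \mathrm{const}$, so for T<1 the whole log posterior is scaled up, the density is sharpened, and in the limit T → 0 it collapses onto the MAP.)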

First of all, I should congratulate the authors on their new empirical results in Bayesian deep learning, a field that was advocated a few decades ago but has been neither rigorously defended nor comprehensively applied, at least for modern-day deep models. Presumably the largest barrier is computation, as exact sampling from the posterior of a deep net is infeasible on today's computers.

Indeed, even in this paper, it is hard for me to tell whether the undesired performance of “exact” Bayesian posterior prediction should be attributed to Thomas Bayes or to Paul Langevin: the authors use overdamped Langevin dynamics to sample from the posterior, while omitting the Metropolis adjustment. They do report some diagnostics, but these are not exhaustive; a more apparent and arguably more powerful check is to run multiple chains and see whether they have mixed. Later, in their section 4.4, the authors seem not to distinguish between the Monte Carlo error and the sampling error, and suggest that even the first term can be too large in their method, which makes me even more curious how reliable the “posterior” is. Further, according to figure 4, there is an obvious discrepancy between the samples (over all temperatures) drawn from HMC and from the Langevin dynamics used here, even in a toy example with network depth 2 or 3. I don't think HMC itself is necessarily the gold standard either: in a complicated model such as a ResNet, HMC suffers from multimodality and non-log-concavity and would hardly mix. So are we just blaming the Bayesian posterior based on some points that were drawn from a theoretically biased and practically hardly-converged sampler?
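
For readers less familiar with it, here is a minimal numpy sketch of the unadjusted overdamped Langevin update on a toy Gaussian target. This is my own illustration of the generic update, not the authors' SG-MCMC implementation; the function name, step size, and target are all made up for the example.

```python
import numpy as np

def ula_draws(grad_log_post, theta0, step, n_iter, rng):
    """Unadjusted (overdamped) Langevin algorithm: take a gradient step plus
    Gaussian noise, but never apply the Metropolis accept/reject correction,
    so the stationary distribution is biased for any finite step size."""
    theta = np.asarray(theta0, dtype=float).copy()
    draws = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        theta = (theta + 0.5 * step * grad_log_post(theta)
                 + np.sqrt(step) * rng.standard_normal(theta.shape))
        draws[t] = theta
    return draws

# Toy target: a standard normal, whose log-density gradient is simply -theta.
rng = np.random.default_rng(0)
draws = ula_draws(lambda th: -th, theta0=np.zeros(2), step=0.5, n_iter=20000, rng=rng)
print(draws.var(axis=0))  # noticeably above the true variance of 1: discretization bias
```

Adding a Metropolis accept/reject step (i.e., MALA) would remove this discretization bias at the cost of evaluating the target density; skipping it, as in the paper's sampler, leaves a bias that grows with the step size.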

In section 5, the authors conjecture that deep learning practice violates the likelihood principle because of some computational techniques such as dropout. To me, this is not really a question of the likelihood principle; rather, it is another reason why the posterior from the proposed sampler is computationally concerning.

To be fair, for the purpose of point estimation, even SGD is not guaranteed to converge to the global optimum in theory, nor to approximate it well in practice, while in most empirical studies it still yields reasonable predictions. This reminds me of a relevant paragraph by Gelman and Robert (2013):

In any case, no serious scientist can be interested in bogus arguments (except, perhaps, as a teaching tool or as a way to understand how intelligent and well-informed people can make evident mistakes, as discussed in chapter 3 of Gelman et al. 2008). What is perhaps more interesting is the presumed association between Bayes and bogosity. We suspect that it is Bayesians’ openness to making assumptions that makes their work a particular target, along with (some) Bayesians’ intemperate rhetoric about optimality.

Sure, I guess it might also be unfair that we are kind of rewarding Bayes for being more likely to produce fragile and bogus computational results. That said, the discussion here does not dismiss the value of this new paper, part of whose merit is to alert Bayesian deep learning researchers that many otherwise fine samplers may produce inaccurate or bogus posteriors in these deep models, and that all these computation errors should be taken into account in prediction evaluations.

But the intemperate rhetoric about optimality is real

The discussion below is not directly related to that paper. But an even more alarming and less understood question is: how good is the predictive performance of the Bayes posterior in a general model when it is computed accurately?

As far as I know, Bayes procedures do not automatically improve prediction or calibration over a point estimate such as the MAP.

I collected a few paradoxical examples when I wrote our old paper on variational inference diagnostics. Without the need for a residual net, even in linear regression I can find examples in which the exact Bayesian posterior leads to worse predictive performance than ADVI (reproducible code is available upon request). This is not a pathological edge case. The data are simulated from the correctly specified regression model with n=100 and d=20, exactly the kind of data one would simulate for a linear regression. The posterior is sampled using Stan and is exact as measured by all diagnostics. The predictive performance is evaluated using the log predictive density on independent test data, averaged over a large number of replications of both data and sampling to eliminate all other noise. But still, the log predictive density from ADVI is higher than from the exact posterior.
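
Since the original code is only available on request, here is a hypothetical numpy sketch of how such a comparison is set up. I assume a known noise sd of 1 for simplicity and take the N(0,2) prior mentioned below as having variance 2; the draws `beta_draws_stan` and `beta_draws_advi` would come from Stan's NUTS sampler and from ADVI respectively, which I do not show here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 100, 20, 1.0                      # as in the post: n=100, d=20
beta_true = rng.normal(0.0, np.sqrt(2.0), d)    # coefficients drawn from the N(0, 2) prior (taken as variance 2)
X = rng.normal(size=(n, d))                     # unit-scale inputs
X_test = rng.normal(size=(n, d))
y = X @ beta_true + sigma * rng.normal(size=n)
y_test = X_test @ beta_true + sigma * rng.normal(size=n)

def mean_test_lpd(beta_draws):
    """Mean log predictive density on the test set, averaging the Gaussian
    predictive density over posterior (or approximate-posterior) draws."""
    mu = X_test @ beta_draws.T                  # n_test x n_draws matrix of predictive means
    logp = (-0.5 * ((y_test[:, None] - mu) / sigma) ** 2
            - 0.5 * np.log(2 * np.pi * sigma ** 2))
    lpd_per_point = np.logaddexp.reduce(logp, axis=1) - np.log(beta_draws.shape[0])
    return lpd_per_point.mean()

# beta_draws_stan = ...  # draws from the exact posterior (e.g., Stan's NUTS), not shown
# beta_draws_advi = ...  # draws from the ADVI approximation, not shown
# The comparison in the post is mean_test_lpd(beta_draws_advi) vs. mean_test_lpd(beta_draws_stan),
# averaged over many replications of both the data and the sampling.
```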

In this experiment, I also checked that the ADVI approximation has a large discrepancy from the exact posterior, revealed by a large k-hat from our PSIS diagnostic. So it is basically a somewhat “cold” posterior in the sense of underdispersion.

At the immediate level, the underdispersion from variational inference can serve as an implicit prior, which might give the model more regularization and therefore improve prediction. For the record, I already encode an N(0,2) prior on all regression coefficients in that example, with all unit-scale inputs. In general, however, many complicated models used in practice lack informative priors, and that could be why stronger regularization, via a “colder” posterior, can help, although it is more advisable to come up with a reasonable informative prior directly rather than tune the “temperature”, for the sake of both interpretation and a coherent workflow.

Secondly, a model or regularization that is good for point estimation is not necessarily good for Bayesian posteriors. Recall that the Bayesian lasso gives inferior performance to the horseshoe, even though the lasso penalty is often what is wanted for point estimation. This is also an example in which the regularization effect of the prior cannot be reduced to a temperature rescaling, as the horseshoe has both a thicker tail and more mass near zero than the Laplace prior. In the deep learning context, are the network architecture and all the implicit regularizations such as dropout, which were motivated by, designed for, and in many cases optimally tuned toward MAP estimates, necessarily good enough for full posteriors? We do not know.
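
For concreteness (these are the standard forms, not anything specific to that example): the Bayesian lasso puts independent double-exponential priors on the coefficients, $latex \beta_j \sim \mathrm{Laplace}(0, b)$, while the horseshoe uses a local-global scale mixture, $latex \beta_j \mid \lambda_j, \tau \sim \mathrm{N}(0, \lambda_j^2\tau^2)$ with $latex \lambda_j \sim \mathrm{C}^+(0,1)$. The half-Cauchy local scales give the marginal prior an infinite spike at zero together with polynomial tails, which is why no single power/temperature rescaling of the Laplace prior can reproduce it.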

Coincidentally,  Andrew and I recently wrote a paper on Holes in Bayesian Statistics:

It is a fundamental principle of Bayesian inference that statistical procedures do not apply universally; rather, they are optimal only when averaging over the prior distribution. This implies a proof-by-contradiction sort of logic … it should be possible to deduce properties of the prior based on understanding of the range of applicability of a method.

In short, it is not too surprising that the exact Bayesian posterior can give worse predictions than the MAP or VI, if we are merely using a black-box model and treating it as is. But,

This does not mean that we think Bayesian inference is a bad idea, but it does mean that there is a tension between Bayesian logic and Bayesian workflow which we believe can only be resolved by considering Bayesian logic as a tool, a way of revealing inevitable misfits and incoherences in our model assumptions, rather than as an end in itself.

 

P.S. (from Andrew): Yuling pointed me to the above post, and I just wanted to add that, yes, I do sometimes encounter problems where the posterior mode estimate makes more sense than the full posterior. See, for example, section 3.3 of Bayesian model-building by pure thought, from 1996, which is one of my favorite articles.

As Yuling says, the full Bayes posterior is the right answer if the model is correct—but the model isn’t ever correct. So it’s an interesting general question: when is the posterior dominated by a mode-based approximation? I don’t have a great answer to this one.

Another good point made by Yuling is that “the posterior” isn’t always so clearly defined, in that, in a multimodal setting, the computed posterior is not necessarily the same as the mathematical posterior from the model. Similarly, in a multimodal distribution, “the mode” isn’t so clearly defined either. We should finish our paper on stacking for multimodal posteriors.

Lots of good questions here, all of which are worth thinking about in an open-minded way, rather than as a Bayes-is-good / Bayes-is-bad battle. We're trying to do our part here!

25 thoughts on “How good is the Bayes posterior for prediction really?”

  1. I think they clearly show that they are using a bad prior (sections 4.2, 5.2 and appendix K). Annoyingly they don't mention that in the abstract or in the introduction. At first I thought they were unaware of MacKay's and Neal's work in the 1990s, but in Appendix K they cite Neal and then ignore his work, and they don't consider testing priors with hyperparameters but test only fixed priors. Specifically, using the same prior scale for all weight layers is a big problem, as it doesn't work well even with just a one-hidden-layer network. It's not sufficient to take into account just the fan-out scaling. The paper is useful as a baseline, but they could do much better. The computation for deeper networks is more difficult, but it hasn't been shown yet that hierarchical priors would not work reasonably well if the computation works.

  2. From the “holes” paper: “ If classical probability theory needs to be generalized to apply to quantum mechanics, then it makes us wonder if it should be generalized for applications in political science, economics, psychometrics, astronomy, and so forth.”

    I think Betteridge's law of headlines is applicable to the questions in the paragraph following that quote. I doubt that the recent discussion about eligibility, for example, would be helped by considering a superposition of Democratic candidates.

    Anyway, people interested in a Bayesian inference treatment of quantum mechanics might find this paper interesting: “The Entropic Dynamics approach to Quantum Mechanics” by Ariel Caticha, https://www.mdpi.com/1099-4300/21/10/943

      • On the first section about quantum probability, there is actually a set of work using quantum probability models to explain human judgment and especially context effects that are not easily accounted for by classical probability as your article describes. Some big names in that field are Jerome Busemeyer, Jennifer Trueblood, and Emmanuel Pothos, though there are certainly many others working in that vein (Peter Bruza is another name that comes to mind; for the record, this is just off the top of my head, so apologies to anyone I am unfairly leaving out).

        Anyway, just to show that people really are taking that issue seriously and using it productively to build different types of useful models for “macro” systems.

        • It’s funny that physicists would love to get rid of all that quantum weirdness and psychologists feel they need. I’m sure they’re having a lot of fun, though.

        • typo: “psychologists feel they need it.”

          By the way, maybe this blog has conditioned me into thinking that anything published in the Proceedings of the National Academy of Sciences is junk science, but it's hard to take seriously something titled “Context effects produced by question orders reveal quantum nature of human judgments”: https://www.pnas.org/content/111/26/9431

        • Carlos:

          From the abstract of the linked paper, I see this:

          However, quantum theory makes a universal, nonparametric prediction for differing outcomes when two successive questions (e.g., attitude judgments) are asked in different orders. Quite remarkably, this prediction was strongly upheld in 70 national surveys carried out over the last decade (and in two laboratory experiments) and is not one derivable by any known cognitive constraints. The findings lend strong support to the idea that human decision making may be based on quantum probability.

          I guess I’ll have to read the whole paper . . . but, based on the abstract, it looks like a lot of hype and B.S. Their pet theory was “strongly upheld” 70 times, huh? Quite remarkable, indeed.

        • Interesting discussion I ran into. I don't know enough about these topics, but I thought the original article was thought-provoking. It's applying a new type of probability theory to model human cognition and behavior. This is really nothing so different from what we have been doing in the SBS for decades, except that in the latter we work only with classical probability, or, for most researchers, only some simpler statistical tools. So I found this quite interesting and plan to read more about it.

          Anyway, as a person learning/working in the fields of SBS, I would recommend you read the short paper and related background (as it seems you know little about SBS research) before making comments. Don't judge a book by its cover, especially if it is not in your field or not what you normally like. Sure, maybe that takes too much trouble compared to writing this arrogant comment–sorry, but that's what I strongly felt. You can just belittle all those who are trying to find patterns in human minds and behaviors, and say, well, we can never understand these things as they are not simple particles in the end. Wait. Maybe you indeed believe we are just a bunch of tiny particles. Well, I don't believe that.

        • Student:

          Regarding, “Don’t judge a book by its cover,” I’m not the one who wrote that title and that abstract. Authors are responsible for their titles and abstracts. I’m pretty sure that lots more people read the title and abstract than read the whole paper. Authors have the right to put whatever they want in their titles and abstracts, and readers such as myself have every right to criticize those titles and abstracts.

          If there’s useful stuff in this research, that’s great. It wouldn’t be the first time that a good idea is presented with some hype.

        • Definitely agree on the goofy title, though I think a lot of the goofiness comes from the inadvertent association with “quantum consciousness” and other such mumbo-jumbo. Really they’re just using a different way of representing and updating probabilities, so it’s just the mathematical formalism that is “quantum”.

          As Lakeland says below, chances are you could come up with a classical model too, the question is whether any nonlocal terms would be hard to explain/justify; quantum probability might well be a more elegant description, but in the end it’s still just a description.

        • Too many physicists have swallowed the kool aid…

          The Bell inequalities were almost universally taken to mean that no hidden variable theories were possible. But Bell himself didn't think that. His theorem actually means that *no local hidden variable theories* are possible. He himself felt that the solution was to reject locality and embrace nonlocality.

          If you stick to a non-relativistic theory, the problem is “solved”, in that quantum mechanics is not probabilistic at all, it is a deterministic nonlocal hidden variable process described by Bohm’s equations.

          Of course, relativistic theories require more work, but no-one is putting in that work, because they’ve all swallowed the weirdness kool-aid.

          There is a definite experiment which it would be very good to do: send three spacecraft into space… In the center one, entangle two photons and fire them off to each of the other spacecraft in opposite directions. In those spacecraft have some pseudo-random code that switches the detector polarization constantly… have the two detectors move farther and farther away from the source, and see if the entanglement effect goes away at some radius. This would determine if there is a finite but faster-than-light propagation of “quantum information” which would of course appear to be “nonlocality”.

        • Hi Daniel (and anyone else into this stuff), Lee Smolin’s latest book “Einstein’s Unfinished Revolution” has an interesting take on the ‘realist’ alternative interpretations of QM starting from de Broglie and going through Bohmian approach…and ultimately why they are still pretty unsatisfying.

        • Can you give the gist of his argument on “satisfaction”? I mean, Bohm’s theory is obviously not a full and complete theory, but non-realism is like a repudiation of existence, it’s **super** unsatisfying. I’d argue we just need to accept the idea of nonlocality and start probing its implications and this is what is going to move the needle on QM…. reading the reviews on amazon one reviewer makes it sound like Smolin’s more or less on the same page?

          https://www.amazon.com/gp/customer-reviews/R3CYDETEQAICUG/ref=cm_cr_dp_d_rvw_ttl?ie=UTF8&ASIN=B07FLK72XC

  3. A quick terminological question: what is “cross-extropy” (this one may just be a typo, but the ‘x’ key is far from the ‘n’ key so I’m not sure)? Google didn’t help.

  4. The key is, what does it mean for the posterior to be good? The Bayesian posterior tells you which regions of parameter space are not ruled out by your assumptions given your data. If your likelihood rates outcomes in a reasonable way and your prior rates assumptions about parameter values in a reasonable way, then your posterior rates parameter values that could have led to your data under your assumptions. You can go wrong with bad assumptions, you can go wrong with bad sampling/computation, but you aren't going to go wrong with prediction per se. If you get bad predictions, it will be a consequence of something going wrong at the earlier stages.

  5. From Troubling Trends in Machine Learning Scholarship

    “Recently, Melis et al. [54] demonstrated that a series of published improvements, originally attributed to complex innovations in network architectures, were actually due to better hyper-parameter tuning. On equal footing, vanilla LSTMs, hardly modified since 1997 [32], topped the leaderboard. The community may have benefited more by learning the details of the hyper-parameter tuning without the distractions. Similar evaluation issues have been observed for deep reinforcement learning [30] and generative adversarial networks [51]. See [68] for more discussion of lapses in empirical rigor and resulting consequences.”

    I don’t know for certain, but it feels like that paper is doing a fairly standard practice: demonstrate the superiority of your well-tuned pet solution by a favorable comparison to a standard approach done poorly.

    On the linear regression example where variational point estimates yield higher log posterior density, do they also yield lower OOS MSE? I wonder if it has something to do with the measure of performance one optimizes for.

    • “I don’t know for certain, but it feels like that paper is doing a fairly standard practice: demonstrate the superiority of your well-tuned pet solution by a favorable comparison to a standard approach done poorly.”
      Are you still referring to the paper that is the subject of this blog post, or another paper?

      Our paper is not trying to prescribe a specific approach. Rather it is trying to get to the bottom of a phenomenon that we felt is underappreciated (at least in the deep learning community). I do believe the hyperparameters of all algorithms were carefully tuned according to the exact same procedure for all methods.

  6. General:

    To those who run the website – it is not possible to comment on mobile for me, browsing in Chrome with an android phone, as when I type the comment box immediately becomes large, making the submit comment button impossible to click. This applies to both top level comments and replies to other comments.

    Substance:

    It’s important to note that this paper is specifically analyzing the performance of cooled priors in the context of ensemble learning. Since the effect of the prior is to smooth out disagreements between base level models while ensembles often benefit from disagreements between base level models, this may be responsible for the effect. There are also complications specific to neural networks. For example, French 1993 argues that “sharpened” base level models that are more particularized by the data result in ensembles that are less vulnerable to catastrophic forgetting.

  7. I have a question about the following wording in the blog post:

    > The data is simulated from the correctly-specified regression model with n=100 and d=20

    That VI can be more accurate when the model is mis-specified does not surprise me. But that it is more accurate when the model is specified correctly is a bit distressing!

    Going to the linked paper, in figure 3 we have the mentioned phenomenon. But in section 4.2 the model is specified as Y = Bernoulli(logit^-1 (beta * X)) , with a flat prior on beta.

    My question is: how can beta be sampled from a flat prior when generating the data? It’s not even a proper prior. And, if beta is not sampled from the flat prior, is the model not *mis-*specified?
