“Laplace’s Demon: A Seminar Series about Bayesian Machine Learning at Scale” and my answers to their questions

Here’s the description of the online seminar series:

Machine learning is changing the world we live in at a breakneck pace. From image recognition and generation, to the deployment of recommender systems, it seems to be breaking new ground constantly and influencing almost every aspect of our lives. In this seminar series we ask distinguished speakers to comment on what role Bayesian statistics and Bayesian machine learning have in this rapidly changing landscape. Do we need to optimally process information or borrow strength in the big data era? Are philosophical concepts such as coherence and the likelihood principle relevant when you are running a large scale recommender system? Are variational approximations, MCMC or EP appropriate in a production environment? Can I use the propensity score and call myself a Bayesian? How can I elicit a prior over a massive dataset? Is Bayes a reasonable theory of how to be perfect but a hopeless theory of how to be good? Do we need Bayes when we can just A/B test? What combinations of pragmatism and idealism can be used to deploy Bayesian machine learning in a large scale live system? We ask Bayesian believers, Bayesian pragmatists and Bayesian sceptics to comment on all of these subjects and more.

The audience is machine learning practitioners and statisticians from academia and industry.

Here are my answers to the questions above:

Do we need to optimally process information or borrow strength in the big data era?

Big Data need Big Model. Big Data are typically convenience samples, not random samples; observational comparisons, not controlled experiments; available data, not measurements designed for a particular study. As a result, it is necessary to adjust to extrapolate from sample to population, to match treatment to control group, and to generalize from observations to underlying constructs of interest. Big Data + Big Model = expensive computation, especially given that we do not know the best model ahead of time and thus must typically fit many models to understand what can be learned from any given dataset.

My point here is that even when it seems we have “big data,” we still run into data limitations. We might have zillions of observations, but if we’re interested in predictions for next year, what’s relevant is that we only have two past years of data. Etc.

Are philosophical concepts such as coherence and the likelihood principle relevant when you are running a large scale recommender system?

Coherence and the likelihood principle are not philosophical concepts; they’re mathematical concepts! They’re properties of certain statistical methods. One reason these concepts are relevant is because it’s useful to understand where they don’t apply.

“Bayesian inference” from a fixed model (with a proper prior distribution) satisfies coherence and the likelihood principle; real-world “Bayesian data analysis” does not have these properties. Coherence is destroyed by the iterative process of model building, checking, and improvement. The likelihood principle is destroyed by posterior predictive checking—as discussed in chapter 6 of BDA, the predictive distribution for new data depends on the entire generative model for the data, not just the likelihood, which is conditional on the observed data alone.

Are variational approximations, MCMC or EP appropriate in a production environment?

Yes, of course. And in a production environment we should be continually testing these using fake-data simulation.
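To make the fake-data idea concrete, here is a minimal sketch of the kind of check I have in mind. It assumes a toy conjugate normal–normal model with known data variance (so the posterior is available in closed form); the function name and all parameter settings are illustrative, not from any particular production system. The logic is: draw a “true” parameter from the prior, simulate fake data from it, fit the model, and check that the nominal 90% posterior intervals actually cover the truth about 90% of the time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fake-data check: simulate data from known parameters,
# fit the model, and verify the posterior covers the truth at the
# nominal rate. Uses a conjugate normal-normal model with known sigma.
def fake_data_check(n_sims=1000, n_obs=50, sigma=1.0,
                    prior_mean=0.0, prior_sd=2.0):
    covered = 0
    for _ in range(n_sims):
        # Draw the "true" parameter from the prior, then fake data from it.
        theta_true = rng.normal(prior_mean, prior_sd)
        y = rng.normal(theta_true, sigma, size=n_obs)
        # Closed-form posterior for the normal mean (known sigma).
        post_prec = 1 / prior_sd**2 + n_obs / sigma**2
        post_mean = (prior_mean / prior_sd**2 + y.sum() / sigma**2) / post_prec
        post_sd = post_prec ** -0.5
        # Does the central 90% posterior interval cover the truth?
        lo, hi = post_mean - 1.645 * post_sd, post_mean + 1.645 * post_sd
        covered += lo <= theta_true <= hi
    return covered / n_sims

print(fake_data_check())  # should be close to 0.90 if the inference is right
```

In production the fit would be a variational approximation, MCMC, or EP rather than a closed-form posterior, but the check is the same: if the computed coverage drifts away from the nominal rate, the approximation (or the code) is broken.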

Can I use the propensity score and call myself a Bayesian?

Sure. You’re doing Bayesian inference conditional on some data summaries.

How can I elicit a prior over a massive dataset?

The same way you elicit a data model or a statistical procedure. Work from first principles or take what was done before and modify it.

Is Bayes a reasonable theory of how to be perfect but a hopeless theory of how to be good?

No. You got it backward. Bayes is a reasonable theory of how to be good but a hopeless theory of how to be perfect.

Do we need Bayes when we can just A/B test?

Huh? “Bayes” and “A/B test” are not parallel. An A/B test is a data-collection scheme; Bayes is a method of data analysis. I do think that Bayes is extremely relevant for the analysis of A/B tests, because the point of an A/B test is to make a decision. Or, maybe I should say, hierarchical Bayes is extremely relevant for the analysis of a series of A/B tests, because the point of a series of A/B tests is to make a series of decisions.
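A minimal sketch of what hierarchical Bayes buys you here, under toy assumptions I am making up for illustration: each test i produces an estimated effect y[i] with standard error se[i], the underlying effects are treated as exchangeable draws from Normal(mu, tau), and mu and tau are taken as fixed rather than given their own priors. The function name and the numbers are hypothetical.

```python
import numpy as np

# Hypothetical partial pooling across a series of A/B tests.
# y[i]: raw estimated effect from test i; se[i]: its standard error.
# Assumes effects are exchangeable: theta_i ~ Normal(mu, tau),
# with mu and tau fixed for simplicity (a full analysis would
# estimate them, e.g. with a hierarchical model in Stan).
def partial_pool(y, se, mu=0.0, tau=0.1):
    y, se = np.asarray(y, float), np.asarray(se, float)
    # Shrinkage weight: precise tests (small se) keep their estimate;
    # noisy tests (large se relative to tau) get pulled toward mu.
    w = tau**2 / (tau**2 + se**2)
    return w * y + (1 - w) * mu

raw = [0.30, -0.05, 0.02]   # three raw A/B test effect estimates
errs = [0.25, 0.02, 0.10]   # their standard errors
print(partial_pool(raw, errs))
```

The noisy 0.30 estimate gets shrunk most of the way toward zero, while the precise -0.05 estimate barely moves: that is the sense in which a series of tests borrows strength for a series of decisions.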

What combinations of pragmatism and idealism can be used to deploy Bayesian machine learning in a large scale live system?

This one I know nothing about.

3 Comments

  1. David Rohde says:

    Thanks so much for your interesting answers, we are really looking forward to your 26 August talk!

    For context:
    “Can I use the propensity score and call myself a Bayesian?”
    is a reference to https://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/

    “Is Bayes a reasonable theory of how to be perfect but a hopeless theory of how to be good?”
    is a reference to http://www.senns.demon.co.uk/You_may_believe_you_are_a_Bayesian.pdf

    “Do we need Bayes when we can just A/B test?”
    could have been clearer. The idea is: what is the trade-off between a (possibly Bayesian) modelling effort to determine whether A is better than B, and just trying it? Bayesian approaches to A/B testing seem like a very interesting topic, however!
