Pablo Alcain writes:
I wanted to know your thoughts regarding Gaussian Processes as Bayesian Models. For what it’s worth, here are mine:
What draws me the most to Bayesian inference is that it’s a framework in which the statistical modeling fits very nicely. Coming from a natural science background (Physics), the interpretability of the results for me is tightly related to the modeling itself. Gaussian processes (or, for what it’s worth, any non-parametric model) tend to defeat that purpose. Of course, there are some mild interpretation one can do (like “characteristic covariation length” in quadratic kernels), but somehow doesn’t feel quite enough. It’s no surprise for me, from this perspective, that they are so widely used in hyperparameter tuning in ML models, where it’s very hard to have intuitions about the underlying processes that generate the distribution for the regression. McElreath also gives an example I enjoy in his book, when he uses GP to model correlation in tools production in the Pacific islands, depending on the distance that the islands have. I think that they are a super powerful interpolation technique, with much more niceties than splines, for example. But using them for any regression, to me, feels kind of anti-Bayesian even though it’s formally in the prior-data-posterior loop.
I find them also pretty misleading when you have a high-noise setting, and find very hard to figure out whether they are overfitting or not; but probably this last one is because of my super shallow experience with them.
Anyway, just wanted to know what you think about GP in general. I tried to find mentions to it in the statmodeling blog, and I couldn’t find many posts in the recent years. I found however this comment from you: “Yes, we’ve found that when fitting Gaussian processes, it’s helpful to use strong prior distributions on the hyperparameters.”
My reply: I’ve only actually fit Gaussian processes twice. The first time I didn’t even fit the model, it was Aki who did it, so I’m only claiming (partial) credit for it because it’s in BDA3 and I looked over the results. The second time was as an exercise to predict future elections, and that’s where I learned the lesson about needing to specify the length scale or use a strong prior distribution for it.
Here’s what I wrote before, in the context of the election model:
For Gaussian processes it can be tricky to estimate length-scale parameters without including some regularization. In this case I played around with a few options and ended up modeling each state and each region as the sum of two Gaussian processes, which meant I needed short and long length scales. After some fitting, I ended up just hard-coding these at 8 and 3, that is, 32 years and 12 years. The long scale is there to allow low-frequency trends which will stop the model from automatically regressing to the mean when extrapolated into the future; the short scale fits the curves you can see in the data. . . .
The model is complicated in a statistical sense in that it has state, regional, and national levels; but it’s “dumb” in that it uses nothing more than past vote totals and a forecast of the 2016 vote; this model does not account for demographic trends. So, for example, our model does not “know” that New Mexico has so many Latinos; it just sees that New Mexico has gradually been trending Democratic in recent decades. And our model does not “know” to associate Mitt Romney’s religion with the Republican spike in Utah in 2012. And so on. So in that sense we see this model as a baseline, a sort of blank slate or minimal starting point upon which further information can be added.
So, to answer Alcain’s question: I do think these models are Bayesian, even if only in the “default Bayesian” setting. Remember, there are two ways of thinking about a prior:
1. p(theta), the marginal distribution for the parameters in your model or, equivalently, a generative model for theta.
2. p(theta | y_previous), the state of knowledge about theta given previously-available information or, more generally, information external to that provided by the data in your current model.
In that second formulation, we can reformulate the prior as p(theta | y_previous) proportional to p(theta) * product_j p(theta | y_previous_j), including external data sources j=1,…,J. When thinking about it this way, you can see how it can make sense to include multiple “priors” in a single model, where each “prior” or line of Stan code corresponds to an approximation to the likelihood of theta given a different data source.
So, now, back to Gaussian processes: if you think of a Gaussian process as a background prior representing some weak expectations of smoothness, then it can be your starting point. Set up a model with a Gaussian process prior and then add more prior information as appropriate.
A very interesting post. I can’t believe no one else has commented but perhaps there is web site issue. In any case, I have heard similar concerns with respect to Bayesian Additive Regression Trees (BART) that is another popular BNP method. To me, it just depends on what you mean by Bayesianism. It is just the ability to create informative priors? Or is it the wide span of Bayes’ Theorem and all of the modern machinery that we love to hate like MCMC? I would say that the ability to add prior information is a very important part of Bayesianism, but it is only one of many parts. And how this particular part is applicable to BNP is an intriguing area of future research.
The idea that you need to argue bayesian vs frequentist is pretty much dead. It is recognized as a matter of practicality now. Which it always was, since one is a computationally efficient approximation of the other that works for some simple examples.
It is not a question of frequentist vs. Bayesian. This is parametric Bayesian vs. BNP, i.e., where informative prior information can easily be incorporated vs. where it is readily apparent how to do so.
For your second application, how did the model do at predicting 2020?
Alex:
It didn’t do particularly well or poorly. As discussed in the linked post, the model had wide intervals, which makes sense given that there had been wide vote swings in the elections that were used to fit it. The model was generic, with no information beyond past elections, so there’s no way it could predict future elections accurately; all it could do was give a general sense of possibilities.
Can someone expand on the how the factorization in the prior given previous data comes about?
p(theta | y_previous) proportional to p(theta) * product_j p(theta | y_previous_j)
I interpret this as coming from Bayes rule: p(theta | y_previous) proportional to p(theta) * p(y_previous | theta)
Does this the assumes conditional independence of previous data given theta, i.e., p(y_previous | theta) = product_j p(y_previous_j | theta) ? If so, can someone help filling the missing steps. I am not seeing it yet. Maybe I need a lot more coffee :)
Any help would be greatly appreciated.
I feel like the most interesting feature of Gaussian processes lies in the ability to design a kernel that’s “just right for the job”. The example in BDA3 is a nice one. Unfortunately designing kernels can be sufficiently challenging that in practice I feel like a lot of the time we end up defaulting to a squared exponential kernel for want of anything better. Also, there’s a real tension between exact GP with unacceptable computational cost for > 5,000 datapoints vs approximate GPs that scale well but lose a little (or a lot, depending on the approximation technique) in terms of performance.