David Duvenaud writes:
I’ve been following your recent discussions about how an AI could do statistics [see also here]. I was especially excited about your suggestion for new statistical methods using “a language-like approach to recursively creating new models from a specified list of distributions and transformations, and an automatic approach to checking model fit.”
Your discussion of these ideas was exciting to me and my colleagues because we recently did some work taking a step in this direction, automatically searching through a grammar over Gaussian process regression models.
Roger Grosse previously did the same thing, but over matrix decomposition models, using held-out predictive likelihood to check model fit.
These are both examples of automatic Bayesian model-building by a search over more and more complex models, as you suggested. One nice thing is that both grammars include lots of standard models for free, and they seem to work pretty well, although the search is of course computationally expensive.
The ubiquitous Josh Tenenbaum adds:
Just to chime in on one point here: these methods might seem “computationally expensive” as David says, perhaps compared to what people are used to when they build or fit only one or a small number of models. But when you consider the size and scope of the space of models that is searched, and the fact that all steps of model construction, evaluation and search are automatic, it doesn’t seem like such an expensive process. In my experience, working statisticians, machine learners, and data scientists rarely if ever explore such a space so systematically, in large part because it seems impractically expensive to do so (in terms of both their own time and computation time, as well as perhaps other scarce resources). Of course the “AI” in our work is still quite primitive and naive, both in terms of good modeling methods as you have developed and taught, and in terms of human intelligence more generally. And the space of models we can consider automatically is still quite limited compared to what humans can do. There is a lot to improve on here. But in evaluating the limitations of these methods, and prospects for future work of this sort, I think other factors might loom larger than computational efficiency.
Here’s the abstract of their paper:
Despite its importance, choosing the structural form of the kernel in nonparametric regression remains a black art. We define a space of kernel structures which are built compositionally by adding and multiplying a small number of base kernels. We present a method for searching over this space of structures which mirrors the scientific discovery process. The learned structures can often decompose functions into interpretable components and enable long-range extrapolation on time-series datasets. Our structure search method outperforms many widely used kernels and kernel combination methods on a variety of prediction tasks.
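To make "adding and multiplying a small number of base kernels" concrete, here is a minimal sketch of a greedy search over such a grammar. It is not the authors' implementation. The choice of base kernels, the fixed hyperparameters, the fixed noise level, the use of the Gaussian process log marginal likelihood as the score, and all function names are assumptions made for illustration. A real search would fit hyperparameters for each candidate and would likely penalize complexity.

```python
import numpy as np

# Base kernels on 1-D inputs. Hyperparameters are held fixed (an assumption;
# a real search would optimize them for every candidate structure).
def se(x1, x2, ell=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def per(x1, x2, period=1.0, ell=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell ** 2)

def lin(x1, x2):
    return x1[:, None] * x2[None, :]

BASE = {"SE": se, "PER": per, "LIN": lin}

def evaluate(expr, x1, x2):
    """Evaluate a kernel expression: a base name or a tuple (op, left, right)."""
    if isinstance(expr, str):
        return BASE[expr](x1, x2)
    op, left, right = expr
    a, b = evaluate(left, x1, x2), evaluate(right, x1, x2)
    return a + b if op == "+" else a * b

def log_marginal_likelihood(expr, x, y, noise=0.1):
    """GP log marginal likelihood of y under the kernel expr plus white noise."""
    K = evaluate(expr, x, x) + noise ** 2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # sum(log(diag(L))) equals 0.5 * log det(K)
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(x) * np.log(2 * np.pi)

def expand(expr):
    """Grammar step (simplified): add or multiply the whole expression by a base kernel."""
    return [(op, expr, b) for op in ("+", "*") for b in BASE]

def greedy_search(x, y, depth=2):
    """Start from the best base kernel; keep expanding while the score improves."""
    best = max(BASE, key=lambda b: log_marginal_likelihood(b, x, y))
    best_score = log_marginal_likelihood(best, x, y)
    for _ in range(depth):
        candidates = expand(best)
        scores = [log_marginal_likelihood(c, x, y) for c in candidates]
        i = int(np.argmax(scores))
        if scores[i] <= best_score:
            break
        best, best_score = candidates[i], scores[i]
    return best, best_score
```

Calling `greedy_search(x, y)` on a one-dimensional series returns a nested tuple such as `("+", "PER", "LIN")` together with its score. Because sums and products of valid kernels are again valid kernels, every expression the grammar produces is a legitimate covariance function, which is also why familiar models show up as special cases.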
I can’t comment on the details, especially as this sort of predictive regression problem isn’t the thing I typically work on, but I like the general idea of constructing models through some sort of generative grammar. It seems to me a big step forward from the previous graphical-model paradigm in which the model is a static mixture of a bunch of conditional independence structures on a fixed set of variables. As I’ve written many times (for example, with Shalizi in our instant-classic paper, rejoinder here), I think discrete Bayesian model averaging is a poor model for science and a poor model for statistical inference. This open-ended approach smells right to me. It’s possible that all that horrible graphical model-averaging stuff was a necessary stage that statisticians and cognitive scientists had to go through on the way to models of generative grammar.
I feel so lucky to be around during this exciting era. Imagine being stuck with formalisms such as Wald’s and Savage’s hopeless attempts to shoehorn statistical reasoning into the formats of decision theory and game theory. Those guys were brilliant but they just didn’t have the tools to do the job. Not that I think today’s researchers have the last word, by any means, but it’s so satisfying to see forward motion in modeling, computing, and also conceptual frameworks.
P.S. Andrew Wilson writes:
Readers may be interested in our closely related work (with Ryan P. Adams), introducing covariance kernels which enable automatic pattern discovery and extrapolation with Gaussian processes. We discuss some of the initial motivations for machine learning, and AI-inspired statistics. Our method is computationally simple (comparable to using standard smoothing kernels), and is grounded in modelling a spectral density with a Gaussian mixture. With enough components in the mixture, we can approximate any spectral density (and thus any stationary covariance kernel) with arbitrary accuracy. We show that the proposed method can automatically discover complex structure and extrapolate over long ranges. However, our approach to automatic structure discovery is fundamentally different from the discussed “grammar of kernels” approach and previous related kernel composition approaches.
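For readers who want to see what "modelling a spectral density with a Gaussian mixture" amounts to, here is a minimal one-dimensional sketch. It uses the standard fact that a stationary kernel is the Fourier transform of its spectral density, so a symmetrized mixture of Gaussians in frequency gives a sum of Gaussian-damped cosines in the lag. The function name, argument names, and the example parameter values are assumptions for illustration, not the authors' code.

```python
import numpy as np

def spectral_mixture_kernel(tau, weights, means, variances):
    """Stationary 1-D kernel whose spectral density is a Gaussian mixture.

    tau       : array of lags x - x'
    weights   : mixture weights w_q (nonnegative)
    means     : spectral means mu_q (frequencies)
    variances : spectral variances v_q (inverse squared length-scales)
    k(tau) = sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q)
    """
    tau = np.asarray(tau, dtype=float)[..., None]  # trailing axis for components
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    v = np.asarray(variances, dtype=float)
    terms = w * np.exp(-2.0 * np.pi ** 2 * tau ** 2 * v) * np.cos(2.0 * np.pi * tau * mu)
    return terms.sum(axis=-1)

# Example: one smooth component (mean frequency 0) plus one component
# concentrated near frequency 1, i.e. a roughly period-1 pattern.
x = np.linspace(0.0, 5.0, 30)
K = spectral_mixture_kernel(x[:, None] - x[None, :],
                            weights=[0.7, 0.3],
                            means=[0.0, 1.0],
                            variances=[0.1, 0.01])
```

A single component with mean zero reduces to the usual squared-exponential kernel, and the structure comes from fitting the weights, means, and variances, with no search over discrete kernel compositions.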