Here’s an interesting classification (from John Langford) of statistical methods from a machine learning perspective. The only thing that bothers me is the conflation of statistical principles with computational methods. For example, the table lists “Bayesian learning,” “graphical/generative models,” and “gradient descent” (among others) as separate methods. But gradient descent is an optimization algorithm, while Bayesian decision analysis is an approach that tells you what to optimize (given assumptions). And I don’t see how he can distinguish graphical/generative models from “pure Bayesian systems.” Bayesian models are almost always generative, no? This must be a language difference between CS and statistics.

Anyway, it’s an interesting set of comparisons. And, as Aleks points out, we’re trying our best to reduce the difficulties of the Bayesian approach (in particular, the difficulty of setting up models that are structured enough to learn from the data but weak enough to learn from the data).

I completely agree. That is the same reaction I had when reading John's post. And I am in CS :)

"that are structured enough to learn from the data but weak enough to learn from the data"

;-)

BTW, I'm more on the CS side of the force, and I agree that his classification is somewhat strange, putting theoretical frameworks and optimisation tools on the same level.

Maybe you could put a comment on his blog to raise this issue?

I also concur. I come from the statistics side but am quite familiar with CS methods. Statistical principles are really a layer above the algorithms. The good news is that there is so much statisticians can contribute to this area. In the meantime, a lot is being re-invented.

The issue is framed quite nicely in the conference announcement sent by Aleks to this blog the other day. From my conference experience, a typical machine learning paper starts by defining a problem, then stating the assumptions, then describing an algorithm, then showing the results from one or two case studies and finally announcing the success of the method.

Why should the problem be thus defined?

How can the assumptions be checked?

Did the assumptions drive the algorithm or did the algorithm drive the assumptions?

How are the algorithms judged? Are they optimal? Do they even converge? What is the complexity?

How are we to know that the algorithm works in general? Beyond the specifics of the case studies?

Or, on what class of problems would this algorithm work?

Are there formal proofs of effectiveness? Do computational examples constitute proof?

Lots of questions but few answers in this fast-developing field.

A big area in machine learning is Gaussian process methods, which are Bayesian but not (fully) generative: the inputs x are assumed given, known, and fixed, and the labels are assumed to be generated as a function of x plus noise.
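A minimal numeric sketch of that discriminative setup, under assumptions of our own (the toy data, the RBF kernel, and the noise level sigma_n are all made up for illustration, not anyone's actual model): the posterior over f is conditioned on the labels, while the inputs carry no distribution at all.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    # Squared-exponential covariance between two 1-D point sets.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

x_train = np.array([-2.0, -1.0, 0.0, 1.5])   # inputs: given, known, fixed
y_train = np.sin(x_train)                     # labels: a function of x (noise-free here)
x_test = np.linspace(-3.0, 3.0, 20)
sigma_n = 0.1                                 # assumed observation-noise std

K = rbf(x_train, x_train) + sigma_n**2 * np.eye(len(x_train))
K_s = rbf(x_test, x_train)
# Posterior mean of f at the test inputs, conditioned on the labels only;
# nothing here models how the inputs themselves arose.
mean = K_s @ np.linalg.solve(K, y_train)
```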

Gaussian processes provide a way of specifying probability distributions over functions. I'd say they are fully generative, even though the things they generate (functions) do not have finite representations.
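That "distribution over functions" view can be seen directly by sampling from a GP prior. A hedged sketch (the grid, kernel, and jitter term are illustrative choices, not anyone's specific model): each draw is one random function, represented by its values on a finite grid, jointly Gaussian with the kernel as covariance.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    # Squared-exponential covariance; nearby inputs get highly correlated values.
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / length_scale) ** 2)

x = np.linspace(-3.0, 3.0, 50)                 # finite grid standing in for the input line
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability

# Each draw is one "function": a jointly Gaussian vector of its values on the grid.
rng = np.random.default_rng(0)
functions = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```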

The discriminative use of Gaussian processes you describe might be more common.

But nothing prevents you from also having a generative model of the data.