Model checking and model understanding in machine learning

Posted on September 4, 2012 9:03 AM by Andrew

Last month I wrote:

Computer scientists are often brilliant but they can be unfamiliar with what is done in the worlds of data collection and analysis. This goes the other way too: statisticians such as myself can look pretty awkward, reinventing (or failing to reinvent) various wheels when we write computer programs or, even worse, try to design software.

Andrew MacNamara followed up with some thoughts:

I [MacNamara] had some basic statistics training through my MBA program, after having completed an undergrad degree in computer science. Since then I’ve been very interested in learning more about statistical techniques, including things like GLM and censored data analyses as well as machine learning topics like neural nets, SVMs, etc. I began following your blog after some research into Bayesian analysis topics and I am trying to dig deeper on that side of things.

One thing I have noticed is that there seems to be a distinction between data analysis as approached from a statistical perspective (e.g., generalized linear models) versus from a computer science perspective (e.g., SVM), even if—as I understand it—mathematically some of the results/algorithms are the same. Many of the computer scientists I work with approach a data analysis problem by throwing as many ‘features’ at the model as possible, letting the computer do the work, and trying to get the best-performing model as measured by some cross-validation technique. On the other hand, when I was taught basic regression, the philosophy of approaching a problem was to try to understand the model driving the data, carefully choose explanatory variables by their real-world importance as well as their statistical significance (or lack thereof—one needs to consider why variables one thought would be significant are not!) and testing for statistical issues that are known to cause problems with models or diagnostics (e.g., outliers, leverage points, non-normal residuals, etc.).

To me, a symptom of this difference in philosophies is that the machine learning software packages I have tried do not seem to output any statistics showing the relative importance or errors of the input features like I would expect from a statistical regression package. Of course given my lack of experience I could very well just be missing something obvious.

I wonder if you’ve experienced anything similar or had any thoughts on this. It seems like in the world of “big data,” machine learning techniques and philosophies are coming to dominate some types of data analysis, and I’m concerned about my impression of the depth of understanding for the problems I’ve seen it applied to—I hope the driverless car teams can predict how their models will react to new situations!

My reply:

The big difference I’ve noticed between the two fields is that statisticians like to demonstrate our methods on new examples whereas computer scientists seem to prefer to show better performance on benchmark problems. Both approaches to evaluation make sense in their own way; I just have the impression that stat and CS have evolved to have different priorities. To a statistician, a method is powerful when it generalizes to new situations. To a computer scientist, though, solving a new problem is no big deal—they can solve problems whenever they want, and it is through benchmarks that they can make fair comparisons.

Now to return to the original question: Yes, CS methods seem to focus on prediction while statistical methods focus on understanding. One might describe the basic approaches of different quantitative fields as follows:

Economics: identify the causal effect;

Psychology: model the underlying process;

Statistics: fit the data;

Computer science: predict.

The other issue is sample size. About ten years ago I had several meetings with a computer scientist here at Columbia who was working on interesting statistical methods. I was wondering if his methods could help on my problems, or if my methods could help on his. Unfortunately, we couldn’t see it. I was working with relatively small problems, maybe a survey with 10,000 data points, and he didn’t think his throw-everything-into-the-prediction approach would work well there. Conversely, it seemed impossible to apply my computationally-intensive hierarchical modeling methods with his huge masses of information. I still felt (and feel) that some of our ideas were transferrable to the others’ problems, but doing this transfer in either direction just seemed too difficult so we gave up.

Finally, to return to your question about checking and understanding models. I’ve long thought that machine-learning-style approaches would benefit from predictive model checking. When you see where your model doesn’t fit the data, this can give a sense of where and how to improve it. Then again, I’ve long thought that statistical model fits should be checked against data too, and a lot of statisticians (particularly Bayesians) have resisted this.

Generative, both conceptually and computationally

It’s particularly easy to check the fit of Bayesian models because they are generative, both conceptually and computationally: there is a probability model for new data, and you can (typically) just press a button to simulate from this generative model (conditional on draws from the posterior distribution of the fitted parameters).
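
To make the "press a button" step concrete, here is a minimal sketch in Python. Everything specific in it is an assumption chosen for illustration: a normal model with the standard noninformative prior, the hypothetical function name `posterior_predictive_pvalue`, the toy data, and the choice of test statistics. It draws parameters from the posterior, simulates one replicated data set per draw, and reports how often the replicated statistic is at least as large as the observed one.

```python
import numpy as np

def posterior_predictive_pvalue(y, stat, n_draws=2000, seed=0):
    """Posterior predictive p-value for a normal model (assumed for illustration).

    Prior p(mu, sigma^2) proportional to 1/sigma^2, so that
      sigma^2 | y      ~ scaled inverse chi-square(n - 1, s^2)
      mu | sigma^2, y  ~ Normal(ybar, sigma^2 / n)
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size
    ybar = y.mean()
    s2 = y.var(ddof=1)

    # Draws from the posterior distribution of the parameters.
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))

    # One replicated data set per posterior draw (rows are replications).
    y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(n_draws, n))

    # Compare the statistic on replicated data with the statistic on real data.
    t_rep = np.apply_along_axis(stat, 1, y_rep)
    return float(np.mean(t_rep >= stat(y)))

# Toy data: 99 zeros and one large value, which a normal model fits poorly.
y = np.concatenate([np.zeros(99), [50.0]])
p_max = posterior_predictive_pvalue(y, np.max)
p_mean = posterior_predictive_pvalue(y, np.mean)
print(p_max, p_mean)
```

A p-value near 0 or 1 flags an aspect of the data that the model fails to reproduce. In this toy case the maximum is flagged while the mean is not, because a normal model matches the sample mean by construction.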

Machine learning methods are not always generative, in which case the first step to model checking is the construction of a generative model corresponding to (or approximating) the estimation procedure.

I think some interesting work is being done in connecting these ideas, for example this paper by David Blei on posterior predictive checking for topic models.

13 thoughts on “Model checking and model understanding in machine learning”

A Legarra on September 4, 2012 at 9:51 am said:

One of the areas where statistical science is heavily used for prediction is animal and plant breeding, where we forecast “how good the offspring of a set of individuals (selected varieties of plants, bulls) will be”. Our jargon is “prediction” (hence BLUP).

Classically this was extremely parametric, I’d say mostly Bayesian (some disagree) with a large (ab)use of multivariate normality. Size of the problems is typically very large (from 10,000 to 100,000,000).

In recent years we have had these DNA chips (SNP chips) with tens to hundreds of thousands of explanatory variables and, set against them, tens of thousands of records. We predict using these chips; this is known as genomic selection or genomic evaluation or prediction.

Interestingly, *both* Bayesian models and CS stuff perform in practice almost equally well but, *in my opinion*, the community perceives statistical (Bayesian) models as more useful as they can be interpreted, “recycled” and extended upon i.e., to new problems. Also we can carry on old theory to new “data” like DNA. We have thus understanding *and* prediction.

CS tools need somehow to be re-invented or tailored at each new problem. They are though very useful for problems where we don’t have any clue.
- Andrew on September 4, 2012 at 12:55 pm said:
  
  A:
  
  This is not to disagree with anything you wrote above, but I think “Blup” is a horrible name; see here.
  - A Legarra on September 5, 2012 at 11:21 am said:
    
    A:
    
    I know, but when Henderson came up with it in the 60’s-70’s he could definitely not present it as Bayesian or hierarchical modeling.
    
    OTOH the terms Best, Linear, Unbiased, Prediction have a well-defined meaning in our jargon: it does predict in a linear manner (so it is easy to compute), it is best (in the sense of some optimality) and unbiased, where unbiased has large practical implications: it means, in practice, that some AI studs or countries selling bulls will not be systematically favored in the predictions throughout successive years of predictions.
  - John Mashey on September 5, 2012 at 12:16 pm said:
    
    I disagree.
    “Blup” Googles well (even with a hit for Blup Blup), so it is a fine name. It is very hard to find 4-letter names that have few other uses, are pronounceable (not just spellable), are only one syllable (and thus not prone to contraction), are not trademarked, and (presumably) have no unfortunate meaning in some language one cares about. (I haven’t checked the last out.)
    
    This is very good marketing; whoever came up with this would be welcome on product naming committees. It is a good strategy not to use a phrase and then make an acronym, but to find a good acronym and then create a phrase that somehow fits.
    
    “Hierarchical modeling”: 8 syllables, no simple contraction, obviously created by techies, although not as bad as “Silicon Graphics Origin 2000”, 11 syllables, which customers often called an SGI O2000 or SGI O2K.
    
    Short, pithy names work. For instance, people have used all sorts of names for large data sets, but it appears that the simple phrase “big data” has gotten popular. For part of that history, see Origins in Big Data: Talk.
- K? O'Rourke on September 4, 2012 at 1:49 pm said:
  
  > they can be interpreted, “recycled” and extended upon i.e., to new problems
  
  I think that’s due to there being a greater emphasis in stats on understanding uncertainty/variation or at least having representations (preferably mathematical) for it.
  
  In CS it’s likely good enough to just perform well against benchmarks?
  
  And Bayesian models being generative, both conceptually and computationally, are perhaps the easiest to understand as the representation is constructive or machine like:
  generate unknowns and possible knowns from a joint distribution and only keep where possible knowns = knowns in the current application and work just with that (conditional) distribution.
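
The two-stage recipe in the comment above (simulate from the joint distribution, then keep only the draws whose simulated data match the real data) can be sketched in a few lines of Python. The specifics are assumptions for illustration only: a uniform prior on a success probability, binomial data, and the hypothetical function name `posterior_by_rejection`.

```python
import numpy as np

def posterior_by_rejection(k_obs, n_trials, n_sims=200_000, seed=0):
    """Posterior draws by brute-force conditioning on the observed count."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 1.0, size=n_sims)  # unknowns, drawn from the prior
    k_sim = rng.binomial(n_trials, theta)       # possible knowns, given each unknown
    return theta[k_sim == k_obs]                # keep where possible knowns = knowns

draws = posterior_by_rejection(k_obs=7, n_trials=10)
print(len(draws), draws.mean())
```

For this conjugate case the exact posterior is Beta(8, 4), so the mean of the kept draws should be close to 8/12. Exact matching is only practical for small discrete data, which is one reason real samplers do something cleverer, but the logic is the same.
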
Thomas Wiecki on September 4, 2012 at 12:05 pm said:

In regard to using Bayesian methods on large data sets I found this paper to be very interesting:

S. Ahn, A. Korattikara and M. Welling (2012)
Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring
ICML 2012 http://www.ics.uci.edu/~welling/publications/papers/SGFS_v10_final.pdf

It shows (among other things) that the gradient can be computed over a subset of the data (i.e. using minibatches).
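
As a rough illustration of the minibatch idea only (this is not the algorithm from the paper), the Python sketch below checks that a gradient computed on a random subset, rescaled by N divided by the batch size, averages out to the full-data gradient. The logistic-regression likelihood, the simulated data, and the function names `loglik_grad` and `minibatch_grad` are all assumptions made for the example.

```python
import numpy as np

def loglik_grad(beta, X, y):
    """Gradient of the logistic-regression log-likelihood, summed over rows."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

def minibatch_grad(beta, X, y, batch_size, rng):
    """Estimate of the full gradient from a random subset, rescaled by N / batch_size."""
    n = X.shape[0]
    idx = rng.choice(n, size=batch_size, replace=False)
    return (n / batch_size) * loglik_grad(beta, X[idx], y[idx])

# Simulated data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_beta = np.array([1.0, -1.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))

beta = np.zeros(3)
full = loglik_grad(beta, X, y)
# Average many minibatch estimates; each uses 10% of the data.
avg = np.mean([minibatch_grad(beta, X, y, 100, rng) for _ in range(2000)], axis=0)
print(full)
print(avg)
```

Each individual minibatch gradient is noisy, which is the price of touching only a fraction of the data per step.
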
Wonks Anonymous on September 4, 2012 at 12:33 pm said:

Greg Cochran has an amusing anecdote about someone reinventing the wheel while working outside their normal field of expertise:
http://westhunt.wordpress.com/2012/09/02/my-old-boss/
Tim on September 4, 2012 at 1:10 pm said:

Tom Minka, Michael I. Jordan, and a number of other “machine learning” people have pushed very hard in the direction of statistically principled, generative model-based approaches for so-called “Big Data”. Perhaps the greatest successes have been in latent Dirichlet allocation and topic models for natural language, but the field of Bayesian nonparametrics seems to be pushed along in equal parts by statisticians and EECS/ML people, which is *awesome*.

Anyways, one of Minka’s papers is illustrative: http://research.microsoft.com/en-us/um/people/minka/papers/minka-discriminative.pdf

Another, from Bishop: http://research.microsoft.com/en-us/um/people/cmbishop/downloads/bishop-valencia-07.pdf

A more recent paper, from Jordan, on very flexible, conjugate priors for sparse empirical Bayes “lasso”-flavored models: http://jmlr.csail.mit.edu/papers/volume13/zhang12b/zhang12b.pdf (I met a fellow from Michigan State at the IMS meeting this year who informed me that there are several distributions conjugate to exponential power distributions; the paper in question shows that the choice Jordan and students made is computationally tractable)

All of the above are from “machine learning” / EECS folks, so lumping everyone in with (say) Vapnik or (on the other hand) Breiman is not so cool. In most fields, generative models are preferred as a stepping stone to design further experiments, to understand pieces of the picture, and to explain results. In signal processing it might be the case that all you really want is to determine whether a handwritten digit is a “2” or a “3” with great accuracy.

Intuitively, it would seem that the former (complex problems, broken down iteratively with lots and lots of data) are proliferating, while simpler problems (discriminative, one-shot affairs) are less fertile ground for research. But, predictions are difficult, especially when they concern the future; thus I have pointed only at things currently in evidence from the “ML crowd”, which seem to signal convergence with Bayesian statistical thinking.

Been following your blog (and Stan!) too long not to try and acquit myself by posting something useful (if not tightly edited… sorry, that takes too much time).
- konrad on September 4, 2012 at 7:53 pm said:
  
  This sheds a lot of light on something that’s been puzzling me. Most of what I know about ML&Stats is from the generative ML stable (Minka, Jordan, Bishop, etc), so the idea that some people primarily associate ML with NNs, SVMs or other non-generative approaches is new to me – for the last decade or so those approaches have struck me as historical fossils that don’t fit well into the contemporary ML framework. When I think “non-generative” I primarily think of frequentist estimation theory, so for me the ML vs Stats associations are the other way around. (There is, of course, another important distinction within CS: those who understand the likelihood concept vs those who don’t. The latter group is surprisingly well represented in software engineering, but I’ve always assumed none of them do ML.)
  
  Does anyone know how pervasive the perception of ML as primarily focussing on non-generative approaches is in (a) the statistics and (b) the ML communities?
Bob Carpenter on September 4, 2012 at 2:20 pm said:

If we could get the machine learning practitioners to evaluate based on log probability rather than on 0/1 loss, then machine learning would look a whole lot more like statistics. But when applications tend to call for more discrimination than just 0/1 responses, the machine learning researchers tend to turn toward ranking-based approaches with their own evaluation metrics rather than trying to do things like calibrate probabilistic estimates.
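
A small Python sketch of the difference between the two evaluation criteria. The labels and predicted probabilities are made up for illustration, as are the names `p_hedged` and `p_overconfident`; the two sets of predictions make exactly the same classification errors but differ in how confident they are.

```python
import numpy as np

def zero_one_loss(y, p):
    """Fraction misclassified when thresholding probabilities at 0.5."""
    return float(np.mean((p >= 0.5).astype(int) != y))

def log_loss(y, p):
    """Negative mean log probability assigned to the observed labels."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Made-up labels and predictions; both predictors get the last two cases wrong.
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
p_hedged = np.array([0.7, 0.8, 0.6, 0.3, 0.2, 0.4, 0.4, 0.6])
p_overconfident = np.array([0.99, 0.99, 0.99, 0.01, 0.01, 0.01, 0.01, 0.99])

print(zero_one_loss(y, p_hedged), zero_one_loss(y, p_overconfident))
print(log_loss(y, p_hedged), log_loss(y, p_overconfident))
```

Both predictors have a 0/1 loss of 0.25, so that metric cannot tell them apart. The log loss can: roughly 0.50 for the hedged predictions against roughly 1.16 for the overconfident ones, which are heavily penalized for being certain and wrong.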

In terms of examining features, there’s a growing machine learning literature on L1-regularized regressions (equivalently MAP estimation with a Laplace [i.e., double exponential] prior if you’re Bayesian and approximating with point estimates). A substantial body of current research (e.g., see the latest JMLR issue) is devoted to inducing sparsity by group; this is very closely related to multilevel modeling of variance. (Which is not to say I don’t agree with Tim’s post above that this isn’t the only thing machine learning researchers do.)
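
The equivalence mentioned in parentheses above can be written out in one line. The notation is an assumption for illustration: coefficients $\beta_j$ with independent Laplace priors of scale $b$, a generic likelihood $p(y \mid \beta)$, and penalty weight $\lambda$.

```latex
\begin{aligned}
p(\beta_j) &= \frac{1}{2b}\exp\!\left(-\frac{|\beta_j|}{b}\right), \\
\hat\beta_{\mathrm{MAP}}
  &= \arg\max_{\beta}\Big[\log p(y \mid \beta) + \sum_j \log p(\beta_j)\Big] \\
  &= \arg\min_{\beta}\Big[-\log p(y \mid \beta) + \lambda \sum_j |\beta_j|\Big],
  \qquad \lambda = \frac{1}{b}.
\end{aligned}
```

The constant $\log(1/2b)$ drops out of the optimization, so the posterior mode under the Laplace prior is exactly the L1-penalized maximum likelihood estimate, with a tighter prior (smaller $b$) corresponding to a heavier penalty.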

The other area where I’ve seen hierarchical modeling is in what the machine learning researchers have taken to calling “domain adaptation”. A domain is like a group in a multilevel model. It might be changing genre from newswire to novels for written text, or changing topic from kitchen gadgets to books in product reviews.

SVMs are very different than even closely related probabilistic models like logistic regression in that there are no probabilities involved in the SVM models — they’re purely geometric.
John Mashey on September 4, 2012 at 6:06 pm said:

‘One thing I have noticed is that there seems to be a distinction between data analysis as approached from a statistical perspective (e.g., generalized linear models) versus from a computer science perspective (e.g., SVM), even if—as I understand it—mathematically some of the results/algorithms are the same.’

Similar math (often with different terminology) often gets used in multiple application areas:

See The Literature On Cluster Analysis, 1978, by Blashfield and Aldenderfer. (paywall)

‘There has been an explosion of interest in cluster analysis since 1960. The “explosion” of this literature is documented through: (a) a rapid growth in the number of articles which have been published using this technique; (b) the wide range of sciences interested in clustering; (c) the large and growing number of software programs for performing cluster analysis; (d) the formation of cliques of cluster analysis users; and (e) the resulting fragmentation of terminology into jargon which restricts interdisciplinary communication. In response to the effects of this explosion, it is expected that the future literature on clustering will contain a number of attempts at consolidation. Nevertheless, the facts that cluster analysis has no scientific home, that clustering methods are not based upon a well-enunciated statistical theory and that cluster analysis is tied to the complex topic of classification means that the consolidation of this literature will be difficult.’

Often there were half a dozen different labels for equivalent methods, driven by clique formation in social networks.
That’s 30+ years old, I don’t know the extent to which this has changed. The last line of the paper says:

‘Given that a major goal of cluster analysis is to form homogenous groups, the heterogeneity of its literature is ironic.’
konrad on September 4, 2012 at 7:33 pm said:

Wait, did you just call me a psychologist?
Matt Bogard on September 8, 2012 at 2:09 am said:

I’ve often thought about/noticed MacNamara’s distinction. ( http://econometricsense.blogspot.com/2011/01/classical-statistics-vs-machine.html ) I may get a little carried away in my discussion, (looking back, I’m not sure all the dialogue about LPMs and robustness really belongs, but after all I’m an amateur applied econometrician) but I do think MacNamara’s distinction is the general idea of Leo Breiman’s paper ‘Statistical Modeling: The Two Cultures’ (Statistical Science, 2001).
