Convergence Monitoring for Non-Identifiable and Non-Parametric Models

Becky Passonneau and colleagues at the Center for Computational Learning Systems (CCLS) at Columbia have been working on a project for ConEd (New York's major electric utility) to rank structures by their vulnerability to secondary events (e.g., transformer explosions, cable meltdowns, electrical fires). They've been using BayesTree, the R implementation of Chipman, George, and McCulloch's Bayesian Additive Regression Trees (BART).

BART is a Bayesian non-parametric method that is non-identifiable in two ways. First, it is an additive tree model with a fixed number of trees whose indexes aren't identified: you get exactly the same predictions if you swap the order of the trees. This is the same kind of non-identifiability you get with any mixture model (additive or interpolated) with an exchangeable prior on the mixture components. Second, the trees themselves vary in structure from sample to sample, both in the number of nodes and in their topology (depth, branching, etc.). This is the kind of non-identifiability you get with many Bayesian non-parametric models.

The only identified parameter in a BART model is the scale parameter of the centered, normal noise distribution.
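For that one parameter, at least, the standard machinery applies directly. Here is a minimal sketch in R using simulated data and BayesTree's bart() function; the output field names I use ($sigma, $yhat.test) are from memory, so treat them as assumptions to check against the package documentation. I use a continuous toy response so that the noise scale is actually in the model, and independent bart() calls with different seeds stand in for multiple chains.

```r
library(BayesTree)
library(coda)

## Simulated data standing in for the real problem; continuous response
## so that the noise scale sigma is actually part of the model.
set.seed(1)
n <- 600
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- sin(x[, 1]) + 0.5 * x[, 2]^2 + rnorm(n, sd = 0.3)

## Hold out some observations for the predictive checks further down.
test <- sample(n, 100)
x.train <- x[-test, ]; y.train <- y[-test]
x.test  <- x[test, ];  y.test  <- y[test]

## BayesTree runs a single chain per call, so independent calls with
## different seeds play the role of multiple chains.
n.chains <- 4
fits <- lapply(1:n.chains, function(chain) {
  set.seed(chain)
  bart(x.train = x.train, y.train = y.train, x.test = x.test,
       ndpost = 1000, nskip = 500, verbose = FALSE)
})

## Post-burn-in draws of sigma (assuming the field is fit$sigma): compute
## the potential scale reduction factor R-hat across the chains.
sigma.draws <- mcmc.list(lapply(fits, function(fit) mcmc(fit$sigma)))
gelman.diag(sigma.draws)
```

Of course, an R-hat near 1 for sigma is necessary but nowhere near sufficient, which is exactly the problem.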

For models with unidentified parameters, how can we (a) monitor convergence, and (b) compute MCMC error in the predictions?

The goal is to support decision making, so predictive accuracy is what matters most for these models. They're also interested in which of the many features they're plugging in turn out to be important.

The discussion of convergence in the Chipman et al. paper was rather vague. On p. 10, they say they iterate until they reach “satisfactory convergence”, but don’t say how they measure convergence. On p. 26, they say “to gauge MCMC convergence, we performed four independent repetitions of 250,000 MCMC iterations and obtained essentially the same results each time”, but they don’t say what a result is in this context. I’m assuming this is predictive accuracy of some sort, such as (root) mean square error or log loss.
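To make "essentially the same results" concrete, one option is to treat each independent run as a repetition and compare a scalar predictive summary across repetitions. Continuing the sketch above (and inheriting its assumptions about the BayesTree output fields), with held-out RMSE as the summary:

```r
## Continuing the sketch above: one held-out RMSE per independent repetition.
## fit$yhat.test holds draws of f(x) at the test points (ndpost x n.test),
## so the posterior-mean prediction for each point is a column mean.
rmse <- sapply(fits, function(fit) {
  pred <- colMeans(fit$yhat.test)
  sqrt(mean((y.test - pred)^2))
})
print(rmse)  # "essentially the same results" = these agree up to MCMC noise

## Or compare the posterior-mean predictions themselves across repetitions.
pred <- sapply(fits, function(fit) colMeans(fit$yhat.test))
max(abs(pred - rowMeans(pred)))
```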

I suggested that they monitor the linear predictors, $latex \hat{z}_n$, or the residuals $latex y_n - \mbox{logit}^{-1}(\hat{z}_n)$. Andrew suggested monitoring a “whole suite of predictive outcomes.” I took that to mean that if there are known subclasses, we could break them out, and we could also use performance on held-out data. Because the models are intended to be used predictively as classifiers to support decision making, we could also measure various kinds of loss (equivalently, utility) of the predictions on held-out data. I don't know how stable the use of features is, so it may be hard to monitor the kinds of quantities Chipman et al. track in their paper (like the number of tree nodes that split on a given feature).
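Concretely, and again continuing the sketch above, the per-observation predictions are just more quantities to monitor: stack each chain's draws of $latex \hat{z}_n$, compute R-hat observation by observation, and look at the worst cases, along with a Monte Carlo standard error for each posterior-mean prediction. The same code applies to residuals or to a per-observation held-out loss, and in the binary-response setting to the latent values before the inverse link.

```r
## Continuing the sketch above: monitor the per-observation predictions
## the same way one would monitor ordinary parameters.
rhat.per.obs <- sapply(seq_along(y.test), function(i) {
  draws <- mcmc.list(lapply(fits, function(fit) mcmc(fit$yhat.test[, i])))
  gelman.diag(draws)$psrf[1]  # point estimate of R-hat for observation i
})
summary(rhat.per.obs)         # the worst cases are the ones to worry about

## Monte Carlo standard error of each posterior-mean prediction, using the
## effective sample size pooled across the independent chains.
mcse.per.obs <- sapply(seq_along(y.test), function(i) {
  chains <- mcmc.list(lapply(fits, function(fit) mcmc(fit$yhat.test[, i])))
  pooled <- unlist(lapply(fits, function(fit) fit$yhat.test[, i]))
  sd(pooled) / sqrt(effectiveSize(chains))
})
summary(mcse.per.obs)
```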

Are there other techniques we should look at?