Becky Passonneau and colleagues at the Center for Computational Learning Systems (CCLS) at Columbia have been working on a project for ConEd (New York’s major electric utility) to rank structures based on vulnerability to secondary events (e.g., transformer explosions, cable meltdowns, electrical fires). They’ve been using the R implementation BayesTree of Chipman, George and McCulloch’s Bayesian Additive Regression Trees (BART).

BART is a Bayesian non-parametric method that is non-identifiable in two ways. Firstly, it is an additive tree model with a fixed number of trees, the indexes of which aren’t identified (you get the same predictions in a model swapping the order of the trees). This is the same kind of non-identifiability you get with any mixture model (additive or interpolated) with an exchangeable prior on the mixture components. Secondly, the trees themselves have varying structure over samples in terms of number of nodes and their topology (depth, branching, etc.). This is the kind of non-identifiability you get with many Bayesian non-parametric models.

The only identified parameter in a BART model is the scale parameter of the centered, normal noise distribution.

For models with unidentified parameters, how can we (a) monitor convergence, and (b) compute MCMC error in the predictions?

The goal is to support decision making, so predictive accuracy is the most important factor for the models. They’re also interested in which of the many features they’re plugging in turn out to be important.

The discussion of convergence in the Chipman et al. paper was rather vague. On p. 10, they say they iterate until they reach “satisfactory convergence”, but don’t say how they measure convergence. On p. 26, they say “to gauge MCMC convergence, we performed four independent repetitions of 250,000 MCMC iterations and obtained essentially the same results each time”, but they don’t say what a result is in this context. I’m assuming this is predictive accuracy of some sort, such as (root) mean square error or log loss.

I suggested that they monitor the linear predictors, , or the residuals . Andrew suggested monitoring a “whole suite of predictive outcomes.” I took that to mean that if there are known subclasses, we could break them out, and we could also use performance on held-out data. Because the models are intended to be used predictively as classifiers to support decision making, we could also measure various kinds of loss (equivalently utility) of the predictions on held-out data. I don’t know how stable the use of features is, so it may be hard to monitor the kinds of things Chipman et al. call out in their paper (like the number of nodes in the trees that are based on a given feature).

Are there other techniques we should look at?

Good to see something with some statistical content for a change. I was starting to get sick of all this political blathering!

You can always take a look at the joint log-likelihood of the entire model. Typically, with gigantic models like this, you won’t see your MCMC jump modes so you’ll see the joint log-likelihood of the model increase as the number of samples increases and then level off.

I would think one needs to figure out ways to compare structured models (such as trees) to deal with the non-identifiability concerning “the trees themselves have varying structure over samples in terms of number of nodes and their topology (depth, branching, etc.)”; it appears to be harder to deal with the first non-identifiability “that of the indexes of which aren’t identified ” — a brute force approach would consider combinations or all possibilities or something similar and this would be computationally VERY expensive.

Griffiths and Steyvers suggested a greedy approach to align mixture components across samples for latent Dirichlet allocation (LDA), which has non-identifiable indexes. When you compare samples, you can usually align the top topics (multinomial mixture components) pretty well, but the tail gets much noisier. By these measures, though, things don’t change much, whereas overall log probability (or log likelihood) do seem more stable.

My guess is they monitor the integrated log likelihood for a tree. The original paper that proposed MCMC for CART models is chipman et al 1998, a JASA paper. In that they derive the integrated log likelihoods for both a classification and for regressions trees with one common variance and for regression trees with differing variances. The other good paper on this subject is the followup paper, priors for bayesian cart shrinkage which describes the variance penalizing shrinkage prior they incorporate into the BART algorithm.

Regarding the multiple mode jumping, you could try simulated tempering or parallel tempering but this is really more work than necessary. The paper The multiset sampler Leman, et al -again JASA- uses multisets on trees to allow for a sort many trees at once technique. In my own work I’ve found this to be a useful algorithm for moving over multiple modes in both parametric bayesian models (like ZIP models) and in nonparametric bayes models like Bayesian CART.

One difficulty youh have with these models and part of the subject of my phd thesis is with large data sets this method really gets bogged down trying to sample over the space effectively. So if you have a bug data set -I’m guessing you do- you may also have this problem. If you want to know what I do to alleviate this you can shoot me an email to discuss or wait another year for this part of my thesis.

Authors generally have been silent on the point of diagnosing or monitoring convergence with these models. An unfortunate issue when dealing with diagnosing convergence but fortunately this leaves room for some contributions.

[…] Convergence monitoring for non-identifiable and non-parametric models […]

[…] Convergence Monitoring for Non-Identifiable and Non-Parametric … […]