Bayesian Methods for Variable Selection

The following is from Marina Vannucci, who will be speaking in the Bayesian working group on Oct 26.

I will briefly review Bayesian methods for variable selection in regression settings. Variable selection pertains to situations where the aim is to model the relationship between a specific outcome and a subset of potential explanatory variables, and there is uncertainty about which subset to use. Variable selection methods can help assess the importance of different predictors, improve predictive accuracy, and reduce the cost of collecting future data. Bayesian methods for variable selection were proposed by George and McCulloch (JASA, 1993). Brown, Vannucci and Fearn (JRSSB, 1998) generalized the approach to the case of multivariate responses. The key idea of the model is to use a latent binary vector to index the different possible subsets of variables (models). Priors are then imposed on the regression parameters as well as on the set of possible models. Selection is based on the posterior model probabilities, obtained, in principle, by Bayes' theorem. When the number of possible models is too large to enumerate (with p predictors there are 2^p possible subsets), Markov chain Monte Carlo (MCMC) techniques can be used as stochastic search techniques to look for models with high posterior probability. In addition to providing a selection of the variables, the Bayesian approach allows model averaging, where the prediction of future values of the response variable is computed by averaging over a range of likely models.
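To make the latent binary vector concrete, here is a minimal sketch for a problem small enough that all 2^p subsets can be enumerated, so no MCMC is needed. It is not the prior used in the papers cited above: for simplicity it assumes Zellner's g-prior on the regression coefficients (with g = n) and an independent Bernoulli(w) prior on each inclusion indicator. The function names, the simulated data and all parameter values are illustrative assumptions.

```python
import itertools
import numpy as np

def log_bf_gprior(y, X, gamma, g):
    """Log Bayes factor of model `gamma` against the intercept-only model.

    Assumes Zellner's g-prior on the selected coefficients (an illustrative
    choice, not the prior of the cited papers)."""
    n = len(y)
    k = int(np.sum(gamma))
    if k == 0:
        return 0.0
    yc = y - y.mean()
    Xc = X[:, gamma] - X[:, gamma].mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    resid = yc - Xc @ beta
    r2 = 1.0 - (resid @ resid) / (yc @ yc)
    return 0.5 * (n - 1 - k) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))

def enumerate_models(y, X, g=None, w=0.5):
    """Posterior probabilities of all 2^p models, indexed by binary vectors."""
    n, p = X.shape
    g = float(n) if g is None else g
    # Each row is one latent binary vector gamma (True = variable included).
    gammas = np.array(list(itertools.product([False, True], repeat=p)))
    log_post = np.empty(len(gammas))
    for i, gam in enumerate(gammas):
        k = int(gam.sum())
        log_prior = k * np.log(w) + (p - k) * np.log1p(-w)
        log_post[i] = log_bf_gprior(y, X, gam, g) + log_prior
    # Bayes' theorem: normalize over the whole model space.
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Marginal posterior inclusion probability of each variable.
    inclusion = post @ gammas.astype(float)
    return gammas, post, inclusion

# Simulated example: only variables 0 and 2 truly affect the response.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.standard_normal(n)

gammas, post, inclusion = enumerate_models(y, X)
best = gammas[np.argmax(post)]
print("posterior inclusion probabilities:", np.round(inclusion, 3))
print("highest-probability model:", np.flatnonzero(best))
```

With larger p the loop over `gammas` becomes infeasible, which is where a stochastic search over the binary vector (for example, proposing to add or delete one variable at a time) would take the place of full enumeration.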

I will describe extensions to multinomial probit models for simultaneous classification of the samples and selection of the discriminatory variables. The approach taken makes use of data augmentation in the form of latent variables, as proposed by Albert and Chib (JASA, 1993). The key to this approach is to assume the existence of a continuous unobserved, or latent, variable underlying the observed categorical response. When the latent variable crosses a threshold, the observed category changes. A linear association is assumed between the latent response, Y, and the covariates, X. I will again consider the case of a large number of predictors and apply Bayesian variable selection methods and MCMC techniques. An extra difficulty here is represented by the latent responses, which are treated as missing and imputed from marginal truncated distributions. This work is published in Sha, Vannucci et al. (Biometrics, 2004). I have put a link to the paper on the group webpage.
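The data augmentation idea can be sketched in a much simpler setting than the one in the talk: a binary (rather than multinomial) probit model with no variable selection step. The sketch below assumes a N(0, tau2 * I) prior on the coefficients and alternates between imputing each latent response from a truncated normal and drawing the coefficients from their Gaussian full conditional. The function names, the prior variance and the simulated data are illustrative assumptions, not taken from the paper.

```python
from statistics import NormalDist
import numpy as np

_STD = NormalDist()  # standard normal, for cdf and inverse cdf

def sample_latent(mu, y, rng):
    """Draw z_i ~ N(mu_i, 1), truncated to z_i > 0 if y_i == 1, else z_i <= 0."""
    z = np.empty_like(mu)
    for i, (m, yi) in enumerate(zip(mu, y)):
        p0 = _STD.cdf(-m)  # P(z_i <= 0) before truncation
        lo, hi = (p0, 1.0) if yi == 1 else (0.0, p0)
        u = rng.uniform(lo, hi)
        # Keep u strictly inside (0, 1); a crude guard that is adequate
        # only when |mu_i| is moderate, as in this example.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        z[i] = m + _STD.inv_cdf(u)
    return z

def probit_gibbs(X, y, n_iter=300, burn_in=100, tau2=10.0, seed=0):
    """Gibbs sampler for binary probit regression with latent responses."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Posterior covariance of beta given z does not depend on z.
    V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)
    L = np.linalg.cholesky(V)
    beta = np.zeros(p)
    draws = []
    for it in range(n_iter):
        z = sample_latent(X @ beta, y, rng)       # impute latent responses
        beta = V @ (X.T @ z) + L @ rng.standard_normal(p)  # beta | z
        if it >= burn_in:
            draws.append(beta)
    return np.array(draws)

# Simulated example with two covariates.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
true_beta = np.array([1.5, -1.0])
y = (X @ true_beta + rng.standard_normal(200) > 0).astype(int)

draws = probit_gibbs(X, y)
print("posterior mean of beta:", np.round(draws.mean(axis=0), 2))
```

In the multinomial case each sample carries a vector of latent responses rather than a single one, and the truncation region depends on which category was observed, but the alternation between imputing latent values and updating the regression parameters follows the same pattern.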