Imputing categorical regressors

Jonathan writes,

A grad student, Gabriel Katz (no relation), and I are working on some MCMC code with survey data to handle misreporting issues in voting data. (An old idea of mine that I presented an early draft of at Columbia years ago). Since we have coded up the models in BUGS/JAGS, we decided we might as well try to also handle some of the missing data in covariates. We are, however, having a little trouble with the imputation of missing categorical covariates. We have been trying ordered probit-logit priors (so as to avoid using the easy way out using normal priors), but the problem is that it is quite hard to get BUGS/JAGS to bracket-slice for certain categories of the ordered variables. The problem seems to be in choosing good priors for the thresholds and means of the categorical variable. We don’t know of any principled way to do this. We have tried several different values for the variance of the priors, but in our model we have 5 categorical variables with a total of 22 thresholds, so trial-and-error seems hopeless. Since this is really just a secondary problem for us, perhaps we should ignore the missingness problem as most researchers do. If you have any suggestions or pointers, it would be much appreciated.

My reply:

1. Maybe the best approach would be to treat the categorical variables as continuous, impute them using a continuous model, and then round the imputed values at the end if you want discrete predictors. That’s what Chuanhai, Gary, and I did in our 1998 paper.
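
Here is a minimal sketch of that impute-then-round idea, using simulated data, a single continuous predictor, and an ordinary linear regression as the imputation model (this is just an illustration, not the code from the 1998 paper):

    # A sketch, not the 1998 paper's code: impute an ordinal covariate as if it
    # were continuous, then round the imputations back to the observed categories.
    set.seed(123)
    n  <- 200
    x1 <- rnorm(n)
    z  <- pmin(pmax(round(1 + 0.8 * x1 + rnorm(n)), 1), 5)  # ordinal covariate, categories 1-5
    z[sample(n, 40)] <- NA                                   # introduce some missingness

    obs  <- !is.na(z)
    fit  <- lm(z ~ x1, subset = obs)                         # continuous imputation model
    mu   <- predict(fit, newdata = data.frame(x1 = x1[!obs]))
    draw <- rnorm(sum(!obs), mu, summary(fit)$sigma)         # draw from the predictive distribution
    z[!obs] <- pmin(pmax(round(draw), 1), 5)                 # round and clip back to valid categories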

2. Another approach is to impute each variable conditional on all the others. That’s what we do in our “mi” package for multiple imputation in R.
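
For what it's worth, here is a rough sketch of how that conditional approach might look, assuming the missing_data.frame()/mi()/complete() workflow in the mi package (check the documentation for your installed version, since the interface has changed over the years):

    # A sketch assuming mi's missing_data.frame()/mi()/complete() workflow;
    # argument names may differ across package versions.
    library(mi)
    set.seed(123)
    n   <- 200
    x1  <- rnorm(n)
    z   <- factor(pmin(pmax(round(1 + 0.8 * x1 + rnorm(n)), 1), 5), ordered = TRUE)
    z[sample(n, 40)] <- NA
    dat <- data.frame(x1 = x1, z = z)

    mdf <- missing_data.frame(dat)            # classifies each column (continuous, ordered, ...)
    imp <- mi(mdf, n.iter = 30, n.chains = 4) # impute each variable conditional on the others
    completed <- complete(imp, m = 4)         # list of completed data sets

Each completed data set can then be run through the analysis model and the results combined in the usual multiple-imputation way.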

3. Our bayespolr() function (in the “arm” package in R) has a default prior distribution for ordered logit that works pretty well. See here for more background on the ideas.
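
A quick illustration with simulated data and made-up variable names (bayespolr() follows the interface of polr() in MASS, with weakly informative default priors on the coefficients):

    # A sketch: ordered logit with arm's default prior, on simulated data.
    library(arm)
    set.seed(123)
    n   <- 200
    x1  <- rnorm(n)
    x2  <- rbinom(n, 1, 0.5)
    z   <- factor(pmin(pmax(round(2 + 0.8 * x1 - 0.5 * x2 + rnorm(n)), 1), 5), ordered = TRUE)
    dat <- data.frame(z = z, x1 = x1, x2 = x2)

    fit <- bayespolr(z ~ x1 + x2, data = dat)  # ordered logit, default prior distribution
    coef(fit)                                  # slope coefficients
    fit$zeta                                   # estimated cutpoints (thresholds)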