Missing-data imputation question

Posted on January 11, 2006 2:55 PM by Andrew

Joost van Ginkel writes,

I [Joost] am currently developing a multiple-imputation method for missing data in test and questionaire data consisting of N persons and J Items. This method is based on the following two-way ANOVA model:

X_ij = mu_i. + mu_.j – mu_.. + e_ij,

where mu_.. is the overall mean, mu_i. is a row (person) mean of person i, mu_.j is a column (item) mean of item j, and e_ij is a random error component with variance sigma^2. I would like to impute test data according to this model using the data augmentation algorithm (Tanner & Wong, 1987). As you may know, this algorithm consists of two steps, namely a P(osterior) step and an I(mputation) step. In the P-step I want to draw the parameters from their posterior distributions, given the observed data, so I was considering to draw mu_.., mu_i., and mu_.j. A problem with this approach is, however, that the mean of all mu_i.’s as well as the mean of all mu_.j’s must be equal to mu. My question is: how can I sample these parameters with those restrictions?

One solution I thought of myself was to define the expected item score of person i on item j as

E(X_ij) = mu_ij = mu_i. + mu_.j – mu_..,

and to consider mu_ij as the only parameter to be drawn. In this way, the marginals are not drawn so that you get around the problem of those restrictions. Another thing I was thinking of was to draw all mu_i.’s, compute mu_.. as the average of all mu_i.’s, and finally draw mu_.1 to mu_.(J-1) and fix mu_.J. My question is: are these approaches justified? If not, do you have any other suggestions about how to resolve this problem?

My response:

The easiest approach is to think of the model as y_ij = mu + alpha_i + beta_j + e_ij, and to model the alphas and betas as coming from different normal distributions with population mean 0 (but not necessarily sample mean 0). That’s a straightforward crossed multilevel model which you could fit in lmer() in R, for example. You’re getting tangled because you’ve defined your model in terms of estimates (mu_.i, etc), rather than parameters (alpha_i, etc), thus you never specified a full probability model. Also you can see this paper for some related comments in a logistiic item response model.