Check your missing-data imputations using cross-validation

Elena Grewal writes:

I am currently using the iterative regression imputation model as implemented in the Stata ICE package. I am using data from a survey of about 90,000 students in 142 schools and my variable of interest is parent level of education. I want only this variable to be imputed with as little bias as possible as I am not using any other variable. So I scoured the survey for every variable I thought could possibly predict parent education. The main variable I found is parent occupation, which explains about 35% of the variance in parent education for the students with complete data on both. I then include the 20 other variables I found in the survey in a regression predicting parent education, which explains about 40% of the variance in parent education for students with complete data on all the variables.

My question is this: many of the other variables I found have more missing values than the parent education variable, and also, although statistically significant predictors in the complete case sample, have very small coefficients. Is this a problem? Is my method of including all the variables that were statistically significant predictors in the imputation model a valid strategy for deciding what to include in the imputation?

My reply:

Your imputation plan seems reasonable. To check it, you can do some cross-validation: randomly remove 1/5 (say) of the observations for your variable of interest, run the algorithm, then compare the held-out values to the random imputations. We did some of this in our 1998 paper but I still haven’t gotten around to formalizing the method.
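
Here is a minimal sketch of that check in Python, under stated assumptions. It is not the Stata ICE procedure: the imputer is a hypothetical stand-in (a linear regression plus random residual noise), the predictors are assumed fully observed, and the data are simulated. The names `impute_draws` and `holdout_check` are made up for illustration. The sketch hides 1/5 of the observed values, imputes them many times, and compares the held-out values to the random imputations through average bias, the error of a single imputation, and the coverage of 90% intervals.

```python
import numpy as np

def impute_draws(X, y, missing, n_draws, rng):
    """Stand-in imputer (an assumption, not ICE): fit a linear regression
    on the rows where y is observed, then draw imputations as the
    prediction plus normal residual noise."""
    A = np.column_stack([np.ones(len(X)), X])
    obs = ~missing
    beta, *_ = np.linalg.lstsq(A[obs], y[obs], rcond=None)
    resid = y[obs] - A[obs] @ beta
    sigma = resid.std(ddof=A.shape[1])
    mean = A[missing] @ beta
    # Shape (n_draws, number of missing rows)
    return mean + sigma * rng.standard_normal((n_draws, int(missing.sum())))

def holdout_check(X, y, frac=0.2, n_draws=200, seed=0):
    """Hide a random fraction of the observed y values, impute them,
    and compare the imputations to the hidden truth."""
    rng = np.random.default_rng(seed)
    observed = np.flatnonzero(~np.isnan(y))
    held = rng.choice(observed, size=int(frac * len(observed)), replace=False)

    y_masked = y.copy()
    y_masked[held] = np.nan            # artificially remove the held-out values
    missing = np.isnan(y_masked)       # truly missing plus held out

    draws = impute_draws(X, y_masked, missing, n_draws, rng)

    # Keep only the columns that correspond to the held-out rows
    cols = np.searchsorted(np.flatnonzero(missing), held)
    draws_held = draws[:, cols]
    truth = y[held]

    lo, hi = np.percentile(draws_held, [5, 95], axis=0)
    return {
        "bias": float(np.mean(draws_held.mean(axis=0) - truth)),
        "rmse_single_draw": float(np.sqrt(np.mean((draws_held[0] - truth) ** 2))),
        "coverage_90": float(np.mean((truth >= lo) & (truth <= hi))),
    }

# Simulated example: two predictors, about 10% of y missing at random
rng = np.random.default_rng(1)
n = 2000
X = rng.standard_normal((n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.standard_normal(n)
y[rng.random(n) < 0.10] = np.nan

result = holdout_check(X, y)
print(result)
```

If the imputation model is reasonable, the bias should be near zero and the 90% intervals should cover roughly 90% of the held-out values. Individual random imputations are not expected to match the held-out values exactly, since the draws are meant to reflect uncertainty.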

The cross-validation check won’t save you if you have serious nonignorable missingness (for example, large values more likely than small values to be misreported), but it can be thought of as a minimal check.

8 thoughts on “Check your missing-data imputations using cross-validation”

  1. Back in this post you said:
    Cross-validation is great and I’ve used it on occasion. I don’t really understand it. This is not a criticism, I just want to think harder about it at some point.
    Have things changed; if so, can I request a fuller post on cross-validation?

  2. I’m not an imputation specialist, but I do know that there are several worthwhile imputation techniques out there. Random Forest has an interesting way to impute data that you may consider.

  3. Just as a note, our Amelia package has a cross-validation check for imputation models (or at least for Amelia imputation models). We call our function “overimpute” since it involves imputing over observed values. Note that the performance of the cross-validation will depend heavily on the amount of missing data in the given observation. Somewhat obviously, observations with fewer observed variables will have “worse” imputations in the sense that the variance of the imputations will be quite high. I put worse in scare quotes because this just reflects the uncertainty inherent in the data, which is actually neutral.

    • Matt:

      Thanks for passing this on. It makes sense that Gary and I are both interested in this idea, given that we used it in our jointly-authored 1998 article on missing-data imputation!

  4. How does one “compare the held-out values to the random imputations?” I don’t think the goal should be to perfectly replicate the observed values.

    • Juned:

      If you could perfectly replicate the observed values, that would be great. Realistically, though, you just want to get as close as is possible given your data and assumptions.

  5. Awesome! “overimpute” is just what I was looking for. Thanks Matt for passing along the link, and thanks Andrew for your reply to my email and for this blog post.
