Question 14 of my final exam for Design and Analysis of Sample Surveys

Posted on May 24, 2012 5:00 PM by Andrew

14. A public health survey of elderly Americans includes many questions, including “How many hours per week did you exercise in your most active years as a young adult?” and also several questions about current mobility and health status. Response rates are high for the questions about recent activities and status, but there is a lot of nonresponse for the question on past activity. You are considering imputing the missing values on the question, “How many hours per week did you exercise in your most active years as a young adult?” Which of the following statements are basically correct? (Indicate all that apply.)

(a) If done reasonably well, imputation is preferred to available-case and complete-case analysis.

(b) If you do impute, you should also present the available-case and complete-case analysis and analyze how the imputed estimates differ.

(c) It is OK to include current health status variables as predictors in a model imputing past activities: anything that adds information is good when imputing.

(d) It is probably not a good idea to include current health status variables as predictors in a model imputing past activities: current health is possibly influenced by past activities, and including a causal outcome can bias estimates of a treatment variable.

(e) If you fit a regression model and impute your best prediction for each person (rather than imputing random draws from the predictive distribution), you can have problems because you will be more likely to impute extreme values.

(f) It is a good idea to fit a logistic regression predicting response/nonresponse to the question of interest as a way to look for systematic differences between respondents and nonrespondents on this question.

Solution to question 13

From yesterday:

13. A survey of American adults is conducted that includes too many women and not enough men in the sample. In the resulting weighting, each female respondent is given a weight of 1 and each male respondent is given a weight of 1.5. The sample includes 600 women and 380 men, of whom 400 women and 100 men respond Yes to a particular question of interest. Give an estimate and standard error for the proportion of American adults who would answer Yes to this question if asked.

Solution: Define W1 = 600*1/(600*1+380*1.5) = 0.51, W2 = 380*1.5/(600*1+380*1.5) = 0.49, p1.hat = 400/600 = 0.67, p2.hat = 100/380 = 0.26. The desired estimate is then W1*p1.hat + W2*p2.hat = 0.47 and the standard error is sqrt(W1^2*p1.hat*(1-p1.hat)/600 + W2^2*p2.hat*(1-p2.hat)/380)=0.015.
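The arithmetic above can be checked with a short script. This is only a sketch of the same calculation in Python: the function name `weighted_proportion` and its argument names are made up for illustration, and it assumes, as the solution does, that women and men are treated as two independent strata with simple binomial variances.

```python
import math

def weighted_proportion(counts, yes, weights):
    """Weighted estimate and standard error for a proportion.

    counts, yes, weights are per-group lists (hypothetical names):
    sample size, number answering Yes, and survey weight.
    """
    # Each group's share of the total weight (W1, W2 in the solution)
    total_weight = sum(n * w for n, w in zip(counts, weights))
    shares = [n * w / total_weight for n, w in zip(counts, weights)]

    # Within-group sample proportions (p1.hat, p2.hat)
    p_hats = [y / n for y, n in zip(yes, counts)]

    # Estimate: weighted average of the group proportions
    estimate = sum(W * p for W, p in zip(shares, p_hats))

    # Variance: sum of W^2 * p * (1 - p) / n over groups
    variance = sum(W**2 * p * (1 - p) / n
                   for W, p, n in zip(shares, p_hats, counts))
    return estimate, math.sqrt(variance)

# Women first, then men, using the numbers from question 13
estimate, se = weighted_proportion(counts=[600, 380],
                                   yes=[400, 100],
                                   weights=[1.0, 1.5])
print(round(estimate, 2), round(se, 3))
```

Rounded to the same precision as the solution, this reproduces the estimate of 0.47 and the standard error of 0.015.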

2 thoughts on “Question 14 of my final exam for Design and Analysis of Sample Surveys”

Daniel Lakeland on May 25, 2012 at 2:44 am said:

The only thing that seems definitely wrong to me is (e): if you impute a “best prediction,” it is more likely to be a central value (near the average), whereas random draws are more likely to include tail values, unless I’m misunderstanding the question.

(c) and (d) both seem to be correct (in certain circumstances) even though they’re in essence supposed to be opposites. The fact of the matter is that the “goodness” of imputing the past from the present depends heavily on what you’re going to do with the imputed values. If you plan to use the present to predict the past and then use the predicted past to determine connections between the past and the present then you’re probably asking for circular logic type trouble.

But if you’re keeping the two models fairly orthogonal then you could be fine (for example, imputing the past exercise history from present economic status, occupation, and location data and then predicting present medical treatment response from the imputed past exercise history, or something like that). Also, bias is not necessarily a bad thing. I’d rather have a biased estimate of how well a treatment will work that is on average reasonably accurate than an unbiased estimate whose random error is huge. Somehow I’m not sure these are the responses you expected from the exam question, though.
Pingback: Question 15 of my final exam for Design and Analysis of Sample Surveys « Statistical Modeling, Causal Inference, and Social Science