Imputing count data

Guy asks:

I am analyzing an original survey of farmers in Uganda. I am hoping to use a battery of welfare proxy variables to create a single welfare index using PCA. I have a quick question which I hope you can find time to address:

How do you recommend treating count data (for example, # of rooms, # of chickens, # of cows, # of radios)? In my dataset these variables are highly skewed, with many responses at zero (which makes taking the natural log problematic). In the case of # of cows or chickens, several observations have values in the hundreds.

My response: Here’s what we do in our mi package in R. We split a variable into two parts: an indicator for whether it is positive, and the positive part. That is, y = u*v, where u = 1 if y > 0 and u = 0 otherwise, and v is the value of y when it is positive. Then u is binary and can be modeled using logistic regression, and v can be modeled on the log scale using only the cases with y > 0, so the zeros never have to be logged. At the end you can round to the nearest integer if you want to avoid fractional values.
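To make the y = u*v idea concrete, here is a minimal sketch in Python with numpy. It is not the code in mi, just an illustration under some assumptions: the function names (fit_logistic, impute_two_part), the toy data, and the single predictor x are all made up for the example, and the imputations are drawn from point estimates of the two models rather than from a full posterior, so a proper multiple-imputation procedure would also need to propagate parameter uncertainty.

```python
import numpy as np

def fit_logistic(X, u, n_iter=25, ridge=1e-6):
    """Logistic regression by Newton's method; X must include an intercept column.
    The tiny ridge term is only there to keep the linear solve stable."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (u - p) - ridge * beta
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def impute_two_part(x, y, rng):
    """Fill in missing values of y (coded as NaN) using the split y = u * v."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), x])   # add intercept
    obs = ~np.isnan(y)
    pos = obs.copy()
    pos[obs] = y[obs] > 0                       # observed and positive

    # Part 1: u = 1 if y > 0, modeled by logistic regression on observed cases.
    beta_u = fit_logistic(X[obs], (y[obs] > 0).astype(float))

    # Part 2: v = y given y > 0, modeled by linear regression on the log scale.
    log_v = np.log(y[pos])
    beta_v, *_ = np.linalg.lstsq(X[pos], log_v, rcond=None)
    sigma = (log_v - X[pos] @ beta_v).std(ddof=X.shape[1])

    # Draw u and v for the missing cases, multiply, and round to integers.
    mis = ~obs
    n_mis = int(mis.sum())
    p_mis = 1.0 / (1.0 + np.exp(-(X[mis] @ beta_u)))
    u_draw = rng.random(n_mis) < p_mis
    v_draw = np.exp(X[mis] @ beta_v + sigma * rng.standard_normal(n_mis))
    y_imp = y.copy()
    y_imp[mis] = np.rint(u_draw * v_draw)
    return y_imp

# Toy data (entirely hypothetical): a skewed count with many zeros, 20% missing.
rng = np.random.default_rng(0)
n = 400
x = rng.standard_normal(n)
u_true = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.5 + x)))
v_true = np.exp(1.0 + 0.5 * x + 0.7 * rng.standard_normal(n))
y_full = np.rint(u_true * v_true)
missing = rng.random(n) < 0.2
y_obs = np.where(missing, np.nan, y_full)

y_imp = impute_two_part(x, y_obs, rng)
```

Rounding happens only in the last step, so if you would rather keep fractional imputations you can drop the np.rint call and leave everything else alone.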

3 thoughts on “Imputing count data”

  1. Do you have any particular views/advice for whether fractional or integer values should be preferred, and why?

  2. Isn't that similar to Poisson Regression (which can be modeled using generalized linear model procedure in SAS: PROC GENMOD)?
