Guy asks:
I am analyzing an original survey of farmers in Uganda. I am hoping to use a battery of welfare proxy variables to create a single welfare index using PCA. I have quick question which I hope you can find time to address:
How do you recommend treating count data? (for example # of rooms, # of chickens, # of cows, # of radios)? In my dataset these variables are highly skewed with many responses at zero (which makes taking the natural log problematic). In the case of # of cows or chickens several obs have values in the hundreds.
My response: Here’s what we do in our mi package in R. We split a variable into two parts: an indicator for whether it is positive, and the positive part. That is, y = u*v. Then u is binary and can be modeled using logisitc regression, and v can be modeled on the log scale. At the end you can round to the nearest integer if you want to avoid fractional values.
Do you have any particular views/advice for whether fractional or integer values should be preferred, and why?
Isn't that similar to Poisson Regression (which can be modeled using generalized linear model procedure in SAS: PROC GENMOD)?
I suggest using a zero inflated negative binominal regression model. Best, Regula