How to code and impute income in studies of opinion polls?

Nate Cohn asks:

What’s your preferred way to handle income in a regression when income categories are inconsistent across several combined survey datasets? Am I best off just handling this with multiple categorical variables? Can I safely create a continuous variable?

My reply:

I thought a lot about this issue when writing Red State Blue State. My preferred strategy is to use a variable that we can treat as continuous. For example, when working with ANES data I used income categories 1, 2, 3, 4, 5, which corresponded to the 1-16th, 17-33rd, 34-66th, 67-95th, and 96-100th percentiles. If you have different surveys with different categories, you could use a roughly consistent scaling; for example, one survey might be coded as 1, 3, 5, 7 and another as 2, 4, 6, 8. I expect that other people would disagree with this advice, but this is the sort of thing that I was doing. I’m not so much worried about the scale being imperfect or nonlinear. But if you have a non-monotonic relation, you’ll have to be more careful.
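A minimal sketch of one way to build such a scaling: score each category by the midpoint of the percentile range it covers, so that surveys with different brackets land on a common 0-100 scale. The function names and the four-bracket cut points for the second survey are hypothetical, chosen only for illustration; the ANES cut points follow the percentiles listed above.

```python
def percentile_midpoint_scores(upper_bounds):
    """Score ordered income categories by the midpoint of their percentile range.

    upper_bounds: cumulative upper percentile of each category, ending at 100.
    """
    scores = []
    lower = 0.0
    for upper in upper_bounds:
        scores.append((lower + upper) / 2.0)
        lower = upper
    return scores


def recode(category_codes, scores):
    """Replace 1-based category codes with their common-scale scores."""
    return [scores[c - 1] for c in category_codes]


# ANES-style brackets: categories end at the 16th, 33rd, 66th, 95th, 100th percentiles.
anes_scores = percentile_midpoint_scores([16, 33, 66, 95, 100])

# Hypothetical second survey with four equal-sized brackets (made-up cut points).
other_scores = percentile_midpoint_scores([25, 50, 75, 100])

# Respondents from both surveys now share one roughly continuous income variable.
combined = recode([1, 5, 3], anes_scores) + recode([2, 4], other_scores)
```

The combined variable can then enter a regression as a single predictor. The scale is only approximately comparable across surveys, which matches the spirit of not worrying too much about it being imperfect or nonlinear.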

Cohn responds:

Two other thoughts for consideration:

— I am concerned about non-monotonicity. At least in this compilation of 2020 data, the Democrats do best among rich and poor, and sag in the middle. It seems even more extreme when we get into the highest/lowest income strata, à la ANES. I’m not sure this survives controls (it seems like there’s basically no income effect after controls), but I’m hesitant to squelch a possible non-monotonic effect that I haven’t ruled out.

— I’m also curious for your thoughts on a related case. Suppose that (a) a dataset includes surveys that sometimes asked about income and sometimes did not, (b) we’re interested in many demographic covariates besides income, and (c) we’d otherwise clearly specify the interaction between income and the other variables. The missing income data creates several challenges. What should we do?

I can imagine some hacky solutions to the NA data problem short of outright removing observations (say, set all NA income to 1 and interact our continuous income variable with whether we have actual income data), but if we interact other variables with the NA income data there are lots of cases (say, MRP, where the population strata specify income for the full population, not in proportion to survey coverage) where we’d risk losing much of the power gleaned from other surveys about the other demographic covariates. What should we do here?

My quick recommendation is to fit a model with two stages, first predicting income given your other covariates, then predicting your outcome of interest (issue attitude, vote preference, whatever) given income and the other covariates. You can fit the two models simultaneously in one Stan program. I guess then you will want some continuous coding for income (could be something like sqrt(income) with income topcoded at $300K) along with a possibly non-monotonic model at the second level.
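A rough sketch of the two-stage idea in plain Python, under loud assumptions: it plugs stage-one point predictions into stage two rather than fitting both models jointly (as one Stan program would), so it understates the uncertainty in the imputed incomes. The helper names, the single covariate `educ`, the rescaling to thousands of dollars, the linear outcome model, and the toy data are all hypothetical. The square-root coding with a $300K topcode follows the suggestion above, and the squared income term is one simple way to allow a non-monotonic relation.

```python
import math

TOPCODE = 300_000  # dollars; the topcode suggested in the text


def sqrt_income(dollars, topcode=TOPCODE):
    """Square root of topcoded income, in thousands of dollars (rescaling is an assumption)."""
    return math.sqrt(min(max(dollars, 0.0), topcode) / 1000.0)


def ols(rows, y):
    """Least-squares coefficients via the normal equations (fine for tiny problems)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda m: abs(a[m][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for m in range(k):
            if m != col:
                f = a[m][col] / a[col][col]
                a[m] = [x - f * p for x, p in zip(a[m], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def two_stage_fit(educ, income, outcome):
    """educ: covariate; income: dollars, or None when not asked; outcome: numeric."""
    z = [None if d is None else sqrt_income(d) for d in income]
    # Stage 1: predict transformed income from the covariate,
    # using only respondents whose income was observed.
    obs = [i for i, v in enumerate(z) if v is not None]
    stage1 = ols([[1.0, educ[i]] for i in obs], [z[i] for i in obs])
    z_filled = [stage1[0] + stage1[1] * e if v is None else v
                for e, v in zip(educ, z)]
    # Stage 2: outcome on income, income squared, and the covariate.
    stage2 = ols([[1.0, v, v * v, e] for v, e in zip(z_filled, educ)], outcome)
    return stage1, stage2, z_filled


# Toy data (made up): the last two respondents were not asked about income.
educ = [10, 10, 14, 14, 18, 18, 12, 16]
income = [9_000, 25_000, 36_000, 64_000, 81_000, 121_000, None, None]
outcome = [0.61, 0.65, 0.72, 0.64, 0.65, 0.45, 0.6875, 0.6475]

stage1, stage2, z_filled = two_stage_fit(educ, income, outcome)
```

Here `stage1` holds the intercept and slope used to impute transformed income, and `stage2` holds the intercept, income, income-squared, and covariate coefficients. A joint fit would instead treat the missing incomes as parameters and carry their uncertainty through to the outcome model.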

7 thoughts on “How to code and impute income in studies of opinion polls?”

  1. Joshua:

    Yes, the usual pattern is that the Democrats do better among lower-income voters and the Republicans do better among upper-income voters. We often refer to upper-income voters as “the rich,” but polls don’t get a lot of rich people (perhaps roughly defined as top 1% or top 5% of income) so a lot of this is speculation. There’s this one survey I know about of rich people: https://faculty.wcas.northwestern.edu/jnd260/cab/CAB2012%20-%20Page1.pdf but it’s already over 10 years old.

    • Andrew –

      But I wonder to what extent the dynamics have shifted, particularly in the age of Trump.

      I was surprised at the numbers in the polling that I linked, as it seems the conventional wisdom is that with an exodus of white “working class” (whatever that means these days) and non-4 year college voters, and even a small % of black males, Republicans have more or less become the party of lower income voters. That polling shows it ain’t that simple.

      It seems that it’s (perhaps become more) complicated, with urban versus rural being a huge factor, and gender being an increasingly important factor, mixed in with educational levels and still race/ethnicity playing a role although perhaps to a lesser extent.

      And how much does Trump explain any of the recent changes? And what will happen post-Trump? I pity the election forecaster. (I’m guessing that will include you soon enough?)

      • My understanding is that Trump won people > 100k mostly because of non 4-year college voters making that much. Whereas the more educated (and urban) voters making that much mostly voted Biden. I’m guessing that reflects changes?

  2. It’s a Bayesian blog. What I don’t understand is why not do like you always do when you have an unknown? Make a parameter, use a prior which is that the income is uniform between the limits defined by the category, and then fit a curve, income vs whatever is of interest. This seems to be straight Bayesian modeling.

    • Daniel:

      Yes, I agree the best solution would be to include continuous income as a latent variable and then include it as a parameter in a Stan model. That’s just more work, and often I’d like to just fit a simple regression and go from there. This is the usual motivation for imputation rather than full modeling.

  3. What is the goal/purpose of this?

    Is it to draw some conclusion about the arbitrary coefficient/parameter values?

    Or is it to predict future voting results?

    In the first case, that is just confusion over the meaning of these numbers. Instead of reading tea leaves, switch to some other task.

    In the second case, why are you limiting yourself to a linear regression? Seems like a problem for gradient boosting to me. Regardless, try different approaches and use some balance of time/resource requirements and predictive skill to choose whatever works best.
