Sy Spilerman writes,

I am interested in the effect of log(family wealth) on some dependent variable, but I have negative and zero wealth values. I could add a constant to family wealth so that all values are positive. But I think that families with zero and negative values may behave differently from positive wealth families. Suppose I do the following: Decompose family wealth into three variables: positive wealth, zero wealth, and negative wealth, as follows:

– positive wealth coded as ln(wealth) where family wealth is positive, and 0 otherwise,

– zero wealth coded 1 if the family has zero wealth, 0 otherwise.

– negative wealth coded ln(absolute value of wealth) if family wealth is negative, and 0 otherwise,

and then use this coding as right side variables in a regression. It seems to me that this coding would permit me to obtain the separate effects of these three household statuses on my dependent variable (e.g., educational attainment of offspring). Do you see a problem with this coding? A better suggestion?

My reply:

Yes, you could do it this way. I think then you’d want to include values very close to zero (for example, anything less than $100 or maybe $1000 in absolute value) as zero. But yes, this should work ok. Another option is to just completely discretize it, into 10 categories, say.
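Here is a minimal Python sketch of the three-variable coding, with values near zero folded into the zero category. The function name code_wealth and the zero_band parameter are made up for illustration, and the $1000 cutoff is just one of the values mentioned above, not a recommendation.

```python
import math

def code_wealth(wealth, zero_band=0.0):
    """Return (pos_log, is_zero, neg_log) for one family's wealth.

    zero_band is a hypothetical tolerance: any wealth with
    abs(wealth) <= zero_band is treated as zero wealth.
    """
    if abs(wealth) <= zero_band:
        return (0.0, 1, 0.0)
    if wealth > 0:
        # positive wealth: ln(wealth), other two variables are 0
        return (math.log(wealth), 0, 0.0)
    # negative wealth: ln(|wealth|), other two variables are 0
    return (0.0, 0, math.log(-wealth))

# Three illustrative (invented) families, using a $1000 band around zero.
rows = [code_wealth(w, zero_band=1000.0) for w in (50000.0, 300.0, -2000.0)]
```

With a band of at least $1, no family outside the zero category can have ln(|wealth|) equal to 0, so the zero-wealth indicator is the only thing marking that group.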

Any other suggestions out there? This problem arises occasionally, and I’ve seen some methods that seem very silly to me (for example, adding a constant to all the data and then taking logs). Obviously the best choice of method will depend on details of the application, but it is good to have some general advice too.

Ignoring the endogeneity problem that comes from regressing something like education on something like wealth, you might want to think about some sort of censoring model, like Tobit, Logit, or Heckit. If you want to be agnostic about functional form, it gets a lot harder, though.

A good way is to use the wealth distribution curves. I believe you can find these at the census website. Then divide the wealth distribution curve into bins: for example, the bottom 10% of wealth, the next 10% of wealth, etc. The negative wealth will be included in the bottom decile of the wealth distribution curve. You can make the bin sizes as small as you want. For example, if you want very fine resolution, each bin could be 1% of the wealth distribution curve. Did this help?
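A rough Python sketch of the binning idea. As an assumption, the cut points here are taken from the sample's own quantiles rather than from published census tables; the function name wealth_bins and the sample values are invented for illustration.

```python
import bisect
import statistics

def wealth_bins(wealth, n_bins=10):
    """Assign each wealth value a bin index from 0 (lowest) to n_bins - 1.

    Cut points are the sample quantiles of `wealth` (an assumption;
    external cut points could be substituted for `cuts`).
    """
    cuts = statistics.quantiles(wealth, n=n_bins)  # n_bins - 1 cut points
    return [bisect.bisect_right(cuts, w) for w in wealth]

# Invented sample: negative and zero wealth land in the lowest bins.
sample = [float(w) for w in range(-5, 15)]
bins = wealth_bins(sample)
```

The bin index can then enter the regression as a set of indicator variables, and setting n_bins=100 gives the 1% resolution mentioned above.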

Austin

If using Professor Spilerman's decomposition, I'd add a dummy for negative wealth, so all three groups can have different intercepts.

How was your 'wealth' variable measured? Wealth is often calculated by subtracting total debt from total assets. If this is the case, then you might want to consider breaking wealth down into its constituent elements of assets and debts (assuming they are available). I can see there being a big difference between the lives of someone who has few assets and a little debt and someone who has many assets and lots of debt; if you used only a wealth variable then these two very different cases would be treated as similar. Additionally, as both assets and debts are positive, you won't run into the negative number problem described above.

Noah

If it doesn't have to be exactly the logarithm then, intuitively without much reasoning about the pros and cons, I would propose that "f(x) := log(1/(1-x)) for all x = 10 it gets pretty close to |log(x)| itself, and if most of your |x|'s are smaller than that you could always scale them before applying f, plus f happens to be symmetric around 0, which might be desirable.

I would guess that the reason you'd like to use log(w) is the expectation that if there is a 10% increase in wealth, you would see a 10% increase in some variable c, inspiring you to fit a linear relation between log(w) and log(c). But although such a log-linear relation might be reasonable for values of w above some threshold w_0, it obviously is unreasonable (indeed, undefined) for values of w less than or equal to 0. One fix would be to hypothesize that log(c) depends linearly on w for w less than or equal to w_0, and linearly on log(w) for w greater than w_0. This would imply using the transformation h(w), where h(w) = log(w) for w > w_0, and h(w) = [log(w_0) – 1] + (w/w_0) for w less than or equal to w_0.
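The transformation h(w) above can be sketched in a few lines of Python. The threshold value used in the example (w_0 = 1000) is an arbitrary assumption, not something proposed in the comment.

```python
import math

def h(w, w0):
    """Log above the threshold w0, linear at or below it.

    The linear piece [log(w0) - 1] + w/w0 matches log(w) in both
    value and slope (1/w0) at w = w0, so h is smooth there.
    """
    if w0 <= 0:
        raise ValueError("w0 must be positive")
    if w > w0:
        return math.log(w)
    return (math.log(w0) - 1.0) + w / w0

# Example with an assumed threshold of 1000: defined for any real w.
values = [h(w, 1000.0) for w in (-5000.0, 0.0, 1000.0, 50000.0)]
```

Because the linear piece is defined for every real w, zero and negative wealth need no special handling, and the choice of w_0 decides where the log-linear relation is assumed to take over.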

Ummm, doesn't this beg the question of why, if you have a variable that can take on negative or zero values, you're trying to log it in the first place? What kind of model exactly are you estimating? Presumably the regression equation is derived from some theoretical equation (unless you're doing reduced form), and if that theory says the variable can be negative or zero then you just can't take logs and that's it. If you're trying to linearize some multiplicative relationship to use linear regression then, well, you can't – you're estimating something else. In that case you should do some kernel-type nonlinear regression, as one of the commentators suggested (bins and all that stuff, it's been a while).