Why use count-data models (and don’t talk to me about BLUE)

Someone who wishes to remain anonymous writes,

I have a question for you. Count models are to be used when you have outcome variables that are non-negative integers. Yet, OLS is BLUE even with count data, right? I don’t think we make any assumptions with least squares about the nature of the data, only about the sampling, exogenous regressors, rank K, etc. So why technically should we use count if OLS is BLUE even with count data?

My reply:

1. I don’t really care if something is BLUE (best linear unbiased estimator) because (a) why privilege linearity, and (b) unbiasedness ain’t so great either. See Bayesian Data Analysis for discussion of this point (look at “unbiased” in the index), also this paper for a more recent view.

2. Least squares is fine with count data, and it’s even usually ok with binary data. (This is commonly known, and I’m sure it’s been written in various places but I don’t happen to know where.) For prediction, though, you probably want something that predicts on the scale of the data, which would mean discrete predictions for count data. Also, a logarithmic link makes sense in a lot of applications (that is, log E(y) is a linear function of x), and you can’t take the log of 0, which is a good reason to use a count data model.
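To make the contrast concrete, here is a rough sketch (simulated data, numpy only, all numbers illustrative) fitting both OLS and a Poisson regression with a log link to the same count outcome. The Poisson fit is done by Newton–Raphson rather than any packaged routine, just to keep the example self-contained; note that the Poisson predictions are positive by construction, while OLS fitted values need not be.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Simulated truth: log E(y) = 0.5 + 0.8 * x
y = rng.poisson(np.exp(0.5 + 0.8 * x))

# OLS: closed form
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Poisson regression with log link, fit by Newton-Raphson
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)            # E(y | x) under the log link
    grad = X.T @ (y - mu)            # score of the Poisson log-likelihood
    hess = X.T @ (X * mu[:, None])   # information matrix
    beta = beta + np.linalg.solve(hess, grad)

print("OLS coefficients:    ", beta_ols)
print("Poisson coefficients:", beta)
# Poisson fitted values exp(X @ beta) are always positive;
# OLS fitted values X @ beta_ols can go negative.
print("min OLS fitted value:", (X @ beta_ols).min())
```

Both fits recover an upward relationship, but only the count model respects the scale of the data.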

7 thoughts on “Why use count-data models (and don’t talk to me about BLUE)”

  1. I can't remember exactly which paper it is in, but Heckman showed that you can confidently use OLS for binary data, as long as the predicted values of the dependent variable are bounded by 0 and 1 for all potential values of the IVs.

  2. OK, maybe you can do so, but why would you want to?

    I mean, if logistic regression didn't exist, or wasn't implemented in your statistical software, or something, I could see that it would be good to know when you could use OLS for binary data.

    But, given that logistic does exist and is implemented in every package, why use OLS?

    Also, how would you go about demonstrating that the predicted value is bounded between 0 and 1? Even if it were bounded for current data, how would you know about future data that might come in?
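The bounds check the comment asks about is at least easy to carry out in-sample. A hypothetical sketch (simulated binary data, numpy only): fit the linear probability model by OLS and count how many fitted values escape [0, 1].

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p = 1.0 / (1.0 + np.exp(-(0.2 + 1.5 * x)))   # true success probability
y = rng.binomial(1, p)

# Linear probability model: OLS on the 0/1 outcome
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta_ols

# How many in-sample predictions fall outside [0, 1]?
n_out = np.sum((fitted < 0) | (fitted > 1))
print(f"{n_out} of {n} OLS fitted probabilities fall outside [0, 1]")
```

Of course, as the comment notes, an in-sample check says nothing about future covariate values, which is part of the case for using a model whose predictions are bounded by construction.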

  3. There would also be a loss of efficiency in not using the appropriate model, so I can't see where "best" comes from.

    While looking for something else I found a reference to Cox and Snell, "The Analysis of Binary Data" 2nd edition p 14 "… the use of a model, the nature of whose limitations can be foreseen, is not wise, except for very restricted purposes."

  4. In reply to mrjuak's question about "why use OLS?": I am not aware of any compelling model-specification statistics for logistic models, analogous to the LM or likelihood-ratio tests. The Hosmer-Lemeshow test is a conventional recommendation, but it seems to be without much theoretical foundation. I'd welcome any recommendations on ways to evaluate logistic models.

    I should say that, as mrjuak suggests, I do use logistic regression for binary data, but have struggled to justify my selection of a particular logistic model from its competitors.

  5. Andrew: On your second point, the issue here is not so much prediction as it is efficiency. OLS is BLUE (kind of, see next paragraph) but it is no longer best (as it does not take into account the special characteristics of the variance), so we are better off using count models.

    Another important issue that your correspondent needs to take into account is inference: with a binary dependent variable the errors are no longer homoskedastic, so appropriate heteroskedasticity robust standard errors need to be computed.
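The robust-standard-error point can be sketched directly. Below is a minimal illustration (simulated binary outcome, numpy only) of OLS with HC0 (White) sandwich standard errors, which is one standard way to get heteroskedasticity-consistent inference for the linear probability model.

```python
import numpy as np

def ols_robust_se(X, y):
    """OLS point estimates with HC0 (White) sandwich standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # 'Meat' of the sandwich: sum_i e_i^2 * x_i x_i'
    meat = X.T @ (X * (resid**2)[:, None])
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p = 1.0 / (1.0 + np.exp(-(0.3 + 1.0 * x)))
# Binary outcome: var(y | x) = p(1 - p), so the errors are heteroskedastic
y = rng.binomial(1, p).astype(float)

beta, se = ols_robust_se(X, y)
print("coefficients:", beta)
print("robust SEs:  ", se)
```

The same sandwich construction applies to count outcomes regressed by OLS, since their conditional variance also depends on the mean.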

    On your first point, I agree that unbiasedness is over-rated. Linearity, however, has a few things going for it. First of all, you can always use quadratics, logs, etc. of variables, so it is not as restrictive as it appears at first. Secondly, it's usually not a bad starting point, and it offers some protection against data mining – but I guess this is a long discussion for some other time, especially given that data mining can be a good thing to do.

    Ken: 'There would also be a loss of efficiency in not using the appropriate model, so I can't see where "best" comes from.':

    The phrase here is 'best linear' rather than 'best'. So OLS is not best overall, but it is best in the class of linear estimators.

    And some pedantry: 'efficiency' is binary. Either an estimator is efficient, or it's not. You can even stretch the concept to talk about relative efficiency, as in 'estimator A is efficient relative to estimator B'. But you can't have a 'loss of efficiency'.

  6. Hello,
    I wonder whether anybody has any sense of how to deal with binary outcomes, fixed effects, and large-scale datasets (millions of observations). A logistic regression would take a long time to converge, and I am not even sure it will.
    When the dataset is this big, are OLS and logistic really different? Is anybody aware of any literature that has looked at this?
    Thanks
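On the convergence worry: with a modest number of predictors, plain Newton-Raphson for logistic regression typically converges in well under a dozen iterations even with millions of observations, since each iteration is a single pass over the data. A rough sketch on one million simulated observations (numpy only; the fixed-effects part of the question is not addressed here):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x)))).astype(float)

# Plain Newton-Raphson for logistic regression
beta = np.zeros(2)
for it in range(20):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    grad = X.T @ (y - p)                          # score
    hess = X.T @ (X * (p * (1 - p))[:, None])     # information matrix
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-8:
        break

print(f"converged after {it + 1} iterations: beta =", beta)
```

Many fixed effects change the picture (the Hessian becomes large and sparse), but the sample size alone is not the bottleneck.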

  7. Let's say that I fit a probit model in which the response variable is vote choice and the covariates are age, education, and income. What happens to the parameter estimates and standard errors if the data are heteroskedastic? How could we address problems resulting from heteroskedasticity?
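One way to build intuition for this question is by simulation. In a probit, heteroskedasticity in the latent errors does not just affect standard errors: the coefficient estimates themselves become inconsistent, because the model divides through by an error scale assumed constant. A hypothetical sketch (simulated data, one covariate rather than the age/education/income setup; probit fit by Fisher scoring, using scipy only for the normal CDF):

```python
import numpy as np
from scipy.special import ndtr   # standard normal CDF

def probit_fit(X, y, iters=50):
    """Probit MLE via Fisher scoring."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        Phi = np.clip(ndtr(eta), 1e-10, 1 - 1e-10)
        phi = np.exp(-0.5 * eta**2) / np.sqrt(2 * np.pi)
        score = X.T @ (phi * (y - Phi) / (Phi * (1 - Phi)))
        info = X.T @ (X * (phi**2 / (Phi * (1 - Phi)))[:, None])
        beta = beta + np.linalg.solve(info, score)
    return beta

rng = np.random.default_rng(4)
n = 50_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Homoskedastic latent errors: probit should recover the true slope (1.0)
y_hom = (1.0 * x + rng.normal(size=n) > 0).astype(float)
# Heteroskedastic latent errors: the error sd grows with x
y_het = (1.0 * x + np.exp(0.5 * x) * rng.normal(size=n) > 0).astype(float)

b_hom = probit_fit(X, y_hom)
b_het = probit_fit(X, y_het)
print("homoskedastic fit:  ", b_hom)
print("heteroskedastic fit:", b_het)
```

The misspecified fit shows an attenuated slope. Robust (sandwich) standard errors give honest uncertainty for this pseudo-true parameter, but fixing the point estimates requires modeling the variance, e.g. a heteroskedastic probit.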

Comments are closed.