Logistic regression and glm defaults

When you run glm in R, the default is linear regression. To do logistic regression, you need a slightly awkward syntax, e.g.,

fit.1 <- glm (vote ~ income, family=binomial(link="logit"))

This isn’t so bad, but it’s a bit of a pain when doing routine logistic regressions. We should alter glm() so that it first parses the y-variable and then, by default, chooses logit if it’s binary, linear if it’s continuous, overdispersed Poisson if it’s nonnegative integers with more than two categories, and maybe ordered logit for some other cases. If you don’t like defaults, you can always specify the model explicitly (as you currently must do anyway).
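A minimal sketch of what such a default-choosing wrapper might look like (hypothetical code, not the actual arm/bayesglm implementation; assumes a numeric response):

```r
# Hypothetical sketch of the proposed defaults: inspect the response
# values and pick a glm family accordingly.
guess_family <- function(y) {
  vals <- unique(y[!is.na(y)])
  if (all(vals %in% c(0, 1))) {
    binomial(link = "logit")      # binary -> logit
  } else if (all(vals >= 0 & vals == round(vals))) {
    quasipoisson(link = "log")    # nonnegative integers -> overdispersed Poisson
  } else {
    gaussian()                    # continuous -> linear regression
  }
}

# A wrapper (the name bglm is the one suggested above) that applies the
# guessed family unless the user specifies one explicitly via ...
bglm <- function(formula, data, ...) {
  y <- model.response(model.frame(formula, data))
  glm(formula, family = guess_family(y), data = data, ...)
}
```

The explicit call is still available: bglm(vote ~ income, data = d, family = gaussian()) would error on the duplicate family argument, so a real implementation would need to check whether the user supplied one.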

Aliasing the standard glm() function is awkward, so we’ll start by putting these defaults into bayesglm() (which I think we should also call bglm() for ease of typing) in the “arm” package in R. The categorization of the variables fits in well with what we’re doing in the “mi” package for missing-data imputation.

9 thoughts on “Logistic regression and glm defaults”

  1. "logit" is the default link function for the binomial family; you don't need to specify it explicitly in the glm call, i.e.,

    glm(vote ~ income, family=binomial)
    will work as well. I wrap glm with "bglm" and "pglm" in my utility library, so I agree that having the shortcuts is nice. But I wonder whether the code would run into situations where it had to guess between a gaussian family and a poisson family, which could confuse an unsuspecting user.

  2. The alternative approach (which Mplus, e.g., uses) is to choose the analysis based on the variable type. So, if the outcome is a factor, it chooses logistic regression. Mplus has more types (categorical [for ordered], nominal, count, inflated count, survival, and continuous).
    I like the Mplus way more, but maybe that's because I'm used to it.

  3. Jeremy,

    Yes, exactly, that's what we want to do. We already have variable types in mi and autograph. My only quibble with the Mplus categories is that, for Poisson regression, I want overdispersed (with an offset) as the default.
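    A rough sketch of the Mplus-style alternative, dispatching on the declared type of the outcome rather than its observed values, with quasipoisson() as the overdispersed default mentioned above (hypothetical code; the function name is made up):

```r
# Hypothetical type-based dispatch in the Mplus style: the model follows
# from the class of the outcome variable, not from the data values.
fit_by_type <- function(formula, data, ...) {
  y <- model.response(model.frame(formula, data))
  if (is.factor(y) && nlevels(y) == 2) {
    # binary factor -> logistic regression
    glm(formula, family = binomial(link = "logit"), data = data, ...)
  } else if (is.integer(y)) {
    # count variable -> overdispersed Poisson, per the reply above
    glm(formula, family = quasipoisson(link = "log"), data = data, ...)
  } else {
    # everything else -> linear regression
    glm(formula, family = gaussian(), data = data, ...)
  }
}
```

With this scheme a typo that turns a numeric column into a factor changes the declared type, which is exactly the silent-change worry raised in the next comment.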

  4. There is something to be said for being explicit – the advantage of the more verbose style is that someone reading your code can immediately tell what type of model has been fitted.

    Also, changes to your data (e.g. a typo turns a numeric into a factor) could produce silent changes in model behaviour, instead of throwing an error.

  5. Hadley,

    Yes, being explicit is why we kept things as they were when writing the book. But I do find it a bit awkward to type all those things to do logistic regression. In particular, I almost always do logit with 0/1 data (rather than with "y out of n" data), hence the "binomial" family is not a natural way for me to think of it. Also the R syntax is awkward, with binomial not in quotes but "logit" in quotes.

    Regarding your worry about typos: the solution here is to have the full call displayed in the display of the model. For example, supposing y is a vector of 0's and 1's:

    <pre>
    > M1 = glm (y ~ x)
    > display (M1)
    glm(formula = y ~ x, family = binomial(link = "logit"))
    coef.est coef.se
    (Intercept) 5.02 9.21
    x 7.48 12.30
    n = 5, k = 2
    residual deviance = 3.5, null deviance = 6.7 (difference = 3.3)
    </pre>

    So, if something went wrong, you'd see it at the top of the display.

  6. I think this isn't a good idea. The problem is that the model used will vary depending on the particular data, making any automatic use of these functions that doesn't explicitly state the model unreliable. And of course, out of laziness, many people won't explicitly state the model, even if they should.

    This is an existing problem in R. Someone thought that automatically determining which way a sequence goes based on the upper and lower bounds was a good idea. So 1:0 produces the backwards sequence 1 0. Consequently, huge numbers of R programs have subtle bugs, in which 1:n does the wrong thing when n is 0.

    It's tempting to change the language to make typing in ad hoc commands easier, but in R the same commands are used in programs, where consistent behaviour is essential to avoid buggy software.
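    The 1:n pitfall described above can be seen directly; seq_len() is the standard defensive idiom (example added for illustration, not part of the original comment):

```r
n <- 0

# 1:n counts *down* when n is 0, so a loop over 1:n runs twice
# instead of zero times -- the subtle bug described above.
1:n          # 1 0

# seq_len(n) returns an empty sequence for n = 0, so a loop over it
# correctly runs zero times.
seq_len(n)   # integer(0)
```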

  7. Radford,

    The alternative to a good default is not no default; it's a bad default. In this case, the default for glm is to use the Gaussian model. I understand your worry, but I think it is somewhat fixed by the first line in the display, which shows the model that was actually fit.

  8. The problem is that glm is not only used by people, who then look at the output, but also inside functions, where a model is fitted, the coefficients used for something (e.g., in a simulation study), and nobody ever looks at what would have been printed if print(summary(m)) had been typed. If the default depends on the data, this will result, for instance, in simulation studies of Poisson models in which 1 time in 1000 the data happens to consist only of 0s and 1s, and a logistic model is fit instead…

  9. Radford,

    Yes, I see your point. If the function is being called in a loop like that, it would be better to explicitly specify the model. The defaults could indeed screw things up invisibly in your simulation study example!

    Perhaps we should just have a simple alias function called logistic() which runs glm() with the logistic default. Or an option "family=logistic" which would be an alias for the existing 'family=binomial(link="logit")'.
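    The suggested alias could be a one-line wrapper along these lines (a sketch; logistic() is not an existing arm function):

```r
# Hypothetical alias: logistic() always fits a logit model, so the
# model type is fixed regardless of what the data happen to look like.
logistic <- function(formula, data, ...) {
  glm(formula, family = binomial(link = "logit"), data = data, ...)
}
```

    Because the family is hard-coded, this avoids the simulation-study failure mode above while still saving the typing.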

    Also, I never use print(summary(model)) anymore. I only use display(model).
