Whassup with glm()?

We’re having problem with starting values in glm(). A very simple logistic regression with just an intercept with a very simple starting value (beta=5) blows up.

Here’s the R code:

> y <- rep (c(1,0),c(10,5))
> glm (y ~ 1, family=binomial(link="logit"))

Call:  glm(formula = y ~ 1, family = binomial(link = "logit"))


Degrees of Freedom: 14 Total (i.e. Null);  14 Residual
Null Deviance:      19.1 
Residual Deviance: 19.1         AIC: 21.1 
> glm (y ~ 1, family=binomial(link="logit"), start=2)

Call:  glm(formula = y ~ 1, family = binomial(link = "logit"), start = 2)


Degrees of Freedom: 14 Total (i.e. Null);  14 Residual
Null Deviance:      19.1 
Residual Deviance: 19.1         AIC: 21.1 
> glm (y ~ 1, family=binomial(link="logit"), start=5)

Call:  glm(formula = y ~ 1, family = binomial(link = "logit"), start = 5)


Degrees of Freedom: 14 Total (i.e. Null);  14 Residual
Null Deviance:      19.1 
Residual Deviance: 360.4        AIC: 362.4 
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

What’s going on?

P.S. Just to be clear: my problem is not with the “fitted probabilities numerically 0 or 1 occurred” thing. I don’t care about warning messages. My problem is that when I start with a not-ridiculous starting value of 5, that glm does not converge to the correct estimate of 0.6931, instead it blows up. It shouldn’t blow up!

29 thoughts on “Whassup with glm()?

  1. The following just means that you have a binary variable that perfectly predicts your response variable:

    fitted probabilities numerically 0 or 1 occurred

  2. Kieran:

    No. See P.S. above. I think it's something with the linear approximation that causes the estimates to blow up. In any case, it's annoying.

  3. The help function for glm has a reference to MASS which explains why you get the 'fitted probabilities numerically 0 or 1 occurred' and why that in turn gives infinite estimates. The section is titled Problems with binomial GLMs.

  4. I did debug(glm.fit) and yes, in the first IRLS step the weights are all very small because of the large initial eta, and as a result it hops out to a huge value (-44) on the first jump and then to 3*10^15. The underlying taylor approximation blew up.

    5 is an extreme value for a log-odds. pr{1}=0.9933071 on 15 data points.

  5. Using the extra parameter control=list(maxit=1)) shows that the first step of the algorithm goes to -44.xxx, and the second step blows up the rest of the way.

    It seems to blow up with a start of around 2.7 or greater, or -1.9 or less.

    This looks like a frequent Newton problem that can occur when no test is done to see if an actual improvement in the objective was made after each step; the Newton step is way too big, albeit in the right direction, and things go downhill from there (ha ha.) This can happen when the derivative of the objective w.r.t. the parameter is very small, but for these parameter values? I'd have to go through the IRLS steps by hand to see what actually happens… maybe do a little rewrite of the glm.fit code to print out the intermediate information. I know that for the canonical link IRLS=Newton and log-concavity means that it should always converge, and I can't see why such small parameter values would be bad starting locations with a concave log likelihood.

  6. Maybe I'm missing something, but it looks like 5 is fairly far from the optimum (it implies a predicted probability of 0.99) and in a part of the parameter space that is locally linearly flat. I'm not as sure about IWLS algorithms, but with EM, I often find that starting the algorithm "close" to boundary (such as 0.99) will run toward the boundary instead of away, due to numerical errors. It is terribly annoying. Any reason to not start at the coefficients from a linear model?

  7. Maybe the Newton step is overshooting the maximum? The log-likelihood is concave, but you are trying to find the root of its derivative, which is neither concave nor convex.

  8. The fortran code probably has the answer, but when I run this with some data with the glm.control(trace=TRUE) I see that the deviance doesn't change from iteration 2 to 3.

    Deviance = 8001.691 Iterations – 1
    Deviance = 4469.413 Iterations – 2
    Deviance = 4469.413 Iterations – 3

    For some reason the coefs blow up (had to add a cat() statement to glm.fit):

    Coefs = -47.90097 Iterations – 1
    Coefs = 2.889593e+15 Iterations – 2
    Coefs = 1.275586e+15 Iterations – 3

  9. Ryan:

    Sure, 5 is far from the correct estimate. But it's not extreme! A computer should be able to handle estimates of 0.99!


    Sure, in any given case this isn't a problem but we are calling glm within a larger program, and we'd rather not have to worry about this sort of numerical instability occurring. Logistic regression is a simple enough problem that I think it's reasonable to have a program that converges from any numerically stable starting point.

    John, FG:

    Yes, that's what I suspected was happening. I suppose I could add a step to glm.fit() to check if the loglikelihood decreases and, if so, reduce the step size. I was just surprised this was happening at all; I'd just always assumed that glm() was a stable function.

    Maybe R should have something better than glm() as its default for fitting logistic regressions?

  10. Sounds like a bug to be fixed in glm.fit, then (or perhaps glm should default to more sensible starting values like the lm estimates as suggested above, perhaps with a scaling factor for different links). But in the general case there's no way to guarantee that you won't explore the wrong part of the likelihood even for "simple" problems like logistic regression (at least with higher dimensions), so if you give glm – or any ML technique – dumb starting values you'll likely get dumb behavior as a result.

  11. Chris:

    I don't know if it's a bug or a problem with the algorithm. But I don't completely agree with your conclusion. First, a starting point of 5 (or, for that matter, 2.7) is not so far off. Second, this problem is rare–I've been running glm() for a long time and had never encountered it before. If we can fix glm() and make the problem even more rare, that would be a useful contribution. Even if numerical instability will always be with us, it is good to reduce its frequency.

  12. It's just the convergence domain of the Fisher scoring algorithm. If x is the current estimate, each step is:

    g = 10 – 15*exp(x)/(1+exp(x))
    f = (15*exp(x))/(1+exp(x))^2
    xnew = x + g*f^-0.5
    x = xnew

    If a poor starting value is chosen the suggested step doesn't result in an improvement in the log-likelihood.

    Maybe glm should allow an option of a line search within the Fisher scoring algorithm so the next step is

    xnew = x + d*g*f^-0.5

    where d is the value in (0,1) that gives the best improvement in likelihood, then at least convergence is guaranteed for concave likelihoods. Obviously this is completely pointless for this one dimensional example since it would require optimize() which could be just used on the likelihood itself.

  13. Is this a reflection of the same problem underlying the poor performance of logit for rare events? If so, it shouldn't happen if you try Zelig's relogit instead of GLM.

  14. Michael:

    No, in this case there are 10 successes and 5 failures. Not a rare-event situation at all! You can also get the problem in other settings, for example 200 success and 200 failures. The algorithm just has problems with some starting points.

  15. " there are 10 successes and 5 failures. Not a rare-event situation at all.."

    But you are examining a starting value that would only be near the true value for a very rare event. I have almost no recollection of what the underlying math is here, but my thought was that if it had something to do with the likelihood at values near the parameter boundary it could become manifest in this way as well.

  16. Andrew, I agree that this is an unfortunate problem with glm() and it would be nice if someone fixed it. But that doesn't need to happen in order to resolve your issue, does it? I suspect you can specify a start value of ln(r/(1.001-r)) where r is the observed success rate, and this should be close enough that the instability never matters. Presumably this is something like glm's default start value (I haven't checked), so perhaps you can just avoid specifying the start value and all will be well.

    What I'm saying is: why not always start at something close to ln(r/(1-r))? I agree you shouldn't have to, but what's the harm?

  17. Why do you think this is a bug in glm() or glm.fit(). It found perfectly good starting values and converged to the correct solution Andrew was expecting when left to it's own devices. It is "blowing up" when provided with starting values that *it* considers sub-optimal (or can't deal with) but Andrew and others feel are not overly extreme.

  18. Phil:

    Yes, there's an easy solution to this particular problem but what was happening was that we were calling glm.fit() within a larger program in which we were using earlier estimates as starting values. We indeed found a work-around in this case but I thought that others might want to know about this problem that happened to us.


    I don't think it's a bug in glm.fit() so much as a hole in the algorithm. You put "blowing up" in quotes but the estimate really did blow up. As discussed in comments above, an algorithm that comes to a good solution from a good starting point is fine, but it's even better if an algorithm can come to a good solution from a bad starting point. Cos sometimes these things happen.

    We encountered this algorithmic instability in glm() as part of a different project we were doing, and now I think I want to put a small amount of effort into making glm.fit() more stable so that this doesn't happen again, or at least happens more rarely. Or maybe the best option is to replace glm.fit() by some other function; I don't know what's out there.

  19. FWIW, doesn't happen in Stata's default:

    . set obs 15
    obs was 0, now 15

    . gen y = 0

    . replace y = 1 if _n chi2 = .
    Log likelihood = -9.5477125 Pseudo R2 = -0.0000

    y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
    _cons | .6931471 .5477226 1.27 0.206 -.3803693 1.766664

  20. Anyone wanting to run Cyrus's code for themselves in Stata should replace

    replace y = 1 if _n


    replace y = 1 if _n > 5

  21. Try starting value of -2 (corresponding to 11% success probability) and things get even more worrysome.
    You get a warning, but also an estimate that does not look fully pathological.
    > glm (y ~ 1, family=binomial(link="logit"), start=-2)

    Call: glm(formula = y ~ 1, family = binomial(link = "logit"), start = -2)


    Degrees of Freedom: 14 Total (i.e. Null); 14 Residual
    Null Deviance: 19.1
    Residual Deviance: 360.4 AIC: 362.4
    Warning message:
    glm.fit: fitted probabilities numerically 0 or 1 occurred

  22. Right, that was clipped for some reason when I pasted it in.

    Using even crazier starting values doesn't cause Stata's logit to fail either. It might add another step or two, but no problems. The point in showing it is that we're seeing an algorithmic pathology in glm() that others don't have.

  23. From the stata manual:

    Commands use the
    Newton-Raphson method with step halving and special fixups when they
    encounter nonconcave regions of the likelihood. For details, see [M-5]
    moptimize and [M-5] optimize.

    glm.fit is always taking the full newton step. You could very easily modify it to take a fractional step, or as in bfgs do a few points of line search. Seems like stata's optimizer actually checks that the likelihood is behaving as expected.

Comments are closed.