# Whassup with glm()?

We’re having problem with starting values in glm(). A very simple logistic regression with just an intercept with a very simple starting value (beta=5) blows up.

Here’s the R code:

```> y <- rep (c(1,0),c(10,5))
> glm (y ~ 1, family=binomial(link="logit"))

Call:  glm(formula = y ~ 1, family = binomial(link = "logit"))

Coefficients:
(Intercept)
0.6931

Degrees of Freedom: 14 Total (i.e. Null);  14 Residual
Null Deviance:      19.1
Residual Deviance: 19.1         AIC: 21.1
> glm (y ~ 1, family=binomial(link="logit"), start=2)

Call:  glm(formula = y ~ 1, family = binomial(link = "logit"), start = 2)

Coefficients:
(Intercept)
0.6931

Degrees of Freedom: 14 Total (i.e. Null);  14 Residual
Null Deviance:      19.1
Residual Deviance: 19.1         AIC: 21.1
> glm (y ~ 1, family=binomial(link="logit"), start=5)

Call:  glm(formula = y ~ 1, family = binomial(link = "logit"), start = 5)

Coefficients:
(Intercept)
1.501e+15

Degrees of Freedom: 14 Total (i.e. Null);  14 Residual
Null Deviance:      19.1
Residual Deviance: 360.4        AIC: 362.4
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
```

What’s going on?

P.S. Just to be clear: my problem is not with the “fitted probabilities numerically 0 or 1 occurred” thing. I don’t care about warning messages. My problem is that when I start with a not-ridiculous starting value of 5, that glm does not converge to the correct estimate of 0.6931, instead it blows up. It shouldn’t blow up!

## 29 thoughts on “Whassup with glm()?”

1. Greg on said:

The following just means that you have a binary variable that perfectly predicts your response variable:

fitted probabilities numerically 0 or 1 occurred

2. Greg on said:

Guess this is one of those situations where I read too quickly.

3. Maybe there is a "fitted 0 or 1" condition that gets hit from the start of the fit based on your data that is causing this to stop looking for a solution.

4. Kieran on said:

Is this your problem?

5. Andrew Gelman on said:

Kieran:

No. See P.S. above. I think it's something with the linear approximation that causes the estimates to blow up. In any case, it's annoying.

6. Kieran on said:

Duh. I need to learn to read.

7. Andrew Gelman on said:

Ryan:

But 5 is not an extreme value. It can't be blowing up on the first step.

8. The help function for glm has a reference to MASS which explains why you get the 'fitted probabilities numerically 0 or 1 occurred' and why that in turn gives infinite estimates. The section is titled Problems with binomial GLMs.

9. C Ryan King on said:

I did debug(glm.fit) and yes, in the first IRLS step the weights are all very small because of the large initial eta, and as a result it hops out to a huge value (-44) on the first jump and then to 3*10^15. The underlying taylor approximation blew up.

5 is an extreme value for a log-odds. pr{1}=0.9933071 on 15 data points.

10. john on said:

Using the extra parameter control=list(maxit=1)) shows that the first step of the algorithm goes to -44.xxx, and the second step blows up the rest of the way.

It seems to blow up with a start of around 2.7 or greater, or -1.9 or less.

This looks like a frequent Newton problem that can occur when no test is done to see if an actual improvement in the objective was made after each step; the Newton step is way too big, albeit in the right direction, and things go downhill from there (ha ha.) This can happen when the derivative of the objective w.r.t. the parameter is very small, but for these parameter values? I'd have to go through the IRLS steps by hand to see what actually happens… maybe do a little rewrite of the glm.fit code to print out the intermediate information. I know that for the canonical link IRLS=Newton and log-concavity means that it should always converge, and I can't see why such small parameter values would be bad starting locations with a concave log likelihood.

11. Maybe I'm missing something, but it looks like 5 is fairly far from the optimum (it implies a predicted probability of 0.99) and in a part of the parameter space that is locally linearly flat. I'm not as sure about IWLS algorithms, but with EM, I often find that starting the algorithm "close" to boundary (such as 0.99) will run toward the boundary instead of away, due to numerical errors. It is terribly annoying. Any reason to not start at the coefficients from a linear model?

12. Manoel Galdino on said:

No Idea, but here it blows up with 2.7 as a starting value (R 2.13.0)

13. FG on said:

Maybe the Newton step is overshooting the maximum? The log-likelihood is concave, but you are trying to find the root of its derivative, which is neither concave nor convex.

14. The fortran code probably has the answer, but when I run this with some data with the glm.control(trace=TRUE) I see that the deviance doesn't change from iteration 2 to 3.

Deviance = 8001.691 Iterations – 1
Deviance = 4469.413 Iterations – 2
Deviance = 4469.413 Iterations – 3

For some reason the coefs blow up (had to add a cat() statement to glm.fit):

Coefs = -47.90097 Iterations – 1
Coefs = 2.889593e+15 Iterations – 2
Coefs = 1.275586e+15 Iterations – 3

15. Andrew Gelman on said:

Ryan:

Sure, 5 is far from the correct estimate. But it's not extreme! A computer should be able to handle estimates of 0.99!

Matt:

Sure, in any given case this isn't a problem but we are calling glm within a larger program, and we'd rather not have to worry about this sort of numerical instability occurring. Logistic regression is a simple enough problem that I think it's reasonable to have a program that converges from any numerically stable starting point.

John, FG:

Yes, that's what I suspected was happening. I suppose I could add a step to glm.fit() to check if the loglikelihood decreases and, if so, reduce the step size. I was just surprised this was happening at all; I'd just always assumed that glm() was a stable function.

Maybe R should have something better than glm() as its default for fitting logistic regressions?

16. Sounds like a bug to be fixed in glm.fit, then (or perhaps glm should default to more sensible starting values like the lm estimates as suggested above, perhaps with a scaling factor for different links). But in the general case there's no way to guarantee that you won't explore the wrong part of the likelihood even for "simple" problems like logistic regression (at least with higher dimensions), so if you give glm – or any ML technique – dumb starting values you'll likely get dumb behavior as a result.

17. Andrew Gelman on said:

Chris:

I don't know if it's a bug or a problem with the algorithm. But I don't completely agree with your conclusion. First, a starting point of 5 (or, for that matter, 2.7) is not so far off. Second, this problem is rare–I've been running glm() for a long time and had never encountered it before. If we can fix glm() and make the problem even more rare, that would be a useful contribution. Even if numerical instability will always be with us, it is good to reduce its frequency.

18. aztek on said:

It's just the convergence domain of the Fisher scoring algorithm. If x is the current estimate, each step is:

g = 10 – 15*exp(x)/(1+exp(x))
f = (15*exp(x))/(1+exp(x))^2
xnew = x + g*f^-0.5
x = xnew

If a poor starting value is chosen the suggested step doesn't result in an improvement in the log-likelihood.

Maybe glm should allow an option of a line search within the Fisher scoring algorithm so the next step is

xnew = x + d*g*f^-0.5

where d is the value in (0,1) that gives the best improvement in likelihood, then at least convergence is guaranteed for concave likelihoods. Obviously this is completely pointless for this one dimensional example since it would require optimize() which could be just used on the likelihood itself.

19. Michael on said:

Is this a reflection of the same problem underlying the poor performance of logit for rare events? If so, it shouldn't happen if you try Zelig's relogit instead of GLM.

20. Andrew Gelman on said:

Michael:

No, in this case there are 10 successes and 5 failures. Not a rare-event situation at all! You can also get the problem in other settings, for example 200 success and 200 failures. The algorithm just has problems with some starting points.

21. Michael on said:

" there are 10 successes and 5 failures. Not a rare-event situation at all.."

But you are examining a starting value that would only be near the true value for a very rare event. I have almost no recollection of what the underlying math is here, but my thought was that if it had something to do with the likelihood at values near the parameter boundary it could become manifest in this way as well.

22. Phil on said:

Andrew, I agree that this is an unfortunate problem with glm() and it would be nice if someone fixed it. But that doesn't need to happen in order to resolve your issue, does it? I suspect you can specify a start value of ln(r/(1.001-r)) where r is the observed success rate, and this should be close enough that the instability never matters. Presumably this is something like glm's default start value (I haven't checked), so perhaps you can just avoid specifying the start value and all will be well.

What I'm saying is: why not always start at something close to ln(r/(1-r))? I agree you shouldn't have to, but what's the harm?

23. Gavin Simpson on said:

Why do you think this is a bug in glm() or glm.fit(). It found perfectly good starting values and converged to the correct solution Andrew was expecting when left to it's own devices. It is "blowing up" when provided with starting values that *it* considers sub-optimal (or can't deal with) but Andrew and others feel are not overly extreme.

24. Andrew Gelman on said:

Phil:

Yes, there's an easy solution to this particular problem but what was happening was that we were calling glm.fit() within a larger program in which we were using earlier estimates as starting values. We indeed found a work-around in this case but I thought that others might want to know about this problem that happened to us.

Gavin:

I don't think it's a bug in glm.fit() so much as a hole in the algorithm. You put "blowing up" in quotes but the estimate really did blow up. As discussed in comments above, an algorithm that comes to a good solution from a good starting point is fine, but it's even better if an algorithm can come to a good solution from a bad starting point. Cos sometimes these things happen.

We encountered this algorithmic instability in glm() as part of a different project we were doing, and now I think I want to put a small amount of effort into making glm.fit() more stable so that this doesn't happen again, or at least happens more rarely. Or maybe the best option is to replace glm.fit() by some other function; I don't know what's out there.

25. FWIW, doesn't happen in Stata's default:

<pre>
. set obs 15
obs was 0, now 15

. gen y = 0

. replace y = 1 if _n chi2 = .
Log likelihood = -9.5477125 Pseudo R2 = -0.0000

——————————————————————————
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
————-+—————————————————————-
_cons | .6931471 .5477226 1.27 0.206 -.3803693 1.766664
——————————————————————————
</pre>

26. Nick Cox on said:

Anyone wanting to run Cyrus's code for themselves in Stata should replace

replace y = 1 if _n

by

replace y = 1 if _n > 5

27. Marcel Zwahlen on said:

Try starting value of -2 (corresponding to 11% success probability) and things get even more worrysome.
You get a warning, but also an estimate that does not look fully pathological.
> glm (y ~ 1, family=binomial(link="logit"), start=-2)

Call: glm(formula = y ~ 1, family = binomial(link = "logit"), start = -2)

Coefficients:
(Intercept)
71.63

Degrees of Freedom: 14 Total (i.e. Null); 14 Residual
Null Deviance: 19.1
Residual Deviance: 360.4 AIC: 362.4
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

28. Cyrus on said:

Right, that was clipped for some reason when I pasted it in.

Using even crazier starting values doesn't cause Stata's logit to fail either. It might add another step or two, but no problems. The point in showing it is that we're seeing an algorithmic pathology in glm() that others don't have.

29. C Ryan King on said:

From the stata manual:

Commands use the
Newton-Raphson method with step halving and special fixups when they
encounter nonconcave regions of the likelihood. For details, see [M-5]
moptimize and [M-5] optimize.

glm.fit is always taking the full newton step. You could very easily modify it to take a fractional step, or as in bfgs do a few points of line search. Seems like stata's optimizer actually checks that the likelihood is behaving as expected.