# Computational problems with glm etc.

John Mount provides some useful background and follow-up on our discussion from last year on computational instability of the usual logistic regression solver.

Just to refresh your memory, here’s a simple logistic regression with only a constant term and no separation, nothing pathological at all:

```> y <- rep (c(1,0),c(10,5)) > display (glm (y ~ 1, family=binomial(link="logit"))) glm(formula = y ~ 1, family = binomial(link = "logit")) coef.est coef.se (Intercept) 0.69 0.55 --- n = 15, k = 1 residual deviance = 19.1, null deviance = 19.1 (difference = 0.0)```

And here’s what happens when we give it the not-outrageous starting value of -2:

```> display (glm (y ~ 1, family=binomial(link="logit"), start=-2)) glm(formula = y ~ 1, family = binomial(link = "logit"), start = -2) coef.est coef.se (Intercept) 71.97 17327434.18 --- n = 15, k = 1 residual deviance = 360.4, null deviance = 19.1 (difference = -341.3) Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred```

As I discussed in the comments to my original post, there were reasons we were running glm with prespecified starting points—this blow-up occurred originally in a real problem and then I stripped it down to get to this simple and clear example.

Mount explains what’s going on:

From a theoretical point of view the logistic generalized linear model is an easy problem to solve. The quantity being optimized (deviance or perplexity) is log-concave. This in turn implies there is a unique global maximum and no local maxima to get trapped in. Gradients always suggest improving directions. However, the standard methods of solving the logistic generalized linear model are the Newton-Raphson method or the closely related iteratively reweighted least squares method. And these methods, while typically very fast, do not guarantee convergence in all conditions. . . .

The problem is fixable, because optimizing logistic divergence or perplexity is a very nice optimization problem (log-concave). But most common statistical packages do not invest effort in this situation.

Mount points out that, in addition to patches which will redirect exploding Newton steps, “many other optimization techniques can be used”:

– EM (see “Direct calculation of the information matrix via the EM.” D Oakes, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1999 vol. 61 (2) pp. 479-482).
Or you can try to solve a different, but related, problem: “Exact logistic regression: theory and examples”, C R CR Mehta and N R NR Patel, Statist Med, 1995 vol. 14 (19) pp. 2143-2160.

There’s also the new algorithm of Polson and Scott, which looks pretty amazing, I should really try it out on some examples and report back to all of you on this.

Maybe we should also change what we write in Bayesian Data Analysis about how to fit a logistic regression.

## 5 thoughts on “Computational problems with glm etc.”

1. Along with EM algorithms there are the more general MM algorithms. Essentially for logistic regression you can substitute the usual weights muhat(1-muhat) with anything that is strictly larger (in this sense you don’t need to view Polson and Scott as EM at all – at least in the case of logistic regression – which kind of nice since the “missing data” interpretation is a little strained in the context of their DA scheme). Another option is to replace muhat(1-muhat) with 1/4, which means you don’t have to invert X’WX at every step.

http://sites.stat.psu.edu/~dhunter/papers/mmtutorial.pdf is a nice intro and works through logistic regression in 6.2.

• Jared, thanks for the MM reference. Not only does it fix up the Newton-Raphson steps, it supplies the theoretical tools to answer the last question I had (can the first Newton step increase, the answer is no because at zero the logistic model is majorized by its own truncated Taylor series). I have summarized a bit of the MM tutorial here: http://www.win-vector.com/blog/2012/10/rudie-cant-fail-if-majorized/

• Glad it was helpful! Your blog post is very nice (doesn’t link back here though!). Your point toward the end is a good one; if your majorizing function is to do something simple like “clip” the N-R weights you ought to be able to get something close to the locally quadratic behavior while guarding against some instability (MM, like EM, tends to converge at a roughly linear rate near a solution).

2. Hi Andrew,

Didn’t want to clog your inbox, but Nick and I are happy to send along the code for Table 2 in the paper you linked.

3. Extremely useful blog post!

A lazy newbie question follows: as I understand the problem, if the glm function in R would allow some options to be passed to the underlying optimizer, it would be more useful in these kinds of cases. Already, giving the user a chance to choose the optimization algorithm and the stepsize would seem to do a lot. But this isn’t possible at the moment, no?