## Putting together multinomial discrete regressions by combining simple logits

When predicting 0/1 data we can use logit (or probit or robit or some other robust model such as invlogit (0.01 + 0.98*X*beta)). Logit is simple enough and we can use bayesglm to regularize and avoid the problem of separation.

What if there are more than 2 categories? If they’re ordered (1, 2, 3, etc), we can do ordered logit (and use bayespolr() to avoid separation). If the categories are unordered (vanilla, chocolate, strawberry), there are unordered multinomial logit and probit models out there.

But it’s not so easy to fit these multinomial model in a multilevel setting (with coefficients that vary by group), especially if the computation is embedded in an iterative routine such as mi where you have real time constraints at each step.

So this got me wondering whether we could kluge it with logits. Here’s the basic idea (in the ordered and unordered forms):

– If you have a variable that goes 1, 2, 3, etc., set up a series of logits: 1 vs. 2,3,…; 2 vs. 3,…; and so forth. Fit each one with bayesglm (or whatever). The usual ordered logit is a special case of this model in which the coefficients (except for the constant term) are the same for each model. (At least, I’m guessing that’s what’s happening; if not, there are some equivalently-dimensioned constraints.) The simple alternative has no constraints. intermediate versions could link the models with soft constraints, some prior distribution on the coefficients. (This would need some hyperparameters but these could be estimated too if the whole model is being fit in some iterative context.)

– If you have a vanilla-chocolate-strawberry variable, do the same thing; just order the categories first, either in some reasonable way based on substantive information or else using some automatic rule such as putting the categories in decreasing order of frequency in the data. In any case, you’d first predict the probability of being in category 1, then the probability of being in 2 (given that you’re not in 1), then the probability of 3 (given not 1 or 2), and so forth.

Depending on your control over the problem, you could choose how to model the variables. For example, in a political survey with some missing ethnicity responses, you might model that variable as ordered: white/other/hispanic/black. In other contexts you might go unordered.

I recognized that my patchwork of logits is a bit of a hack, but I like its flexibility, as well as the computational simplicity of building it out of simple logits. Maybe there’s some literature on this (perhaps explaining why it’s a bad idea)?? I’d appreciate any comments or suggestions.

1. K? O'Rourke says:

Assume you are aware of the continuation ratio label and literature on this type of model for ordered responses.

K?

2. K? O'Rourke says:

Assume you are aware of the continuation ratio label and literature on this way of approaching ordered responses.

K?

3. Kaiser says:

I believe many practitioners are not impressed by the predictive performance of multinomial logit type models. It's a case of yes you can compute it but try not to look too closely at the results.
Many people then do something similar to what you're suggesting. Turn the multinomial response into a set of binary responses, build a logit model for each response, and predict by maximizing odds across models. The difference is that we don't impose arbitrary order on nominal data (and we ignore the interdependence between the models).
This works reasonably for your chocolate-strawberry-vanilla example. But here's another twist: what if there is a hate-ice-cream category? I'm not convinced that hate-ice-cream should be treated the same way as chocolate-strawberry-vanilla. The problem is even more so if the distribution is highly skewed to hate-ice-cream.
A further twist is counterfactuals. It's easier if everyone is given a choice of choc-straw-van-nothing. What if individuals are only exposed to one or a subset of those choices?

4. MGrote says:

We (Anthropology-related analysts) need practical ways to fit hierarchical multinomial models in order to deal with time-allocation datasets (counts of activity types– like "harvesting" and "child care"– of persons within families). The limitations of existing routines (Stata-GLLAMM, SAS) seem severe– very long computing times for even 3-4 activity codes, and little potential for modeling coordination of work within households. The models seem dauntingly parameter-rich and hard to interpret. But dodgy short-cut approaches don't appeal either.

An unpublished technical report on MCMC methods by Chib et al available at:

is interesting. I talked to Chib about this during a visit– he views the hierarchical multinomial problem as quite hard. Because multinomial counts are dependent themselves, it may be hard to learn about how one family member's activities depend on another member's activities.

5. bill raynor says:

Hastie and Tibshirani published this and there is the IRLS (iteratively reweighted least squares) algorithm in GLIM. Tibshirani also considers it obliquely in a JStatSoft paper on the Lasso

6. Bob Carpenter says:

Reducing categorical classifiers with more than two categories to binary is a running theme in the machine-learning literature. I'm guessing that's because SVMs are intrinsically binary.

John Langford et al. did a tutorial for the 2009 ICML:

and John maintains a whole page of references, including to the ordinal case:

7. ecs says:

HLM6 runs multilevel multinomial logistic regression, and graphs the predicted probabilities for them, but does not output these numbers (as far as I can tell, if you have any insights, please do share!), though using the excel version of spost (Scott Long's program for Stata) one can manually calculate the predicted probabilities.

8. Mike Collins says:

As Bob says, does seem related to work in the machine learning
literature — in particular work on ECOC (error correcting output codes)
for multiclass classification, eg

Reducing Multiclass to Binary:
A Unifying Approach for Margin Classiﬁers

Allwein, Schapire Singer, JMLR

Solving Multiclass Learning Problems via
Error

9. Maarten Buis says:

I am not sure all commenters are talking about the same model, so I first define the way I understand the term multinomial logit or probit model: it is a model for a set of mutually exclusive and exhaustive categories. So to take the vanilla, chocolate, strawberry example, these will be the only flavors available and an observation can fall in only one of these categories. I guess religion (including an "other" and a "none" category) would be easier to fit within that scheme.

In that case the outcome is a set of probabilities that for each observations has to add up to 1. Separate logit models will only constrain the individual probabilities to remain between 0 and 1 but have no way of enforcing the adding up to 1 constraint. In practice the predicted probabilities often add up to something close enough to 1 as long as you remain within the range of the data. That is no longer true as soon as you start extrapolating.

So if you are just describing the data, this trick should work OK. I'd be more careful if the aim is to extrapolate to other situations.

10. Maarten Buis says:

The default way of fitting hierarchical multinomial models in Stata is to use -gllamm- ( http://www.gllamm.org ). This can indeed be very slow (e.g. look for the entry for gllamm in http://www.stata.com/statalist/archive/2011-01/ms… ).

There are however faster alternatives, see:

Peter Haan and Arne Uhlendorff (2006) "Estimation of multinomial logit models with unobserved heterogeneity using maximum simulated likelihood" The Stata Journal, 6(2): 229-245.
http://www.stata-journal.com/article.html?article

11. Kaiser says:

Maarten: good point. To clarify, a typical business problem is which of a set of ads should be shown in order to maximize clicks on the ads. For each user, a decision has to be made by selecting one of the ads so we are more interested in ranking the options than in deriving probabilities. Your point about total probability is valid, and is a reason why I hope there is a better solution. A more serious problem is that those who clicked on Ad x will surely have been shown Ad x while those who didn't click could have been shown any of the Ads.
But in my experience, a multinomial model which does not have the probability issue does worse in prediction.

12. Andrew Gelman says:

Maarten:

The model you describe (three comments up from this one) is not what I'm proposing. If there are 4 options, you're saying to model A vs BCD, B vs ACD, C vs ABD, D vs ABC. Then, as you say, the probabilities won't necessarily add to 1. I'm saying to model A vs BCD, B vs CD, C vs D. Then the probabilities automatically work out.

The obvious issues with my algorithm are (a) I treat A,B,C,D asymmetrically, and (b) my model has lots of parameters that can be difficult to estimate (especially in the C vs D model, where there might not be a lot of data remaining).

I'm not quite sure what to do about (a). Maybe as Kaiser suggests it's not such a problem, if we don't actually want to model the original categories symmetrically anyway. For (b), I'm thinking that an appropriate prior distribution will regularize.

13. K? O'Rourke says:

If you conceptualized A vs BCD, B vs CD, C vs D as three studies to be meta-analysed

You might find some ideas here http://biostatistics.oxfordjournals.org/content/2

(replace quality score with some other feature you might think has an impact to shrink to)

K?

14. Maarten Buis says:

I see what you mean. This model is known under a variety of names. I like "sequential logit model" (Tutz 1991). Other names are "sequential response model" (maddala 1983), "continuation ratio logit" (Agresti 2002), "model for nested dichotomies" (fox 1997), and the "Mare model" (after (Mare 1981)).

I like it for modeling process that can be thought of as a sequence of nested decisions.

Warning: shameless self-promotion starts here:

Within this model you can link the effects of variables during the process and the effects of variables on the final outcome. Within my dissertation I used this model to look at the effects of parental background on making different transitions and how this relates to the effect of parental background on the highest achieved level of education.
http://www.maartenbuis.nl/dissertation/chap_6.pdf

There is an influential critique on this model by Cameron and Heckman (1998). I like (obviously) my own description of it ( http://www.maartenbuis.nl/dissertation/chap_7.pdf ). There will also be a special issue on this critique in Research in Social Stratification and Mobility ( http://www.elsevier.com/wps/find/journaldescripti… ).

References

Agresti, Alan 2002, Categorical Data Analysis, 2nd edition. Hoboken, NJ: Wiley-Interscience.

Cameron, Stephen V. and James J. Heckman, 1998, Life Cycle Schooling and Dynamic
Selection Bias: Models and Evidence for Five Cohorts of American Males. The Journal of Political Economy 106:262–333.

Fox, John 1997, Applied Regression Analysis, Linear Models, and Related Methods. Thousand Oaks: Sage.

Maddala, G.S. 1983, Limited Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.

Mare, Robert D. 1981, Change and Stability in educational Stratification, American Sociological Review, 46(1): 72-87.

Tutz, Gerhard 1991, Sequential models in categorical regression, Computational Statistics & Data Analysis, 11(3): 275-295.

15. bill raynor says:

If you "explode" the problem into all possible pairs (A vs B, A vs C, A vs D, B vs. C, B vs D, C vs D) and model the pairs using a Bradley-Terry model, you get reasonable estimates of the logits for each item in the choice set. The number of parameters stays small and allows you to rank the choices and generate probabilities for arbitrary choice sets. The variances have to be adjusted, of course. This is commonly used to handle ranked items (instead of an iterated Plackett-Luce approach).

I think of it in the context of U-statistics, where you generate an estimate by averaging over all admissible tuples. Since the Bradley Terry is intrinsically linear, the tuple here is a choice between items in a pair.

16. Andrew Gelman says:

Bill:

Sure, that's the multinomial logit that I mentioned at the beginning of my blog above. The problem is that the software I have does not easily allow me to estimate this model in a multilevel setting. Also, as noted by some of the commenters above, this model makes some pretty strong assumptions, so it could be useful to have a Bayesian generalization that does soft constraints.

17. Bob Carpenter says:

You could use the Bradley-Terry type model even if you don't have all possible pairs. You could also use it if you have more than one classifier for a pair.

The most direct application would quantize the pairwise decisions into 0/1 outcomes. But you could also try to model the actual classifier predictions directly. This'd be easiest for one kind of classifier, but it probably wouldn't be too difficult to add predictors for the items and/or base classifiers.

18. K? O'Rourke says:

Maarten: My recollection was Stephen Fineberg for continuation ratio label.

Believe that is confirmed here,though Stigler's law always haunts one on these priority claims ;-)

K?

19. Maarten Buis says:

Thanks K?, that sounds plausible. I remember Rob Mare quoting Fineberg as a source for what in my sub-sub-discipline later became known as the "Mare-model"…