Jonathan Hughes writes:

I am an engineering doctoral student. As part of my dissertation I’m proposing a mode of adaptation for a predictive system to subgroup-specific streams of data, each of which comes from a specific subgroup of a mixture population distribution. During the proposal presentation someone referenced your work and believed that you may have addressed the problem described below. I have read many of your academic writings, but I don’t know whether that is the case, and I haven’t been able to find it.

I will explain the problem briefly:

Let M_p be a logistic regression model that assumes a single homogeneous population, logit(pi) = beta + beta_1*x + noise, but suppose there are latent subgroups in the population with varying distributions (but the same in form), i.e., the true case is modeled by

M_s := logit(pi) = beta_pop + beta_subgroup*indicator + beta_1*x + beta_1_subgroup*indicator*x + noise;

What is the expected gain in ROC area under the curve (AUC) from including the subgroup information? That is, what is E[AUC(M_s) – AUC(M_p)] under some reasonable assumptions? I would like to incorporate some theoretical results about this, either those I have been deriving myself or others’ with priority.

My reply: This looks like a varying-intercept, varying-slope logistic regression of the sort that is described in various places including my book with Jennifer Hill, with the twist that the groups are unknown. I have no results on area under the curve or the ROC curve more generally, so I suggest you explore this using fake-data simulation. For your data at hand, you can evaluate how much gain you’re getting by using leave-one-out cross-validation.
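A fake-data simulation along these lines might look like the sketch below. It is only an illustration under assumptions that are not in the question: the coefficient values, the standard normal x, the 50/50 subgroup split, and the function names `auc` and `simulate` are all made up, and the "noise" term is taken to be nothing more than the Bernoulli sampling of the outcome. Two shortcuts avoid fitting anything. A logistic regression on x alone ranks cases by x, so the AUC of M_p is just the AUC of x (or one minus that, if the fitted slope is negative). For M_s the sketch scores cases with the true linear predictor, assuming the subgroup indicator is observed, so the reported gain is an upper bound on what a fitted model would achieve.

```python
import math
import random


def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate(n=20000, seed=1, beta_pop=-0.5, beta_subgroup=1.5,
             beta_1=1.0, beta_1_subgroup=-0.8, p_group=0.5):
    """Simulate from M_s and return (AUC of M_p ranking, AUC of oracle M_s).

    All default parameter values are hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    x, eta, y = [], [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)
        gi = 1 if rng.random() < p_group else 0  # subgroup indicator
        e = (beta_pop + beta_subgroup * gi
             + beta_1 * xi + beta_1_subgroup * gi * xi)
        p = 1.0 / (1.0 + math.exp(-e))
        x.append(xi)
        eta.append(e)
        y.append(1 if rng.random() < p else 0)
    # M_p: any fitted one-predictor logistic regression orders cases by x.
    a = auc(x, y)
    auc_pooled = max(a, 1.0 - a)
    # M_s: oracle score using the true coefficients and the true indicator.
    auc_subgroup = auc(eta, y)
    return auc_pooled, auc_subgroup


auc_pooled, auc_subgroup = simulate()
print("AUC, pooled model M_p:    ", round(auc_pooled, 3))
print("AUC, subgroup model M_s:  ", round(auc_subgroup, 3))
print("gain from subgroup info:  ", round(auc_subgroup - auc_pooled, 3))
```

Repeating `simulate` over many seeds and over a grid of coefficient values would give a simulation-based picture of E[AUC(M_s) – AUC(M_p)]. With latent rather than observed groups, the indicator would have to be estimated first, and the gain would presumably be smaller.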

This reminds me of a finite mixture model, in which case a GMM approach is the following: Fox, Kim, Ryan, and Bajari (2011), http://fox.web.rice.edu/published-papers/fox-kim-ryan-bajari-qe.pdf

No idea about theoretical results though.

I’ve thought about theoretical results, but I think you’d have to make too many assumptions for them to be at all useful. I think of AUC as a measure of how much two distributions overlap, namely the score distributions of the two outcome classes. If there is no overlap then AUC is 1, and if there is total overlap AUC is .5. So the first assumption you’d have to make is the form of both distributions. Suppose they are normal(0,1) and normal(1,1), for instance; then you’d get an AUC of ~.76. Better models decrease the standard deviations or increase the difference in means of the distributions, so you’d have to assume how much better your new data would be. So you have to 1) assume distributions for the classes and 2) have an expected increase in predictive power from the new features. But there’s no theoretical way of determining either without trying it out…
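For the two-normal case in this comment the AUC has a closed form. If the scores of the two classes are independent normals, the AUC is the probability that a positive-class score exceeds a negative-class score, which is Phi((mu_pos - mu_neg) / sqrt(sd_neg^2 + sd_pos^2)). The sketch below just evaluates that formula; the function names are made up for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binormal_auc(mu_neg, sd_neg, mu_pos, sd_pos):
    """AUC when class scores are independent normals.

    AUC = P(S_pos > S_neg); the difference S_pos - S_neg is normal with
    mean mu_pos - mu_neg and variance sd_neg**2 + sd_pos**2.
    """
    return normal_cdf((mu_pos - mu_neg) / math.sqrt(sd_neg ** 2 + sd_pos ** 2))


auc_example = binormal_auc(0.0, 1.0, 1.0, 1.0)    # normal(0,1) vs normal(1,1)
auc_wider_gap = binormal_auc(0.0, 1.0, 2.0, 1.0)  # means further apart
auc_tighter = binormal_auc(0.0, 0.5, 1.0, 0.5)    # smaller standard deviations

print(round(auc_example, 3), round(auc_wider_gap, 3), round(auc_tighter, 3))
```

The normal(0,1) versus normal(1,1) case comes out close to 0.76, and either pulling the means apart or shrinking the standard deviations raises the AUC. Under this assumption, a gain in AUC is a statement about how much the standardized separation between the classes improves.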

They might consider a latent-class analysis to identify the groups and follow that with LR on the outcome of interest.