Skip to content

Discussion on overfitting in cluster analysis

Ben Bolker wrote:

It would be fantastic if you could suggest one or two starting points for the idea that/explanation why BIC should naturally fail to identify the number of clusters correctly in the cluster-analysis context.

Bob Carpenter elaborated:

Ben is finding that using BIC to select number of mixture components is selecting too many components given the biological knowledge of what’s going on. These seem to be reasonable mixture models like HMMs for bison movement with states corresponding to transiting and foraging and resting, with the data (distance moved and turning angle) being clearly multimodal.

First (this is more to Christian): Is this to be expected if the models are misspecified and the data’s relatively small?

Second (this one more to Andrew): What do you recommend doing in terms of modeling? The ecologists are already on your page w.r.t. adding predictors (climate, time of day or year) and general hierarchical models over individuals in a population.

Number of components isn’t something we can put a prior on in Stan other than by having something like many mixture components with asymmetric priors or by faking up a Dirichlet process a la some of the BUGS examples. I’ve seen some work on mixtures of mixtures which looks cool, and gets to Andrew’s model expansion inclinations, but it’d be highly compute intensive.

X replied:

Gilles Celeux has been working for many years on the comparison between AIC, BIC and other-ICs for mixtures and other latent class models. Here is one talk he gave on the topic. With the message that BIC works reasonably well for density estimation but not for estimating the number of clusters. Here is also his most popular paper on such information criteria, including ICL.

I am rather agnostic on the use of such information criteria as they faiL to account for prior information or prior opinion on what’s making two components distinct rather than identical. In that sense I feel like the problem is non-identifiable. If components are not distinguishable in some respect. And as a density estimation problem, the main drawback in having many components is an increased variability. This is not a Bayesian/frequentist debate, unless prior inputs can make components make sense. And prior modelling fights against over-fitting by picking priors on the weight near zero (in the Rousseau-Mengersen 2012 sense).

And then I wrote:

I think BIC is fundamentally different from AIC, WAIC, LOO, etc, in that all those other things are estimates of out-of-sample prediction error, while BIC is some weird thing that under certain ridiculous special cases corresponds to an approximation to the log marginal probability.

Just to continue along these lines: I think it makes more sense to speak of “choosing” the number of clusters or “setting” the number of clusters, not “estimating” the number of clusters, because the number of clusters is not in general a Platonic parameter that it would make sense to speak of estimating. I think this comment is similar to what X is saying, just in slightly different language (although both in English, pas en français).

To put it another way, what does it mean to say “too many components given the biological knowledge of what’s going on”? This depends on how “component” is defined. I don’t mean this as a picky comment: I think this is fundamental to the question. To move the discussion to an area I know more about: suppose we want to characterize voters. We could have 4 categories: Dem, Rep, Ind, Other. We could break this down more, place voters on multiple dimensions, maybe identify 12 or 15 different sorts of voters. Ultimately, though, we’re each individuals, so we could define 300 million clusters, one for each American. It seems to me that the statement “too many components” has to be defined with respect to what you will be doing with the components. To put it another way: what’s the cost to including “too many” components? Is the cost that estimates will be too noisy? If so, there is some interaction between #components and the prior being used on the parameters: one might have a prior that works well for 4 or 5 components but not so well when there are 20 or 25 components.

Actually, I can see some merit to the argument that there can just about never be more than 4 or 5 clusters, ever. My argument goes like this: if you’re talking “clusters” you’re talking about a fundamentally discrete process. But once you have more than 4 or 5, you can’t really have real discreteness; instead things slide into a continuous model.

OK, back to the practical question. Here I like the idea of using LOO (or WAIC) in that I understand what it’s doing: it’s an estimate of out-of-sample prediction error, and I can take that for what it is.

To get to the modeling question: if Ben is comfortable with a model with, say, between 3 and 6 clusters, then I think he should just fit a model with 6 clusters. Just include all 6 and let some of them be superfluous if that’s what the model and data want. One way to keep the fitting under control is to regularize a bit by putting strong priors on the weights on the mixtures, so that mixture components 1, 2, etc, are large in expectation large, and later components are smaller. You can do this with an informative Dirichlet prior on the vector of lambda parameters. I’ve never tried this but it seems to me like it could work.

Also–and I assume this is being already but I’ll mention just in case–don’t forget to put informative priors for the parameters in each mixture component. I don’t know the details of this particular model, but, just for example, if we are fitting a mixture of normals, it’s important to constrain the variances of the normals because the model will blow up with infinite likleihood at points where any variance equals zero. The constraint can be “soft,” for example lognormal priors on scale parameters, or a hierarchical prior on the sale parameters with a proper prior on how much they vary. The same principle applies to other sorts of mixture models.

And Aki added:

If the whole distribution is multimodal it is easier to identify the number of modes and say that these correspond to clusters. Even if we have “true” clusters, but they are overlapping so that there are no separate modes, the number of clusters is not well identified unless we have lot of information about the shape of each cluster. Example: using mixture of Gaussians to fit Student t data -> when n->infty, the number of components (clusters) goes to infty. Depennding on the amount of model misspecification and separability of clusters we may not be able to identify the number of clusters no matter which criteria we use. In simulated examples with true small number of clusters, use of criteria which favors small number of clusters is likely to perform well (LOO (or WAIC) is likely to favor more clusters than marginal likelihood, BIC or WBIC). In Andrew’s voters example, and in many medical examples I’ve seen, there are no clear clusters as the variation between individuals is mostly continuous or discrete in high dimensions.


  1. Years ago when I worked in speech recognition I encountered the same issue with HMMs using observation densities that were a mixture of Gaussians. The problem stems from the fact that the observations (feature vectors obtained by some signal processing of the acoustic signal) are not actually independent conditional on the Markov chain state, as assumed by the model. Treating them as conditionally independent exaggerates the strength of the evidence provided by the observations.

  2. Anonymous says:

    re: putting a prior on the number of mixture components (especially if controlling the number of components is a concern)

    i don’t know if these models are implementable in Stan, but I think some of them are already implemented in Matlab

    • Dirichlet processes (DP) require varying numbers of parameters per iteration, whereas Stan requires a fixed set of parameters. What you can do in Stan is use a Dirichlet with a large K and small prior count (much less than 1). We haven’t really tried that anywhere other than in replicating the BUGS example for testing. If you choose a large K, it’s kind of like doing what Andrew is suggesting if someone has a fixed upper bound on the possible number of clusters in mind—just use that many clusters.

      There are lots of DP examples implemented in different packages—NIPS is full of them. I don’t know how well they work. In a lot of the mixture models where I’ve seen them applied, typically generalizations of LDA or HMMs, you can’t do full Bayesian inference anyway, because you can’t explore enough of the multimodal posterior. So you’re going to run into a mode anyway, so there’s no way you’re actually going to be able to infer the “correct” number of clusters. The same thing holds for a large-K Dirichlet approximation to a DP

  3. Christian Hennig says:

    Not sure who’s the Christian Bob Carpenter is referring to?

    Anyway, here are some comments:
    1) Let’s say we assume a Gaussian mixture model. There is some theory that suggests that the BIC can estimate the number of mixture components consistently. (Actually this theory comes with *very* restrictive conditions, see Keribin (Sankhya, 2000), but let’s just assume it holds in a more general setting than it was prved for.) Another thing is that pretty much every distribution can be approximated arbitrarily well by a mixture of (usually very many) Gaussians. This means that if we’re in a situation in which there are, say, k clearly separated “clusters” in the dataset and the distributions within clusters are not exactly Gaussian, the BIC, for large datasets, will fit *each* cluster by a mixture of several Gaussians, because this will approximate the distribution better than using single ones. This is particularluy a problem for *large* datasets. It means that if you identify the estimated number of clusters with the estimated number of Gaussian mixture components, the BIC is bad *because* it’s consistent. For small datasets the problem isn’t as bad because the penalisation in the BIC will stop you from fitting too many mixture components on too small data subsets (as long as you avoid “spurious clusters”, that is).
    Same arguments should hold for mixtures of most other distribution families.

    2) This doesn’t mean that the BIC doesn’t do what it’s supposed to do (pick a mixture that approximates the data distribution well); it rather means that the identification of clusters with mixture components is problematic. You can easily construct a mixture of two or more Gaussians that looks like a single “cluster”.

    3) All this depends on your definition what a “cluster” actually is (the most intuitive one when looking at Euclidean data seems to be to identify clusters with clear density modes, but this isn’t without problems). There is no unique definition, and even the same dataset allows for different definitions that may be useful in different applications. One example is that in some applications large within-cluster distances should not occur, which often is in direct conflict with the density mode concept of clustering; observations that are very far from each other can easily belong to the same density mode in the sense of hill-climbing.

    4) To me approaches are suspicious that try to solve the problem by still fitting the same kind of mixture model and identifying the number of mixture components with the number of clusters, only using some kind of device that artificially keeps the number of cluster down (like a prior or a tougher penalty such as ICL’s one). Although these can give satisfactory solutions in many situations, they avoid to address the issue that the problem is not really that the BIC gives us too many mixture components, but rather that the “clusters” look different from single mixture components (as far as we want “clusters” to be more flexible than strictly Gaussian).

    Here are some related things I wrote:
    Methods for merging Gaussian mixture components (i.e., to get mixtures of mixtures so that the resulting mixtures can be interpreted as “clusters”; note that there’s a similar paper by Baudry, Raftery, Celeux, Gottardi and Lo at about the same time, but I think mine has more discussion of the underlying conceptual issues)

    What are the true clusters?

    Clustering Strategy and Method Selection
    This is also a chapter in the “Handbook of Cluster Analysis”.

    I agree broadly with Andrew that estimating the “number of clusters” is not really a well defined problem (as I make clear in the cited papers), although it can be defined in a way that makes it manageable. The thing is that this then becomes a more specialised problem that will not cover all kinds of ideas that people have what a “cluster” should be. So the key thing is that a restrictive and application-adapted definition of a cluster has to be *decided* (this cannot be estimated from data) before the problem can be satisfactorily solved.

Leave a Reply