When is there “hidden structure in data” to be discovered?

Michael Collins sent along the following announcement for a talk:

Fast learning algorithms for discovering the hidden structure in data

Daniel Hsu, Microsoft Research

11am, Wednesday April 10th, Interschool lab, 7th floor CEPSR, Columbia University

A major challenge in machine learning is to reliably and automatically discover hidden structure in data with minimal human intervention. For instance, one may be interested in understanding the stratification of a population into subgroups, the thematic make-up of a collection of documents, or the dynamical process governing a complex time series. Many of the core statistical estimation problems for these applications are, in general, provably intractable for both computational and statistical reasons; and therefore progress is made by shifting the focus to realistic instances that rule out the intractable cases. In this talk, I’ll describe a general computational approach for correctly estimating a wide class of statistical models, including Gaussian mixture models, Hidden Markov models, Latent Dirichlet Allocation, Probabilistic Context Free Grammars, and several more. The key idea is to exploit the structure of low-order correlations that is present in high-dimensional data. The scope of the new approach extends beyond the purview of previous algorithms; and it leads to both new theoretical guarantees for unsupervised machine learning, as well as fast and practical algorithms for large-scale data analysis.

I won’t have a chance to attend, but I had the following question or thought (beyond the question of why he capitalized the words Hidden, Latent, Allocation, Probabilistic, Context, Free, and Grammars, but not mixture, models, algorithm, or data analysis). My thought had to do with the potential application of these ideas to political science. The key seems to be in the title: “the hidden structure in data.”

Some problems don’t have much hidden structure: for example, if you try to model vote choice or public opinion given individual and group-level variables, you get pretty much what you’d expect: liberals mostly vote for Democrats, conservatives mostly vote for Republicans, demographic variables such as age, education, marital status, income, and religion predict in the way you might imagine, and so forth. And these predictors are related to each other in various expected ways. So I don’t think there’s any structure of low-order correlations to be found. There are some subtleties in the data (for example, the notorious interaction between individual and state income in predicting vote), but it’s hard to imagine that the methods described above would be good tools for studying these patterns.

But other data in political science are indeed characterized by low-dimensional structure. The most prominent example is legislative roll-call voting, but there should be lots of other examples. For instance, anything having to do with international relations or trade will be structured by the patterns of connections between countries. Or personal networks: for example, a study of covert terrorist organizations using external data, or a model of hidden links between financial organizations.
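
To give a sense of what that low-dimensional structure looks like, here's a toy roll-call example (simulated data, nothing from Hsu's work): legislators with one-dimensional ideal points produce an approximately low-rank vote matrix, and even a plain SVD picks up the hidden dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n_legislators, n_votes = 100, 400

ideal_points = rng.normal(size=n_legislators)        # hidden 1-D positions
cutpoints = rng.normal(size=n_votes)                 # one cutpoint per bill
logits = np.subtract.outer(ideal_points, cutpoints)  # shape (100, 400)
p_yes = 1.0 / (1.0 + np.exp(-2.0 * logits))
votes = (rng.random(p_yes.shape) < p_yes).astype(float)

# Center the vote matrix and look at its spectrum: one dimension dominates.
centered = votes - votes.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
print("top 5 singular values:", np.round(s[:5], 1))

# The first left singular vector should track the hidden ideal points.
r = np.corrcoef(u[:, 0], ideal_points)[0, 1]
print("correlation with true ideal points:", round(abs(r), 2))
```

With real roll calls you would use an ideal-point model rather than a raw SVD, but that low-rank signal is what such models are exploiting.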

I wonder if the methods discussed by Hsu in his talk come with guidelines for when they can most effectively be applied. Or maybe he happens to focus on applications with hidden structure?
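
For what it's worth, here's my rough picture of what "exploiting low-order correlations" might mean, again as a toy simulation of my own (not Hsu's algorithm): in a simple single-topic mixture, the matrix of pairwise word co-occurrences is a sum of one rank-one term per topic, so its spectrum alone reveals how many hidden topics there are.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, k, n_docs = 50, 3, 20000

topic_word = rng.dirichlet(np.full(vocab, 0.1), size=k)  # k word distributions
topic_probs = np.array([0.5, 0.3, 0.2])                   # mixing weights

# Each "document": draw one hidden topic, then two words i.i.d. from it.
z = rng.choice(k, size=n_docs, p=topic_probs)
w1 = np.array([rng.choice(vocab, p=topic_word[t]) for t in z])
w2 = np.array([rng.choice(vocab, p=topic_word[t]) for t in z])

# Empirical second moment: how often each word pair co-occurs.
M2 = np.zeros((vocab, vocab))
np.add.at(M2, (w1, w2), 1.0)
M2 /= n_docs

# The population version of M2 has rank k, so about k singular values stand out.
s = np.linalg.svd(M2, compute_uv=False)
print("top 6 singular values:", np.round(s[:6], 4))
```

My understanding is that the actual spectral estimators go further and use third-order moments to recover the topic distributions themselves, not just their number.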

8 thoughts on “When is there “hidden structure in data” to be discovered?”

  1. Most of the applications I’ve seen are in natural language processing. There is a huge amount of coarse and fine-grained latent structure in a natural language document, reflected, for instance, in topical structure, discourse type (conversation vs. fiction vs. essays), mode of presentation (written, spoken, texted, …) and a grab bag of writing style structure (formal/informal, education level, localization of word choice, etc.). There is structure from broad discourse-level organization of an argument or essay to the fine-grained choice of word forms and word order in a sentence.

    The guidelines for these spectral methods that I’ve seen characterize the type and degree of separation in the data required to guarantee reconstruction of the underlying cluster structure. For instance, in the LDA applications I’ve seen, there must be words that show up in only a single topic. Evaluation is almost always held-out performance on benchmark problems (which are no longer really held out, given that everyone works on them, but that’s another story). Rarely do people in the machine-learning world worry about model fit or interpreting features. (Though there’s a bit of a cottage industry around visualizing and labeling clustering results.)

    For voting models, you could start by running some kind of factor analysis over all of the predictors in the national election survey. It might pop out some interesting relations among the survey questions and among the respondents, as well as between the two.
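
    Something like the following, purely schematic, with simulated responses standing in for the survey (the two hidden “traits” are made up):

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    n_respondents, n_items = 5000, 10

    # Two hidden traits (think economic and social ideology) drive ten items.
    traits = rng.normal(size=(n_respondents, 2))
    item_loadings = rng.normal(size=(2, n_items))
    responses = traits @ item_loadings + rng.normal(size=(n_respondents, n_items))

    # Eigen-decompose the item correlation matrix: roughly two big eigenvalues.
    corr = np.corrcoef(responses, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    print("eigenvalues:", np.round(eigvals[order], 2))
    print("first-factor loadings:", np.round(eigvecs[:, order[0]], 2))
    ```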

    The capitalization thing is due to the confusion between proper names and common nouns in English typography (not among the speakers/writers, but in the language itself). For me, “Gaussian mixture model” and “probabilistic context free grammar” are both descriptive, so I wouldn’t capitalize either. But if you think of these terms as the proper name of a method with a life of its own, then it would get capitalized. As consumer products move from branded to generic, they get uncapitalized. Names and their derived adjectives tend to be capitalized, until they get very common, like “boolean”. Common nouns are not capitalized unless they get promoted to product names, like “Apple”. I don’t know why Gaussian is still usually upper case — let’s give him the respect we give Boole!

    I plan to attend the talk.

      • +1. I started doing this a lot too after learning German.

        Pleasantly surprised that someone else shares this quirk. My adviser and the people proofreading my work were often annoyed that I capitalized things I shouldn’t have.

  2. Suppose you’re trying to model public opinion. At the top level, there is a fuzzy Republican cluster (anti-abortion, anti-gay, anti-global warming, etc.) and a diametrically opposed Democratic cluster. At higher resolution, there are smaller clusters. For example, Hispanic Catholics, who mostly vote for Democrats but, at least as of five years ago, opposed gay marriage by a 2-to-1 margin. Or Mormons, who, while generally similar to white Evangelicals in most respects, differ significantly in their attitudes toward immigrants; there’s a 15-point gap between Mormons and Evangelicals on the statement “It’s best for the future of our country to be active in world affairs”.

    Now suppose you have a data set with voter preferences, but it does not have “Mormons” marked as such. Or you have a hugely high-dimensional data set like the GSS, with 500 variables, and you’re not sure a priori which ones are relevant. That’s where hidden structure methods would help.
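
    Schematically, something like this (toy data standing in for the GSS; the group structure is made up):

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(3)
    centers = rng.normal(scale=3.0, size=(4, 30))   # 4 hidden groups, 30 items
    X = np.vstack([c + rng.normal(size=(500, 30)) for c in centers])

    # Fit mixtures with different numbers of components; BIC should bottom out
    # near the true number of hidden groups.
    for k in range(1, 7):
        gm = GaussianMixture(n_components=k, random_state=0).fit(X)
        print(k, round(gm.bic(X)))
    ```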

    But, as Bob says above, these methods are most commonly applied to sequential data, in areas like speech and language processing.

  3. I wonder if the limits on finding structure in vote choice and public opinion data are the result of impoverished data.

    For example, if we have individual records for a few hundred thousand people, including answers to 50 or so opinion questions, votes (or not) in the last few elections, census tract, car makes and models, place of birth, etc., we might find a lot of interesting (low-dimensional) structure. Looking at relatively simple aggregate categories, I’m not surprised we don’t.

    Most (all?) of the interesting machine learning results involve big fine-grained data sets. I think that’s a requirement for finding good latent structure, given current technology.

  4. This is most applicable when you have a high-dimensional dataset with underlying structure. However, if the correlation among the variables is quite small, you are unlikely to find any hidden or underlying structure in the dataset, and unlikely to be able to represent your high-dimensional data with low-dimensional latent variables.
