## Too many MC’s not enough MIC’s, or What principles should govern attempts to summarize bivariate associations in large multivariate datasets?

Justin Kinney writes:

Since your blog has discussed the “maximal information coefficient” (MIC) of Reshef et al., I figured you might want to see the critique that Gurinder Atwal and I have posted.

In short, Reshef et al.’s central claim that MIC is “equitable” is incorrect.

We [Kinney and Atwal] offer mathematical proof that the definition of “equitability” Reshef et al. propose is unsatisfiable—no nontrivial dependence measure, including MIC, has this property. Replicating the simulations in their paper with modestly larger data sets validates this finding.

The heuristic notion of equitability, however, can be formalized instead as a self-consistency condition closely related to the Data Processing Inequality. Mutual information satisfies this new definition of equitability but MIC does not. We therefore propose that simply estimating mutual information will, in many cases, provide the sort of dependence measure Reshef et al. seek.

For background, here are my two posts (Dec 2011 and Mar 2012) on this method for detecting novel associations in large data sets. I never read the paper in detail but on quick skim it looked really cool to me. As I saw it, the clever idea of the paper is that, instead of going for an absolute measure (which, as we’ve seen, will be scale-dependent), they focus on the problem of summarizing the grid of pairwise dependences in a large set of variables. Thus, Reshef et al. provide a relative rather than absolute measure of association, suitable for comparing pairs of variables within a single dataset even if the interpretation is not so clear between datasets.

At the time, I was left with two questions:

1. What is the value of their association measure if applied to data that are on a circle? For example, suppose you generate these 1000 points in R:

n <- 1000 theta <- runif (n, 0, 2*pi) x <- cos (theta) y <- sin (theta) Simulated in this way, x and y have an R-squared of 0. And, indeed, knowing x tells you little (on average) about y (and vice-versa). But, from the description of the method in the paper, it seems that their R-squared-like measure might be very close to 1. I can’t really tell. This is an interesting to me because it’s not immediately clear what the right answer “should” be. If you can capture a bivariate distribution by a simple curve, that’s great; on the other hand if you can’t predict x from y or y from x, then I don’t know that I’d want a R-squared-like summary to be close to 1. No measure can be all things to all datasets, so let me emphasize that the above is not a criticism of the idea of Reshef et al. but rather an exploration. 2. I wonder if they’d do even better by log-transforming any variables that are all-positive. (I thought about this after looking at the graphs in Figure 4.) A more general approach would be for their grid boxes to be adaptive.

My second post reported some criticisms of the method. Reshef et al. responded in a comment.

In any case, all these methods (including the method discussed in the paper by Simon and Tibshirani) seem like a step forward from what we typically use in statistics. So this all seems like a great discussion to be having. I like how Kinney and Atwal are going back to first principles.

P.S. There was one little thing I’m skeptical of, not at all central to Kinney and Atwal’s main points. Near the bottom of page 11 they suggest that inference about joint distributions (in their case, with the goal of estimating mutual information) is not a real concern now that we are in such a large-data world. But, as we get more data, we also gain the ability and inclination to subdivide our data into smaller pieces. For example, sure, “consumer research companies routinely analyze data sets containing information on ∼ 10^5 shoppers,” but it would be helpful to break up the data and learn about different people, times, and locations, rather than computing aggregate measures of association. So I think “little-data” issues such as statistical significance and efficiency are not going away. Again, this is only a small aside in their paper but I wanted to mention the point.

1. Anonymous says:

congrats on being the only statistician ever to reference The Fugees.

2. Michael Mitzenmacher says:

I just thought I’d mention that, just prior to Kinney and Atwal’s release of their critique on the arxiv, we had posted a new work (http://arxiv.org/abs/1301.6314). I’d encourage people to read both.

I agree that this is a great discussion to be having, and there are lots of potentially interesting things to do in this space. I’m glad other statisticians seem interested. It would help, I think, if some of them were more collegial and accurate in their statements. For example, you say Kinney writes:

“In short, Reshef et al.’s central claim that MIC is ‘equitable’ is incorrect.”

We’re quite clear in our papers that MIC appears equitable in practice over a range of functions, noise types, and noise levels, based on our testing through extensive simulations. We discuss the notion of equitability as a desirable heuristic property, as underscored by our use of words like “roughly equal” and “similar” instead of “equal” when discussing it. Philosophically, we have been using equitability as an approximate property (the closer the statistic comes to equalling R^2 over as large a space of functions/noise models as possible the better). Kinney apparently interprets it as a binary property (the statistic exactly equals R^2, always), but rather than saying this explicitly in his statement, he seems to suggest that our work includes some sort of incorrect proof or false mathematical claim; that is not the case.

Further, we believe that the capabilities of MIC we have demonstrated empirically provide clear evidence of its utility in practice. Our recent post to the arxiv continues in this line, offering simulations that are a strict superset of those conducted by Kinney and give a more thorough comparison between MIC and other methods, including mutual information estimation.

As we say in our new paper, we expect there are many variations on equitability that are useful to calculate, since what is meant by “noise” can vary considerably from application to application. If Kinney and Atwal disagree with ours because they think a different one is more useful in some situations, we think that’s great.

• anon says:

Aren’t those weasel words: “appears equitable”, “heuristic property”, “roughly equal”, “similar”?

If you offer weak / vague definitions of quantities you can never be pushed into a corner? Even if you were absolutely wrong (I’m not saying you are) it’d be impossible to prove it?

What’s a falsifiable test of your thesis? If what you are claiming were indeed wrong, what proof / test would serve that end?

3. Swagatam says:

Dear Andrew and Michael

about Andrew’s P.S. point—Reshef et al., for inexplicable resons, decided that N=5000 is a “large data set”. That in tself wouldn’t have been the problem if claims in Reshel et al. would persist in the limit of large N. As Kinney et al. shows very precisely, the dubious claims made in the paper on MIC’s virtues in cotrast to good old Mutual Information is simply mathematically incorrect.

If your concern is MI estimation, there are MI estimation methods that converge extremely well for N>1000-10000. MI captures, in the most general sense possible, the relationaship between two random variables, because no assumption is made on either the functional forms of the relationship, or the distributions of the variables, in defining MI. In other words, if MI is zero, the variable are indeed mutually independent, no functional dependence can be contrived, however way you try to “subdivide data”. So I am quite confused about your concern.

Michael,

I am not sure what you mean when you say “demonstrated empirically”. Perhaps, you are defining “demonstrated” in the same “philosophical” terms that you define “equitable”? Either you define equitable and prove MIC is equitable by your definition, or you don’t introduce the concept. So I don’t know what you mean by “equitable in practice” and “desirable heuristic property”. Being heuristic in a method is often necessary, as long as the “property” (namely, equitable) is defined mathematically. In the Science paper and the follow up, Rashef et al. chose a set of arbitrary functional relationships that are constructed for demonstration purposes (though these were functions with rather opaque numerical values for parameters). Kinney et al proves MI’s equitability (yes, they were forced to actually define the quantity in a sensible manner, and has tried to guess yours), and working with arbitrary functional relationships shows that the apprent failure of MI and success of MIC claimed in Rashef et al was simply artifactual. I don’t see a need of “philosophical” discussion here.

4. […] how to calculate MIC using R are available at exploredata.net. Here I calculate the MIC for an example (originally here) that Andrew Gelman posed on […]

5. Regarding the MIC of a circle, I was curious and have run MIC in R (see exploredata.net) so I did the calculation and got an MIC of 0.65 for exactly the proposed input. Details: http://planspace.org/2013/02/05/finding-the-mic-of-a-circle/ Conclusion: It isn’t clear that 0.65 is the ‘right’ answer for a circle, but it does provide more indication of a relationship than a correlation coefficient of zero does.

• revo11 says:

mutual information will be non-zero for a circle as well.

It is worth remembering that the mutual information value actually means something precisely: i.e. how many bits of information are shared between x and y variables, at a given resolution.

An MIC value of say, 0.65, has no operational meaning. It is not obvious that the MIC values tell us anything useful at all since many different noise functions will correspond to the same MIC. Simple example: a noiseless straight line and a noisy 2-by-2 checkerboard will both give MIC=1.

6. Jutin Kinney says:

The keystone claim of Reshef et al. is this: MIC is equitable and mutual information is not. Were either of these assertions not true, there would be no motivation for introducing MIC.

The text of their paper, as well as our correspondence with the authors, fully suggests that “equitability” means this: if y = f(x) + noise, a dependence measure D is equitable if and only if the value of D[x;y] is (in the large data limit) determined by the value of R^2[f(x);y]. As stated above, our critique proves that this definition is unsatisfiable.

Dr. Mitzenmacher suggests here that “equitability” was not meant to be a “mathematical property” but rather an “approximate property” or “desirable heuristic”. However, even if the thesis of Reshef et al. were weakened to “MIC is approximately more equitable than mutual information on the specific relationships tested in this paper”, it would still be incorrect.

This is demonstrated in Fig. 2A,B of our critique. For this we replicated the analysis of Reshef et al. using the same underlying functional relationships. When analyzing simulated data sets containing 5000 data points, we found that mutual information tracks R^2[f(x);y] much better than MIC. This pattern appears to hold in the large data limit (Fig. 2C).

It seems that MIC simply assigns larger values to monotonic relationships than to non-monotonic relationships. The coloring scheme in Fig. 2 of our critique makes this clear. By contrast, the lack of systematic coloring in Fig. 2 of Reshef et al. (or in their recent preprint) obscures the non-equitable bias of MIC.

7. […] ADDITIONAL COMMENTS 06.02.13: Justin B. Kinney made me aware of some follow-up work of Gurinder Atwal and his that questions the claimed “equitability” properties of MIC. A discussion of this paper, including a response by Michael Mitzenmacher in the comment thread and counter-response by him, can be found on Andrew Gelman’s blog. […]

8. Thoughts without proof says:

IMO the real problem with MIC is that an optimal grid is needed. However in many cases such an optimal grid simply doesn’t exist at all when the MIC value increases with the number of bins and hence the reported MIC value corresponds to the (user specified) maximum grid size. Increasing this maximum grid size will affect the values for different pairs of variables differently thence giving any comparisons a slightly negative connotation. And not to forget that the sample size seems to have an extreme influence on the MIC values. I’d say that MIC is not worth its computational cost. Maybe the usual scaled information criteria for, say, a quantile/rank based grid yield better results?

9. […] a discussion on Andrew Gelman’s blog Michael Mitzenmacher, one of Reshef et al’s senior authors, seems to try to back out of the […]