The maximal information coefficient

Justin Kinney writes:

I wanted to let you know that the critique Mickey Atwal and I wrote regarding equitability and the maximal information coefficient has just been published.

We discussed this paper last year, under the heading, Too many MC’s not enough MIC’s, or What principles should govern attempts to summarize bivariate associations in large multivariate datasets?

Kinney and Atwal’s paper is interesting, with my only criticism being that in some places they seem to aim for what might not be possible. For example, they write that “mutual information is already widely believed to quantify dependencies without bias for relationships of one type or another,” which seems a bit vague to me. And later they write, “How to compute such an estimate that does not bias the resulting mutual information value remains an open problem,” which seems to me to miss the point in that unbiased statistical estimates are not generally possible and indeed are often not desirable.

Their criticisms of the MIC measure of Reshef et al. (see above link to that earlier post for background) may well be reasonable, but, again, there are points where they (Kinney and Atwal) may be missing some possibilities. For example, they write, “nonmonotonic relationships have systematically reduced MIC values relative to monotonic ones” and refer to this as a “bias.” But it seems to me that nonmonotonic relationships really are less predictable. Consider scatterplots A and B of the Kinney and Atwal paper. The two distributions have the same residual error sd(y|x), but in plot B (the nonmonotonic example) sd(x|y) is much bigger. Not that sd is necessarily the correct measure—in my earlier post, I asked what would be the appropriate measure of association between two variables whose scatterplot looked like a circle (that is, y = +/- sqrt(1-x^2)). More generally, I fear that Kinney and Atwal could be painting themselves in a corner if they are defining the strength of association between two variables in terms of the distribution of y given x. I’m not so much bothered by the asymmetry as by the implicit dismissal of any smoothness in x. One could, for example, consider a function where sd(y|x)=0, that is, y is a deterministic function of x, but in a really jumpy way with lots of big discontinuities going up and down. This to me would be a weaker association than a simple y=a+bx.
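The circle example can be made concrete. Below is a minimal sketch (in Python, rather than the R used elsewhere in this thread; the `pearson` helper is written out only for self-containedness) showing that for points on the circle y = +/- sqrt(1 - x^2), the linear correlation is essentially zero even though the dependence is exact:

```python
import math
import random

random.seed(1)

# Points on the circle y = +/- sqrt(1 - x^2), sampled by angle.
n = 10000
xs, ys = [], []
for _ in range(n):
    t = random.uniform(0.0, 2.0 * math.pi)
    xs.append(math.cos(t))
    ys.append(math.sin(t))

def pearson(a, b):
    # Sample Pearson correlation, written out to avoid any dependencies.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

r = pearson(xs, ys)
print(abs(r) < 0.05)  # True: correlation is ~0 despite exact dependence
```

Any measure built on sd(y|x) alone also struggles here, since y given x is bimodal, which is exactly why the choice of dependence measure is not automatic.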

The two papers under discussion differ not just in their methods and assumptions but in their focus. My impression of Reshef et al. was that they were interested in quickly summarizing pairwise relations in large sets of variables. In contrast, Kinney and Atwal focus on getting an efficient measure of mutual information for a single pair of variables. I suppose that Kinney and Atwal could apply their method to a larger structure in the manner of Reshef et al., and I’d be interested in seeing how it looks.

I’d also be interested in a discussion of the idea that the measure of dependence can depend on the scale of discretization, as discussed in my earlier post.

In any case, lots of good stuff here, and I imagine that different measures of dependence could be useful for different purposes.

1. Zach says:

‘Too Many MCs, Not Enough MICs’ is a great heading! Do you have a favorite heading you’ve written off the top of your head?

• Andrew says:

This is one of my favorites.

• Zach says:

That one has a great thematic link (I needed to google a bit to get the reference). I’m partial to the wordplay of this one though. I’m pleasantly surprised by this dimension of your cultural interests.

2. David says:

Slightly related, a new measure of dependence was recently proposed in NIPS (The Randomized Dependence Coefficient):

http://arxiv.org/abs/1304.7717

3. Dan Wright says:

On Andrew’s point about functions that jump around, and about there being, for many applications at least, no single right answer for which measures are appropriate, I tried to see whether the MIC values looked like what I would guess from scatterplots. For x from 0 to 10 I used y = sin(a*x) + b*N, where a is 1, 10, 50, or 100, b is 0.5, 1, or 2, and N is Normal(0,1). I looked at what minerva::mine (in R) gave for MIC and put this on each plot. MIC is higher for sin(10x) than for sin(x). Here is the code in case others want to try it (sorry, the mine function is slow; the algorithm in David’s post is faster).

install.packages("minerva")
library(minerva)
par(mfrow=c(3,4))
x <- runif(10000,0,10)
y0a <- sin(x) + rnorm(10000)/2
y0b <- sin(x) + rnorm(10000)
y0c <- sin(x) + 2*rnorm(10000)
y1a <- sin(10*x) + rnorm(10000)/2
y1b <- sin(10*x) + rnorm(10000)
y1c <- sin(10*x) + 2*rnorm(10000)
y2a <- sin(50*x) + rnorm(10000)/2
y2b <- sin(50*x) + rnorm(10000)
y2c <- sin(50*x) + 2*rnorm(10000)
y3a <- sin(100*x) + rnorm(10000)/2
y3b <- sin(100*x) + rnorm(10000)
y3c <- sin(100*x) + 2*rnorm(10000)

plot(x,y0a,pch='.',main=round(mine(x,y0a)$MIC,3))
plot(x,y1a,pch='.',main=round(mine(x,y1a)$MIC,3))
plot(x,y2a,pch='.',main=round(mine(x,y2a)$MIC,3))
plot(x,y3a,pch='.',main=round(mine(x,y3a)$MIC,3))
plot(x,y0b,pch='.',main=round(mine(x,y0b)$MIC,3))
plot(x,y1b,pch='.',main=round(mine(x,y1b)$MIC,3))
plot(x,y2b,pch='.',main=round(mine(x,y2b)$MIC,3))
plot(x,y3b,pch='.',main=round(mine(x,y3b)$MIC,3))
plot(x,y0c,pch='.',main=round(mine(x,y0c)$MIC,3))
plot(x,y1c,pch='.',main=round(mine(x,y1c)$MIC,3))
plot(x,y2c,pch='.',main=round(mine(x,y2c)$MIC,3))
plot(x,y3c,pch='.',main=round(mine(x,y3c)$MIC,3))

• Ben says:

Odd. Maybe the increase is a grid artefact?

For those who don’t feel like waiting: http://dl.dropboxusercontent.com/u/17357243/MIC.png
And here is how our pet behaves on these: http://dl.dropboxusercontent.com/u/17357243/A.png

Details here: http://arxiv.org/abs/1303.1828 and R package here: http://cran.r-project.org/web/packages/matie/index.html

Add this to the above code to try it out on that example:

install.packages("matie")
library(matie)
plot(x,y0a,pch='.',main=round(ma(cbind(x,y0a))$A,3))
plot(x,y1a,pch='.',main=round(ma(cbind(x,y1a))$A,3))
plot(x,y2a,pch='.',main=round(ma(cbind(x,y2a))$A,3))
plot(x,y3a,pch='.',main=round(ma(cbind(x,y3a))$A,3))
plot(x,y0b,pch='.',main=round(ma(cbind(x,y0b))$A,3))
plot(x,y1b,pch='.',main=round(ma(cbind(x,y1b))$A,3))
plot(x,y2b,pch='.',main=round(ma(cbind(x,y2b))$A,3))
plot(x,y3b,pch='.',main=round(ma(cbind(x,y3b))$A,3))
plot(x,y0c,pch='.',main=round(ma(cbind(x,y0c))$A,3))
plot(x,y1c,pch='.',main=round(ma(cbind(x,y1c))$A,3))
plot(x,y2c,pch='.',main=round(ma(cbind(x,y2c))$A,3))
plot(x,y3c,pch='.',main=round(ma(cbind(x,y3c))$A,3))

• Ben says:

I should also point out that MIC’s 0.08s need to be interpreted relative to its “0” (the score it gives for totally independent data), which is around 0.061 for N=10000. It decreases slowly with the sample size – not an attractive feature.

4. Thanks for discussing this paper, Andrew. However, I do disagree with much of what you say here.

Our paper directly disputes the main claims of Reshef et al., methodically dismantles their argument, and exposes the artifactual evidence presented in its favor. We also show that the original questions posed by Reshef et al. have natural and practical answers in information theory, a point completely missed in their original paper.

Look, we mathematically prove that Reshef et al.’s definition of statistical “equitability” cannot be satisfied; we replace it with an alternative definition that is naturally satisfied by mutual information; we redo their simulations and show that their primary evidence was artifactual; we show that MIC has much worse statistical power than mutual information estimates; we show that MIC is *much harder* to estimate than mutual information; we even show that most of the time MIC is literally just a bad estimate of mutual information (since the normalization constant is 1.0).

Finally, I want to emphasize this critical point: Estimating mutual information does *exactly* what MIC was designed to do. In no way is MIC more appropriate for large data sets. Basically, there is no good reason to use MIC. It is a silly statistic that is polluting people’s work.

• Andrew says:

Justin:

I’m confused. What specifically in my post did you disagree with? I thought I was presenting your paper in a positive light!

• Please don’t get me wrong, your post isn’t unkind to our paper at all. And I do appreciate you blogging about this. I just respectfully disagree with many of the points you make, and/or the relevance of those points to the key issues at hand.

In particular, I disagree with the statement that our paper and Reshef et al. “differ not just in their methods and assumptions but in their focus.” Our paper is not orthogonal to Reshef et al.; rather, we confront the central tenets of their work and conclude that they were wrong.

Lots of people have been using MIC, and I think it’s important for them to fully appreciate the severity of our claims.

• seth says:

Hey Justin, nice paper. I’m curious whether you considered comparing against HSIC (Gretton et al). I know there’s been some work comparing it to dCor, and I’ve run some tests myself (adapting Simon and Tibshirani’s code), which were quite promising.

• Andrew says:

Justin:

What are the many points that I made that you disagreed with? Again, the goal here is not to pick a fight but to clarify the discussion. I wrote that the two papers differ in their focus because the central example in the Reshef et al. paper was a highly multivariate example where they looked at many bivariate relations at once, whereas you focus on just looking at a single bivariate relation. I did not say (or mean to imply) that your paper did not confront Reshef et al., just that you had a different focus.

• Hi Andrew,

I think this notion of MIC being designed for large multivariate settings is a common misconception.

The substance of Reshef et al. (2011) is entirely about bivariate (not multivariate) measures of dependence. Indeed, “equitability” is defined only for bivariate measures of dependence, MIC itself is just a bivariate measure of dependence, and in fact all of the tools in the MINE suite described in the main text are applicable only to bivariate relationships. This focus on bivariate measures is also highlighted in the accompanying Perspective by Terry Speed (https://www.sciencemag.org/content/334/6062/1502).

It is true that, in applications to real data, the authors consider how to analyze large multivariate data sets, but in doing so they just break these into a large compendium of bivariate relationships. There is nothing inherently multivariate about their analysis.

This emphasis of Reshef et al. on multivariate data sets just reflects how they chose to package the story about MIC and MINE. The only reason our paper seems more focused on bivariate relationships is that it is written in a more straightforward manner.

-Justin

• Andrew says:

Justin:

So, just to be clear, you don’t disagree with anything else in my post, just the remark that Reshef et al. focus on using their measure in a multivariate context and you focus on the bivariate setting?

• Sorry, but no. I have posted a point-by-point critique of this post below.

5. David Reshef says:

Andrew, thanks for your post and for your continuing interest in this topic — we think the issues you bring up are interesting and valuable to discuss. Readers may not be surprised to hear that we disagree with much of what Kinney and Atwal’s paper has to say, both in the way it misrepresents our earlier work on MIC, and in the empirical comparisons with mutual information. We’ve been preparing two new papers about the theoretical and empirical properties of MIC, mutual information, and equitability that we hope will be clarifying for the community as this important conversation continues to develop.

• Dear David,

If you believe that Mickey Atwal and I have misinterpreted your work, I encourage you to submit a letter to PNAS describing your concerns so that we can have a systematic discussion about these issues. The sooner this important matter has a meaningful hearing out in the open, the better.

Others might not realize, however, that we presented our critique to you and your colleagues over two years ago, soon after your 2011 paper was published. Although we did exchange some emails thereafter, Mickey and I never received what we considered to be a substantive response to our criticisms. In particular, the preprint (http://arxiv.org/abs/1301.6314) posted by you and your coauthors in January of 2013 contains no substantive rebuttal of our main points.

It would have been optimal to be notified of any other concerns you might have prior to the publication of our paper in PNAS; our preprint was on the arXiv for over a year. However, if you do have specific points to make regarding our paper, Mickey and I would still appreciate your feedback.

Sincerely,
-Justin

• David Reshef says:

Hi Justin,

As you wrote, we did indeed address and respond to your claims in 2011. Our forthcoming papers are not designed to respond directly to your current claims, but since they’re theoretical and applied developments of equitability/the MIC methodology, we believe they’ll also clarify the relationship of your work to ours and in so doing address your concerns.

We may also consider writing to PNAS about your new paper, but if so we want to ensure that our letter helps move the field forward and keeps the dialog constructive and civil.

Sincerely,
-Dave

6. Anonymous says:

“One could, for example, consider a function where sd(y|x)=0, that is, y is a deterministic function of x, but in a really jumpy way with lots of big discontinuities going up and down. This to me would be a weaker association than a simple y=a+bx.”

isn’t this an argument over a definition rather than something that can be said to be “true” or “false” in any absolute sense?

7. Hi Andrew,

Because you expressed a wish to know my opinion in detail, I provide here a point-by-point critique of your blog post.

I want to emphasize that I do not wish to be combative, and I apologize if the tone of this critique is harsher than would be ideal. But I believe it is important for your readers to have a clear understanding of what our paper says, and I think that your blog post muddles many of the key issues. So please view these criticisms in the most constructive and collegial light — that is my intention in writing this.

Sincerely,
-Justin

– “Kinney and Atwal’s paper is interesting, with my only criticism being that in some places they seem to aim for what might not be possible.”

This is a major criticism, and I see no justification for it. All of our mathematical claims we prove. All of our computational claims we demonstrate using simulations. What else would our paper need to do to dispel your uncertainty?

– “For example, they write that “mutual information is already widely believed to quantify dependencies without bias for relationships of one type or another,” which seems a bit vague to me.”

Of course this is vague. The sentence you cite occurs in the introduction in the place where we argue that the *current* understanding of mutual information’s equitability is vague, and that *in this paper* we formalize this notion. Is this context not clear from the text?

– “And later they write, “How to compute such an estimate that does not bias the resulting mutual information value remains an open problem,” which seems to me to miss the point in that unbiased statistical estimates are not generally possible and indeed are often not desirable.”

Again, this sentence occurs in the introduction merely to provide background. The mutual information estimation problem is a well-recognized open problem. That is all we are saying. The purpose of our paper is *not* to solve this problem, nor do we claim that this is what we are doing.

– “Their criticisms of the MIC measure of Reshef et al. (see above link to that earlier post for background) may well be reasonable,”

Critiquing MIC is one of the major purposes of our paper. Having read our paper, are you still unsure whether our critique is even “reasonable”?

– “but, again, there are points where they (Kinney and Atwal) may be missing some possibilities. ”

This is true of every paper, is it not?

– “For example, they write, ‘nonmonotonic relationships have systematically reduced MIC values relative to monotonic ones’ and refer to this as a ‘bias.’”

Not having this bias is precisely what “equitability” means. This notion of equitability is a central concept in the paper of Reshef et al. The *entire* reason they give for introducing MIC is the claim that it does not have this specific kind of bias. We show that this claim is wrong. Now, you might not agree that equitability is a sensible concept. That’s fine. But your comment makes it sound as if claiming MIC has this bias is a weak criticism, when in fact it strikes at the heart of the reason for introducing MIC in the first place.

– “But it seems to me that nonmonotonic relationships really are less predictable…would be a weaker association than a simple y=a+bx.”

Claude Shannon solved this problem: no, nonmonotonic relationships are *not* inherently less predictable than monotonic relationships. Why speculate about matters that can be quickly and rigorously settled by elementary information theory arguments? Even if you disagree with the tenets of information theory, at least let the reader know that there is a well-developed field of mathematics that purports to answer this question.
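The information-theoretic point can be illustrated concretely: mutual information is invariant under any one-to-one relabeling of x, monotonic or not. A minimal sketch (in Python rather than the thread's R; the `mutual_info` helper is illustrative and not from either paper):

```python
import math
from collections import Counter

def mutual_info(pairs):
    # Plug-in mutual information (in nats) of an empirical joint distribution.
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# A monotonic relationship: y increases with x.
xs = list(range(8)) * 500
ys = [x // 2 for x in xs]

# A nonmonotonic version: relabel x by a bijection that scrambles the order.
perm = [3, 7, 0, 5, 1, 6, 2, 4]
xs_scrambled = [perm[x] for x in xs]

i_mono = mutual_info(list(zip(xs, ys)))
i_nonmono = mutual_info(list(zip(xs_scrambled, ys)))
print(abs(i_mono - i_nonmono) < 1e-12)  # True: MI ignores the ordering of x
```

Since the relabeling is invertible, the two mutual informations agree exactly (here both equal log 4 nats), which is the sense in which information theory says nonmonotonicity per se costs nothing.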

– “The two papers under discussion differ not just in their methods and assumptions but in their focus…. I’d be interested in seeing how it looks.”

Please see my above response to this paragraph. I critiqued only this paragraph initially because I felt that the issues it raised were the most important to address.

– “I imagine that different measures of dependence could be useful for different purposes.”

This statement is profoundly unhelpful to the reader if you don’t provide any guidance on which dependence measures to use in which situations. And in the case of MIC I think this statement is simply wrong: a major conclusion of our paper is that estimating MIC *never* makes more sense than simply estimating mutual information. We put a lot of thought, effort, and time into this conclusion, and have spelled out our reasons clearly in our paper.

• Andrew says:

Justin:

Thanks. No problem about being combative. No offense is taken. You and your colleagues have worked hard on this issue and it’s important for both theoretical and practical reasons. So you have every right to get annoyed if you feel your ideas are getting distorted in transmission!

Anyway, to clarify:

1. When I say you “seem to aim for what might not be possible,” this is connected to your statement that, “How to compute such an estimate that does not bias the resulting mutual information value remains an open problem.” From Wikipedia: “In science and mathematics, an open problem or an open question is a known problem which can be accurately stated, and which is assumed to have an objective and verifiable solution, but which has not yet been solved (no solution for it is known).” My problem with your statement is that it seems to imply that there can be such an unbiased estimate, with the only open problem being how to compute it. But my guess (just based on general knowledge of statistics, not on this particular topic) is that such an unbiased estimate probably does not in general exist. If so, computing such an estimate would be impossible; it would be a bit like describing “how to trisect an arbitrary angle using only compass and straightedge” as an open problem in geometry. And this gets back to why I think you seem to be aiming for what might not be possible.
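This bias point can be seen numerically: the naive plug-in estimator of mutual information is biased upward on independent data, and the bias shrinks with sample size without ever being exactly removed. A rough sketch (in Python; the bin count and helper names are arbitrary illustrative choices, not anyone's published estimator):

```python
import math
import random
from collections import Counter

random.seed(0)

def plugin_mi(xs, ys, bins=10):
    # Discretize each variable into equal-width bins, then take the naive
    # plug-in (maximum-likelihood) mutual information estimate, in nats.
    def binned(v):
        lo, hi = min(v), max(v)
        w = (hi - lo) / bins or 1.0
        return [min(int((u - lo) / w), bins - 1) for u in v]
    bx, by = binned(xs), binned(ys)
    n = len(xs)
    pxy = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum((c / n) * math.log((c / n) / ((px[i] / n) * (py[j] / n)))
               for (i, j), c in pxy.items())

def null_mi(n, reps=30):
    # Average estimate over independent uniform pairs, whose true MI is 0.
    total = 0.0
    for _ in range(reps):
        xs = [random.random() for _ in range(n)]
        ys = [random.random() for _ in range(n)]
        total += plugin_mi(xs, ys)
    return total / reps

small_n, large_n = null_mi(500), null_mi(10000)
print(small_n > large_n > 0)  # True: upward bias shrinks with n but stays positive
```

This is the same phenomenon Ben notes above for MIC's nonzero "null" value: a sample-size-dependent baseline is generic for plug-in dependence estimators, not peculiar to any one method.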

2. You write, “no, nonmonotonic relationships are *not* inherently less predictable than monotonic relationships.” It depends on what measure of predictability you are using. As I wrote in the above blog entry, you seem to be working with the distribution of y conditional on x, but there are other things that could be studied. In your example, you show two graphs where the predictive error of y given x is identical, but in which the predictive error of x given y is much different. It’s not a matter of disagreeing with the tenets of information theory, it’s just a matter of there being different questions that could be asked.

3. When I wrote, “I imagine that different measures of dependence could be useful for different purposes,” I recognize that I am not an expert in this area. I understand where you’re coming from because I too have written papers where I argue that my proposed approach dominates various existing methods (see, for example, here and here). And, indeed, I too find it super-frustrating when people continue to use the old method, presumably just because the old method is there and the users are too busy to switch. All I can say is that, although I’m very interested in the topics of your paper and that of Reshef et al., I’ve only read the papers quickly and I’m no expert here. So it would be inappropriate for me, given my current level of understanding, to say Kinney’s right and Reshef is wrong. I just don’t know! Right now what I do know is that both of you (as well as other groups) are working on these problems, and I think a useful contribution I can make here is to post these papers along with my perspectives as an applied statistician who thinks the subject is interesting and important.

• Regardless of our different opinions on these points, I would again like to thank you for hosting this discussion, and for engaging in this back-and-forth. And thank you as well for being understanding about my objections. I think it is important for scientists — especially young scientists — to see that it is not uncivil to openly and forcefully disagree about scientific matters. Cheers, -Justin

8. […] noticed that the important topic of association measures and tests came up again in your blog, and we have few comments in this […]

9. Ryan Compton says:

Hi, thanks for the informative discussion. In addition to the points already made, here are a couple of things about MIC that need to be emphasized for people outside academia: (1) the MIC software is closed source, and (2) there is a patent: https://www.google.com/patents/WO2013067461A2

10. […] 14 Mar 2014: The maximal information coefficient […]