## Further thoughts on nonparametric correlation measures

Malka Gorfine, Ruth Heller, and Yair Heller write a comment on the paper of Reshef et al. that we discussed a few months ago.

Just to remind you what’s going on here, here’s my quick summary from December:

Reshef et al. propose a new nonlinear R-squared-like measure.

Unlike R-squared, this new method depends on a tuning parameter that controls the level of discretization, in a “How long is the coast of Britain” sort of way. The dependence on scale is inevitable for such a general method. Just consider: if you sample 1000 points from the unit bivariate normal distribution, (x,y) ~ N(0,I), you’ll be able to fit them perfectly by a 999-degree polynomial fit to the data. So the scale of the fit matters.
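To make the polynomial point concrete, here is a minimal sketch with 8 points standing in for the 1000 in the text (a degree-999 fit is numerically delicate, so the small size is a choice made for this illustration, as are the seed and variable names). The x and y values are drawn independently, yet a polynomial of degree n-1 reproduces them essentially exactly.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
n = 8  # small stand-in for the 1000 points mentioned in the text

# x and y are independent standard normals: there is no real relationship
x = rng.standard_normal(n)
y = rng.standard_normal(n)


def r_squared(y_obs, y_fit):
    """Ordinary R-squared: 1 - residual SS / total SS."""
    ss_res = np.sum((y_obs - y_fit) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Degree n-1 polynomial through n points (Polynomial.fit rescales x internally)
interp = Polynomial.fit(x, y, deg=n - 1)
r2_interp = r_squared(y, interp(x))

# Straight-line fit for comparison
line = Polynomial.fit(x, y, deg=1)
r2_linear = r_squared(y, line(x))

print(f"R^2, degree {n - 1}: {r2_interp:.6f}")
print(f"R^2, degree 1: {r2_linear:.6f}")
```

The high-degree fit "explains" pure noise, which is why any sufficiently flexible measure needs some cap on the scale of the fit.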

The clever idea of the paper is that, instead of going for an absolute measure (which, as we’ve seen, will be scale-dependent), they focus on the problem of summarizing the grid of pairwise dependences in a large set of variables. As they put it: “Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs . . . If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones?”

Thus, Reshef et al. provide a relative rather than absolute measure of association, suitable for comparing pairs of variables within a single dataset even if the interpretation is not so clear between datasets.

I followed up with some questions, and there were many comments, including this link from Rob Tibshirani to a paper with Noah Simon, who conclude:

We [Simon and Tibshirani] believe that the recently proposed distance correlation measure of Székely & Rizzo (2009) is a more powerful technique that is simple, easy to compute and should be considered for general use.
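For readers who have not seen distance correlation, here is a minimal sketch of the sample statistic as I understand the Székely and Rizzo definition (double-centered pairwise distance matrices, then a normalized average of their elementwise product). The function name and the test data are my own choices for illustration, and this O(n²) version is not tuned for large samples.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation between two samples of equal length."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)

    # Pairwise Euclidean distance matrices
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)

    # Double centering: subtract row and column means, add back the grand mean
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()

    dcov2 = (A * B).mean()   # squared sample distance covariance
    dvar_x = (A * A).mean()  # squared sample distance variance of x
    dvar_y = (B * B).mean()  # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


# A symmetric parabola: Pearson correlation is zero, distance correlation is not
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2
pearson = np.corrcoef(x, y)[0, 1]
dcor_parabola = distance_correlation(x, y)
dcor_linear = distance_correlation(x, 2.0 * x + 1.0)

print(f"Pearson (parabola): {pearson:.3f}")
print(f"dcor (parabola): {dcor_parabola:.3f}")
print(f"dcor (linear): {dcor_linear:.3f}")
```

Because the inputs are reshaped to n-by-d arrays, the same function accepts vector-valued x and y, which is the multivariate setting mentioned in the discussion.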

OK, now that we’re up to speed, here’s the comment from Gorfine:

Reshef et al. present a clever approximation of the brute-force approach of going over all possible grids to detect dependencies. Their method, however, does have some serious drawbacks:

1) My collaborators Ruth and Yair Heller and I conducted a simulation study to compare the power of MIC to two other methods: Dcor (as in Professor Tibshirani’s comment above) and HHG (http://arxiv.org/abs/1201.3522). From the study it was clear that for certain data sets MIC suffers from extremely low power compared to the other methods. A detailed description of the simulations and the power issue can be found here (http://ie.technion.ac.il/~gorfinm/files/science6.pdf).

2) In a personal communication the authors explained that the main point of their method is equitability (i.e. two relationships with the same noise level will get the same score) and not power. We believe that the equitability characteristic of the method is not very useful for the following reasons:

a) If you have low power and cannot detect much, equitability will not help you.

b) The authors prove equitability only for relationships without noise – which is never the case in statistics. Giving a few examples of equitability for noisy functions does not constitute a proof.

c) In our simulation study mentioned above, based on practical sample sizes such as 30, 50 or 100, we show that the MIC test performs poorly in terms of equitability. It gives different relationship types different scores and thus different power, and its degradation as noise is added depends strongly on the specific relationship type in question.

3) The authors give a few noisy examples for which their proofs do not hold (e.g. an L-shaped relationship) and try to demonstrate that MIC can be equitable even in such cases. There is, however, a simple counterexample which shows that MIC is not equitable for all relationships: generate a dataset that is a mixture of a uniform distribution on [0,1]x[0,1] and a uniform distribution on [1,2]x[1,2], i.e. a 2-field checkerboard (you can also try a larger checkerboard). It scores a maximal MIC of 1.0, just like y=x (credit to a post on http://statmodeling.stat.columbia.edu/2011/12/mr-pearson-meet-mr-mandelbrot-detecting-novel-associations-in-large-data-sets/).

4) If we understood correctly, almost all the proofs in the paper are about the full brute-force method, which tries out all possible grids, and not about the actual MIC approximation. Specifically, the authors do not prove that their approximation is statistically consistent against any alternative, which means that even with infinite data the researcher cannot be sure that, if there is dependency, MIC will find it.

5) As MIC uses an approximation it has quite a few unjustified heuristics:

a) A parameter of n^0.6 without justifying why it is better than say n^0.7.

b) A parameter for the number of clumps which is set at 15 without justification.

In fact, looking at section 4.1 of the SOM (page 15), the authors really did play with these 2 parameters (they have a different value for them for every figure). Trying multiple parameter values for a statistical test invalidates the p-values found. The authors should report the findings for the default parameter settings.

6) MIC is relevant only for univariate data while HHG and Dcor work also in a multivariate setting.

Due to all these drawbacks, our bottom line is that the two other methods mentioned in our comment are superior to MIC and we recommend that scientists use them rather than MIC.
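The checkerboard counterexample in point 3 is easy to check numerically. The sketch below is not the MINE software and does not search over grids; it only computes the mutual information of the binned data on one fixed 2-by-2 grid, divided by log2 of the smaller grid dimension (which I take to be the normalization used for MIC). Since such a normalized score cannot exceed 1, finding a value of 1 on a single grid is enough to show that the maximum over grids is 1. The helper `grid_score`, the sample sizes, and the seed are all assumptions made for this illustration.

```python
import numpy as np


def grid_score(x, y, x_edges, y_edges):
    """Mutual information (bits) on one fixed grid, divided by log2(min(rows, cols))."""
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    p = counts / counts.sum()            # joint cell probabilities
    px = p.sum(axis=1, keepdims=True)    # marginal over x bins
    py = p.sum(axis=0, keepdims=True)    # marginal over y bins
    nz = p > 0                           # skip empty cells (0 * log 0 = 0)
    mi = np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz]))
    return float(mi / np.log2(min(p.shape)))


rng = np.random.default_rng(1)
m = 500
edges = [0.0, 1.0, 2.0]  # a single split at 1 on each axis

# Checkerboard: half the points uniform on [0,1]x[0,1], half on [1,2]x[1,2]
pts = np.vstack([rng.uniform(0, 1, size=(m, 2)), rng.uniform(1, 2, size=(m, 2))])
checker_score = grid_score(pts[:, 0], pts[:, 1], edges, edges)

# Noiseless line y = x on [0, 2], evenly spaced so each bin holds half the points
t = np.linspace(0.0, 2.0, 2 * m)
line_score = grid_score(t, t, edges, edges)

# Independent uniform noise on [0,2]x[0,2] for reference
u = rng.uniform(0, 2, size=(2 * m, 2))
noise_score = grid_score(u[:, 0], u[:, 1], edges, edges)

print(f"checkerboard: {checker_score:.3f}")
print(f"line y = x: {line_score:.3f}")
print(f"independent: {noise_score:.3f}")
```

On this grid the checkerboard and the noiseless line are indistinguishable, even though within each square of the checkerboard x tells you nothing further about y.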

Perhaps Reshef or one of the other authors of that paper can comment?

2. Corey says:

I ran across references to distance correlation on CrossValidated and downloaded the article, the subsequent discussion, and the rejoinder just two nights ago (I haven’t got through the paper yet). It looks very interesting! — but I have yet to figure out how the topic of distance correlation and the topic of Bayesian inference intersect…

3. john says:

4. MINE authors says:

Hi all,
We’re really happy to see so much discussion of MIC and MINE. As John points out, we’d posted a response to both Prof. Tibshirani’s and Prof. Gorfine’s comments on the article’s message board on the Science website. The text is reproduced below:

===

Thank you for taking the time to think deeply about our work. We hold you both in high regard and would like to respectfully respond to your comments.

Though we have great appreciation for both distance correlation (dcor) and HHG, they belong to a large class of methods that address a fundamentally different problem (testing for the presence of statistical dependence) than the one we approached (quantifying the strength of a dependence in an equitable way). In our work, we were responding to the recognized burden of many researchers in today’s data age: identifying a relatively small set of strongest associations, as opposed to finding as many non-zero associations as possible, which often are too many to sift through.

While we agree, of course, that a method with better power is always preferable if all other things are equal, we disagree that an equitable statistic is only useful if it is better powered than the state of the art. The desiderata of a statistic are a function of the problem it is being used to solve. For instance, distance correlation may have excellent power, and it is innovative in that it can be applied to relationships between vectors of arbitrary dimension, but as Supplementary Figure 3 of our paper shows, it is unfortunately among the worst-performing statistics in terms of equitability. So though it is certainly useful for solving other problems, it is not well suited for the problem we posed, because our problem really calls for an equitable statistic. MIC may not be as powerful as dcor or HHG, but our analyses of real data sets show that this is not crippling: MIC still has enough power to find a wealth of meaningful and interesting relationships when used together with appropriate significance testing. Thus, given the fact that MIC is much closer to equitability, we think it’s better suited for our data exploration problem, in which we’re not concerned only with finding as many relationships as possible.

The field of data exploration has already benefited greatly from the combination of mindsets and priorities arising from its interdisciplinary nature. We hope that our conversation can become a constructive part of this process and contribute to the further development of the discipline.

===

While our response to you and Prof. Tibshirani speaks to your major points 1 and 2, we also wanted to address your other points:

3) We are aware of this “checkerboard” example, and we can see why you might view the fact that this gets an MIC of 1 as problematic. However, over the course of our work we came to view this as desirable, because it captures the “relationship” sgn(x) = sgn(y), and from the point of view of data exploration, we would think that for someone exploring an unknown data set, this would certainly be a pattern one would want to see at the top of the list.

Nevertheless, as our paper discusses, when we view the entire characteristic matrix (instead of just its maximum, which is MIC), we can make more refined judgments as to what relationships are interesting. In particular, for the checkerboard example, if we look at m_{x,y} values where x,y are larger than 2, we see a drop-off in scores that we would not see for other relationships. This is pictured in Figure 3E of our paper. So if this type of relationship is deemed uninteresting, you can modify or supplement MIC using additional properties of the characteristic matrix to filter it out.
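A rough population-level sketch of why such a drop-off is to be expected (this is a back-of-the-envelope argument under stated assumptions, not a description of the authors' algorithm): assume the two squares of the checkerboard carry equal mass, write X and Y for the two variables and S for the indicator of which square a point falls in, and take the characteristic matrix entry m_{x,y} to be the largest mutual information attainable on an x-by-y grid divided by log min(x,y). Since S is determined by X, and X and Y are independent within each square,

$$
I(X;Y) = I(S;Y) + I(X;Y \mid S) = H(S) + 0 = 1 \text{ bit}.
$$

Binning X and Y can only lose information (the data processing inequality), so no grid can extract more than 1 bit, and

$$
m_{x,y} \le \frac{1}{\log_2 \min(x,y)},
$$

which is 1 for a 2-by-2 grid, about 0.63 when the smaller dimension is 3, and 0.5 when it is 4. For a noiseless y=x, by contrast, a grid with matching equal-mass bins on both axes attains the full log min(x,y), so those entries stay at 1. Scores computed from a finite sample can deviate from these population values.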

4) It’s true that we proved only that MIC, rather than Approx-MIC, will detect any deviation from independence. This is because we see MIC as our main contribution and expect that Approx-MIC will be improved upon. However, all of our proofs for functional relationships and superpositions of functional relationships actually work for the approximation algorithm as well (you can see this by looking at the grids constructed in those proofs).

5+6) We agree that our work leaves many questions open. This is because we had intended our paper to be the first on MIC and equitability, not the last word on the subject. Much is left to understand both about the former (e.g. what really is the best way to choose the exponent? How can Approx-MIC be improved? How should MIC be generalized and computed for higher dimensions?) and the latter (e.g. is there a more precise notion of equitability for non-functional relationships? For relationships of higher dimension?). We look forward to pursuing these and other questions over the years ahead, and hope to work with you and other researchers to tackle the important challenges of the data age.

5. Phil says:

Andrew, I haven’t read the paper, but from the descriptions on your blog it seems like it might be related to our morphing work, inasmuch as that quantifies how much you have to distort Curve B in order to make it look like Curve A. I suppose I’m not really making a comment on Reshef et al. or Gorfine et al., just trying to popularize that paper of ours, since I think we were really onto something there and I regret never having found a way to pursue it.

6. C Ryan King says:

It’s interesting that, because of the very different smoothness constraints effectively assumed by the histogram method underlying MIC, it does so well for high-frequency and variable-frequency periodic curves. These are the worst departures from equitability in their Fig S3.