Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets

Jeremy Fox asks what I think about this paper by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti which proposes a new nonlinear R-squared-like measure.

My quick answer is that it looks really cool!

From my quick reading of the paper, it appears that the method reduces on average to the usual R-squared when fit to data of the form y = a + bx + error, and that it also has a similar interpretation when “a + bx” is replaced by other continuous functions.

Unlike R-squared, the method of Reshef et al. depends on a tuning parameter that controls the level of discretization, in a “How long is the coast of Britain” sort of way. The dependence on scale is inevitable for such a general method. Just consider: if you sample 1000 points from the unit bivariate normal distribution, (x,y) ~ N(0,I), you’ll be able to fit them perfectly with a 999-degree polynomial. So the scale of the fit matters.
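
To see the point concretely, here is a minimal R sketch (the seed is arbitrary, and I use n = 20 rather than 1000 only because a degree-999 polynomial fit is numerically unstable; the logic is the same):

set.seed(1)
n <- 20
x <- rnorm(n)
y <- rnorm(n)  # independent of x: no true association
fit <- lm(y ~ poly(x, n - 1))  # a degree-(n-1) polynomial interpolates n points
summary(fit)$r.squared  # 1 (up to rounding): a "perfect" fit to pure noise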

The clever idea of the paper is that, instead of going for an absolute measure (which, as we’ve seen, will be scale-dependent), they focus on the problem of summarizing the grid of pairwise dependences in a large set of variables. As they put it: “Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs . . . If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones?”

Thus, Reshef et al. provide a relative rather than absolute measure of association, suitable for comparing pairs of variables within a single dataset even if the interpretation is not so clear between datasets.

I suspect that the R-squared-like nature of the method will bother some purists who have bashed R-squared’s dependence on the range of x in the data. But, as regular readers will know, I like R-squared (in its place) and I have warm feelings about this generalization.

I leave the authors with two questions:

1. What is the value of their association measure if applied to data that are on a circle? For example, suppose you generate these 1000 points in R:

n <- 1000
theta <- runif(n, 0, 2*pi)  # angles uniform around the circle
x <- cos(theta)
y <- sin(theta)

Simulated in this way, x and y have an R-squared of 0. And, indeed, knowing x tells you little (on average) about y (and vice versa). But, from the description of the method in the paper, it seems that their R-squared-like measure might be very close to 1. I can't really tell. This is interesting to me because it's not immediately clear what the right answer "should" be. If you can capture a bivariate distribution by a simple curve, that's great; on the other hand, if you can't predict x from y or y from x, then I don't know that I'd want an R-squared-like summary to be close to 1.
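
As a quick check of that first claim, one can continue the snippet above with:

summary(lm(y ~ x))$r.squared  # essentially 0: no linear association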

No measure can be all things to all datasets, so let me emphasize that the above is not a criticism of the idea of Reshef et al. but rather an exploration.

2. I wonder if they'd do even better by log-transforming any variables that are all-positive. (I thought about this after looking at the graphs in Figure 4.) A more general approach would be for their grid boxes to be adaptive.
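
A minimal sketch of the contrast (the lognormal data and the choice of five bins are made up for illustration): equal-width bins waste most of their resolution on a skewed all-positive variable, while equal-frequency (adaptive) bins, which is roughly what a log transformation buys you here, do not:

set.seed(1)
z <- rlnorm(1000)  # skewed, all-positive
table(cut(z, 5))   # equal-width bins: nearly all the mass lands in the first bin
table(cut(z, quantile(z, 0:5/5), include.lowest = TRUE))  # adaptive: 200 points per bin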

27 thoughts on “Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets”

  1. Thanks for posting on this, Andrew; your comments are good ones. I too find the approach very interesting, and your comments raise some points I hadn’t thought of. Your question about data on a circle is a good one; as you say, it’s not intuitively clear what the right answer *ought* to be in such a case. If I had to guess, I’d say that the Reshef et al. measure of association might be >0 in this case (maybe 0.5?), because knowing x does give you some information about y (specifically, that y takes on one of two values, right?). But I’m not at all sure about this.

    There’s also a “layman’s” write-up of the article in Science, which raises some other good points. For instance, can you convert this measure of association into a conditional measure, in order to obtain something analogous to a partial correlation or partial regression?

  2. Can you say more about using log-transformed data here?

    I’ve had occasion to do that, but for ratio data where the numerator/denominator choice was arbitrary, and thus log-transformed data was necessary to get consistent statistical measures regardless of the arbitrary choice.

  3. I was thinking that they seem to have taken the hard part of any association metric which relies on density (doing a multivariate density estimate) and replaced it with something which is easy to analyze as the sample goes to infinity (the set of grids creating a discrete bivariate distribution) but which may not make any sense in particular datasets. I haven’t gone through the regularization tricks (resolution limits) which prevent the grid size from getting too fine, but that seems very interesting. I hope that it works well in practice and is not just handwaving away the difficult problem of tuning a density estimate. Resolution limits are very attractive, similar to the Jeffreys prior in detecting and giving up on settings which are badly identified. The tricks on how one can squeeze interpretation out of the behavior of the matrices were also fascinating.

    I suspect that any kind of density estimation (even discrete, as in this case) may be solving an unnecessarily hard intermediate problem; similarly to this approach from Perez-Cruz, which replaced KL divergence estimates based on direct density estimates with ones based on spacings, my suspicion is that solving the integrated problem by estimating the integrand is not necessarily the best way to go. Of course there may be no easy route to making MI work; this paper noted that of the many KL divergence estimators available, none were very good at even their toy problems. MI is just the KL divergence from the joint density to the product of the marginals (independence), so one expects the same.
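
    For reference, the identity behind that last sentence, writing p(x,y) for the joint density and p(x), p(y) for the marginals:

    MI(X, Y) = KL( p(x,y) || p(x) p(y) ) = integral of p(x,y) log [ p(x,y) / ( p(x) p(y) ) ] dx dy

    so any estimator of KL divergence yields an estimator of MI, and the weaknesses of one carry over to the other.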

  4. Isn’t the circle covered by the “finite union of differentiable segments” part of their proof? (Which is: “if the support of (X,Y) is described by a finite union of differentiable curves of the form c(t) = [x(t),y(t)] for t in [0,1], then data drawn from (X,Y) will receive an MIC tending to 1 with probability one as sample size grows, provided that dx/dt and dy/dt are each zero on finitely many points”.)

    It looks, though, like your example of a circle highlights a potential practical difficulty of this method: their Figure 3 shows their nonparametric correlation measure as a function of the number of horizontal and vertical bins. Most noiseless relationships can be detected (normalized score = 1) with a small number of bins, but for the circle the score continues to increase monotonically with the number of bins, not reaching 1 (its asymptote) until at least a 30-by-30 grid.
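
    One can get a feel for this with a crude plug-in calculation (this is not the authors’ MIC algorithm, which searches over grids with a dynamic programming heuristic; it is just equal-width binning, with the normalization by log(k) borrowed from their definition):

    set.seed(1)
    n <- 1000
    theta <- runif(n, 0, 2*pi)
    x <- cos(theta); y <- sin(theta)
    mi_grid <- function(x, y, k) {
      tab <- table(cut(x, k), cut(y, k)) / length(x)   # joint cell frequencies
      px <- rowSums(tab); py <- colSums(tab)           # marginal frequencies
      ok <- tab > 0
      sum(tab[ok] * log(tab[ok] / outer(px, py)[ok]))  # plug-in mutual information
    }
    for (k in c(2, 5, 10, 30)) cat(k, "bins:", mi_grid(x, y, k) / log(k), "\n")

    On these simulated data the normalized score is essentially 0 on a 2-by-2 grid (each quadrant of the circle gets a quarter of the points, which looks exactly like independence) and climbs only slowly with k, which is another way of seeing why the circle needs such a fine grid.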

  5. I need to look at it in more detail, but I thought the idea was that it’s the maximum of the mutual information over multiple discretization scales. MI should work just fine regardless of the functional form relating the two variables. There may be computational issues, as Dave says (and maybe that’s what you’re referring to), but conceptually, at least, I don’t see a problem…

  6. “If you can capture a bivariate distribution by a simple curve, that’s great; on the other hand if you can’t predict x from y or y from x, then I don’t know that I’d want a R-squared-like summary to be close to 1.”

    This is kind of an interesting thought to ponder. I usually hear about MI in the context of compression, and this may be highlighting the distinction between compression and prediction. With the circle, if you know X, you can compress a Y signal very well, but if you only have one guess, your prediction won’t be very good (in an MSE sense).
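
    To put rough numbers on that (a quick simulation along the lines of the circle example in the post): the best single-point prediction of y given x is E[y|x] = 0, so the mean squared error of that one guess equals the total variance of y, and the predictive R-squared is 0, even though y is pinned down to just two possible values:

    set.seed(1)
    theta <- runif(1e5, 0, 2*pi)
    x <- cos(theta); y <- sin(theta)
    mean((y - 0)^2)  # MSE of the best single guess: about 0.5
    var(y)           # total variance of y: also about 0.5, so R-squared = 0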

  7. I’d also like to have a sense of how data hungry it is. How well does it work, as a function of the number of variables, the number of observations, and the number and strength of the bivariate relationships?

  8. Great post! It seems as if they model a circle in figure F on page 1520, and get an MIC of approximately .7 (hard to see on the 3D rendering). Or am I missing something? In any case, they have provided an R package here http://www.exploredata.net/Downloads so it should be possible to try it out on a range of figures (multiple circles?).

  9. Pingback: datanalytics » The “correlation of the 21st century”?

  10. Surely in the circle example x and y are very tightly associated – only the sign is uncertain. It is the measure of predictability (mean error) that has shortcomings in this scenario.

    • Eugene:

      I don’t think there’s any unambiguously right answer here. Different measures are answering different questions. On one hand, getting a prediction down to two points is pretty good. On the other hand, if those two points are far apart, this could be a problem. Under some measures you could get an association of 0, under others an association of 0.5, under others an association close to 1.

  11. I tried this through the R interface on an existing dataset that I’ve been working with. The results were in broad agreement, but I don’t quite know how to interpret the odd case where the standard R^2 correlation exceeds the MIC correlation. For instance, I get a case where MIC=.55 and R^2=.62. This gives MIC-R^2=-.07, which is a measure of nonlinearity — but in the direction of being less than non-linear (more linear than linear?). This case is right next to a case where MIC=.54 and R^2=.54, essentially identical.

  12. In Table S1 in their supporting online material, Reshef et al. show a MIC value of 0.71 for a circle and state: “Note that the circle and the pair of lines do not receive perfect MIC scores. This is because the scores of these associations approach 1 as sample size tends to infinity, while the table was generated with only n = 1000.”

  13. I think it’s instructive to think of MIC as a pairwise *dependence* measure, whereas R-squared is more of a *predictive* measure (except in the simple case where it has a direct relationship with correlation). In the case of the circle, we want/expect high MIC values because x and y are clearly dependent, even if prediction of one based on knowledge of the other comes down to a coin toss. That said, my experience with such measures suggests that the dependence metric for a circle should not go asymptotically to 1, because the dependence is obviously imperfect, albeit precisely so.

    The two metrics, in tandem, capture much of your heuristic — on the one hand, there is a clear dependence structure at play; on the other, a linear model will give you little information about y given x (on average).

  14. Note that Figure 2G includes a couple of many-to-many associations, including an ellipse and various superpositions of two functions. The table shows that, without added error, the MIC can be as high as 0.8.

    As for log transform, their dynamic programming heuristic to find the optimal partition should in theory be able to handle such unequally distributed data. Remember, they’re not using a fixed-size grid. In principle, if you apply any strictly monotonic function to both variables, you should get the same MIC. That’s *in principle*, of course. Their dynamic programming routine to find the optimal partitioning is only a heuristic, after all. I guess it wouldn’t hurt to do a rank ordering on the variables first, to avoid problems the heuristic might have with very clumpy distributions.
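
    A toy illustration of that in-principle invariance (the lognormal data and the quartile cut points here are made up for the example): a strictly increasing transform maps any set of cut points on the raw scale to cut points on the transformed scale with identical cell counts, so a grid-based score should not change:

    set.seed(1)
    x <- rlnorm(1000)                           # skewed, all-positive
    cuts <- quantile(x, c(0.25, 0.5, 0.75))     # cut points on the raw scale
    table(cut(x, c(-Inf, cuts, Inf)))           # counts in each cell
    table(cut(log(x), c(-Inf, log(cuts), Inf))) # identical counts after log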

  15. Actually, never mind – reading through the algorithm a bit more, I don’t think that a clumpy distribution would make much of a difference. Exact ties might, but rank ordering won’t help any there.

  16. My own intuition for their measure is something like “the extent to which the scatterplot of x & y looks like a 1-manifold (possibly with intersections).” Which maybe suggests ways to extend the work to higher orders, and probably also suggests existing related work in the literature on measuring the natural dimensionality of a data set.

  17. Okay, not cool: MINE is licensed under a Creative Commons (okay, good so far!) Attribution (yes, good!) NonCommercial (what? well, that’s a drag, but okay…) NoDerivs (are you serious???) 3.0 Unported License.

    In other words:

    “No Derivative Works — You may not alter, transform, or build upon this work.”

    If you can’t build upon the work, that pretty much takes it out of the realm of academic scholarship, in my opinion.

  18. With my student Noah Simon, I have run fairly extensive power comparisons between MINE, Pearson correlation, and the distance correlation (dcor) of Szekely et al. In almost all of the cases that we looked at, MINE has lower power than dcor. MIC is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome: the loss of power is very substantial. Dcor is simpler, easy to compute, and appears to have better power in general. We are submitting a short letter to Science on this.
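
    For readers who want to run this kind of comparison themselves, here is a minimal sketch (my own toy setup, not the study described above; it assumes the energy package for the distance covariance permutation test, and the effect size, sample size, and replication counts are arbitrary):

    library(energy)  # provides dcov.test(); install.packages("energy") if needed
    set.seed(1)
    reps <- 200; n <- 50
    p_pearson <- p_dcov <- numeric(reps)
    for (i in 1:reps) {
      x <- rnorm(n)
      y <- 0.3 * x + rnorm(n)                        # a weak linear signal
      p_pearson[i] <- cor.test(x, y)$p.value
      p_dcov[i] <- dcov.test(x, y, R = 199)$p.value  # permutation test
    }
    mean(p_pearson < 0.05)  # power of the Pearson correlation test
    mean(p_dcov < 0.05)     # power of the distance covariance test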

  19. Generate a dataset that is uniform on [0,1]x[0,1] and uniform on [1,2]x[1,2]. So, a 2-field checkerboard (or, in fact, you can also try a larger checkerboard!).

    It scores a maximal MIC of 1.0, just like y=x.
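
    A minimal sketch of this counterexample in R (the block-assignment construction and seed are mine; the point is that the 2-by-2 grid cut at x = 1, y = 1 captures the dependence perfectly):

    set.seed(1)
    n <- 1000
    block <- sample(0:1, n, replace = TRUE)  # which square each point falls in
    x <- block + runif(n)
    y <- block + runif(n)
    table(x > 1, y > 1)  # all mass on the two diagonal cells: MI = log 2 here
    # MIC normalizes by log(min(2, 2)) = log 2, so the score is 1, same as y = x.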

  20. Joshua is correct: Brownian distance correlation is what I was referring to.

    Andrew: Cosma Shalizi told me the same checkerboard counterexample, and apparently the authors of the Science paper are aware of it.

    • Rob: Good of you to post this.

      Interesting how simulation has become the great equalizer. (That possibility used to irk some statisticians.)
