Discovering general multidimensional associations

Continuing our discussion of general measures of correlations, Ben Murrell sends along this paper (with corresponding R package), which begins:

When two variables are related by a known function, the coefficient of determination (denoted R-squared) measures the proportion of the total variance in the observations that is explained by that function. This quantifies the strength of the relationship between variables by describing what proportion of the variance is signal as opposed to noise. For linear relationships, this is equal to the square of the correlation coefficient, ρ. When the parametric form of the relationship is unknown, however, it is unclear how to estimate the proportion of explained variance equitably – assigning similar values to equally noisy relationships. Here we demonstrate how to directly estimate a generalized R-squared when the form of the relationship is unknown, and we question the performance of the Maximal Information Coefficient (MIC) – a recently proposed information theoretic measure of dependence. We show that our approach behaves equitably, has more power than MIC to detect association between variables, and converges faster with increasing sample size. Most importantly, our approach generalizes to higher dimensions, which allows us to estimate the strength of multivariate relationships (Y against A,B,…) and to measure association while controlling for covariates (Y against X controlling for C).
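
To make the abstract's starting point concrete, here is a minimal sketch (simulated data, NumPy only) checking that for a straight-line fit with an intercept, the proportion of explained variance equals the squared sample correlation. The function name `r_squared_linear` and the simulated slope and noise level are illustrative assumptions, not anything taken from the paper or its R package.

```python
import numpy as np

def r_squared_linear(x, y):
    """Proportion of variance in y explained by a least-squares line in x.

    Hypothetical helper for illustration only.
    """
    slope, intercept = np.polyfit(x, y, 1)      # ordinary least-squares line
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()      # 1 - SS_res / SS_tot

# Simulated linear relationship with additive noise (assumed values).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

rho = np.corrcoef(x, y)[0, 1]   # sample correlation coefficient
r2 = r_squared_linear(x, y)     # explained variance of the linear fit

print(f"rho^2 = {rho**2:.6f}, R^2 = {r2:.6f}")
```

The two printed numbers agree up to floating-point error. The equality is specific to the linear case; once the functional form is unknown there is no fitted line to take residuals from, which is the gap the paper sets out to address.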

And, since we’re talking about R-squared, let me point you to my 2006 paper with Iain Pardoe, Bayesian measures of explained variance and pooling in multilevel (hierarchical) models.

4 thoughts on “Discovering general multidimensional associations”

  1. I’ve been looking around for a way to approach this problem of correlation of correlations. Any guidance?

    We have L vectors of the same length; each entry takes a discrete value in {0, 1, 2}.

    We construct an (L x L) matrix M that describes the correlation (R-squared) between all pairs of the L vectors.
    Because M is symmetric, and the diagonal of M compares each locus to itself (R-squared = 1), we actually have (L choose 2) separate measures of correlation, where (L choose 2) = L*(L-1)/2.
    We will call this set of correlations T

    We calculate the mean of T.
    Given that our L vectors are correlated, how do we place confidence intervals around the estimate of the mean value of T?
    The more the L vectors are correlated, the fewer effective independent measures go into T, but how can we approach this analytically?
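
One non-analytic route to the question above is a bootstrap, sketched here under a loud assumption: the positions along the vectors (the columns below) are independent observations, such as individuals, so they can be resampled with replacement while keeping all L vectors together. That keeps the dependence among the pairwise correlations intact in every replicate. The function names, the simulated data, and the percentile interval are illustrative choices, not a recommendation from the post or the paper.

```python
import numpy as np

def mean_pairwise_r2(X):
    """Mean squared correlation over all (L choose 2) pairs of rows of X."""
    with np.errstate(invalid="ignore", divide="ignore"):
        C = np.corrcoef(X)                    # L x L correlation matrix
    upper = np.triu_indices(X.shape[0], k=1)  # strictly upper triangle
    return np.nanmean(C[upper] ** 2)          # skip pairs with a constant row

def bootstrap_ci(X, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean pairwise R-squared.

    Assumes the columns of X are exchangeable observations.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        cols = rng.integers(0, n, size=n)     # resample observations
        stats[b] = mean_pairwise_r2(X[:, cols])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return mean_pairwise_r2(X), lo, hi

# Simulated example: 6 correlated vectors with entries in {0, 1, 2}.
rng = np.random.default_rng(1)
latent = rng.normal(size=200)                 # shared signal
X = np.clip(np.round(latent + rng.normal(size=(6, 200)) + 1), 0, 2)

estimate, lo, hi = bootstrap_ci(X)
print(f"mean R^2 = {estimate:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

Because squared correlations are biased upward in small samples, the percentile interval can sit slightly high; a bias-corrected interval may be worth considering. If the observations themselves are dependent, this resampling scheme does not apply as written.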

  2. > We wish to constrain A to vary between 0 and 1, so we cannot allow the null to outperform the alternative model, lest A become negative. We thus define the density of the alternative model at each sample point to be a weighted mixture of dependent (full joint) and independent (product of marginal) models, with a single mixture parameter controlling the proportion for all points.

    Why do they need to do this? Isn’t the dependent model already flexible enough to include the null as a special case? It seems like it is.

    P.S. They should state at the beginning that SI refers to Supporting Information, references were opaque until last line of paper. Very nice overall!

    • Thanks for the comments!

      To clarify, the joint kernel model doesn’t recapitulate total independence for any finite number of points (although it may get there as N gets large). We found that setting the alt up as a mixture of independent and dependent components lets it better handle total independence for small sample sizes, which prevents the curves at the top left of figure 2 from “bottoming out” and never quite getting to 0 as the signal vanishes (which is one of the undesirable features of MIC). The mixture approach also handles outliers quite nicely, letting there be some proportion of the data that doesn’t need to belong to the dependent set of samples.

      And thanks for pointing out the SI issue. We’ll clarify when we submit.
