Don’t do the Wilcoxon

The Wilcoxon test is a nonparametric rank-based test for comparing two groups. It’s a cool idea because, if data are continuous and there is no possibility of a tie, the reference distribution depends only on the sample size. There are no nuisance parameters, and the distribution can be tabulated. From a Bayesian point of view, however, this is no big deal, and I prefer to think of Wilcoxon as a procedure that throws away information (by reducing the data to ranks) to gain robustness.

Fine. But if you’re gonna do that, I’d recommend instead the following approach:

1. As in classical Wilcoxon, replace the data by their ranks: 1, 2, . . . N.

2. Translate these ranks into z-scores using the inverse-normal cdf applied to the values 1/(2*N), 3/(2*N), . . . (2*N – 1)/(2*N).

3. Fit a normal model.

In simple examples this should work just about the same as Wilcoxon, as it is based on the same general principle, which is to discard the numerical information in the data and just keep the ranks. The advantage of this new approach is that, by using the normal distribution, it allows you to plug in all the standard methods that you’re familiar with: regression, analysis of variance, multilevel models, measurement-error models, and so on.
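
Here is a minimal R sketch of steps 1-3 for a simple two-group comparison (the numbers are made up; y1 and y2 stand in for your two samples):

y1 <- c(1.2, 3.5, 0.7, 9.1, 2.2)
y2 <- c(4.8, 2.9, 11.0, 6.3, 5.1)
y <- c(y1, y2)
N <- length(y)
r <- rank(y)                       # step 1: replace the data by their ranks 1, ..., N
z <- qnorm((2*r - 1)/(2*N))        # step 2: inverse-normal cdf applied to (2k - 1)/(2N)
group <- rep(c(0, 1), c(length(y1), length(y2)))
fit <- lm(z ~ group)               # step 3: fit a normal model (here, a two-group regression)
summary(fit)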

The trouble with Wilcoxon is that it’s a bit of a dead end: if you want to do anything more complicated than a simple comparison of two groups, you have to come up with new procedures and work out new reference distributions. With the transform-to-normal approach you can do pretty much anything you want.

The question arises: if my simple recommended approach indeed dominates Wilcoxon, how is it that Wilcoxon remains popular? I think much has to do with computation: the inverse-normal transformation is now trivial, but in the old days it would’ve added a lot of work to what, after all, is intended to be rapid and approximate.

Take-home message

I am not saying that the rank-then-inverse-normal-transform strategy is always or even often a good idea. What I’m saying is that, if you were planning to do a rank transformation before analyzing your data, I recommend this z-score approach rather than the classical Wilcoxon method.

52 thoughts on “Don’t do the Wilcoxon”

  1. This proposal makes some sense, but I think you’ve overstated the advantage. When you say that the problem with Wilcoxon is that “if you want to do anything more complicated than a simple comparison of two groups, you have to come up with new procedures and work out new reference distributions,” that’s true, but the computing revolution has not only made your “inverse normal Wilcoxon” easy to implement, it’s made the new reference distributions easy to do via now easy-to-implement randomization methods. I grant they’re not *as* easy as your method, but they have the advantage of being precisely tunable to the problem you have rather than simply transformed into normal data which is then subject to the traditional toolbox. It’s that transformation step which may be dramatically worse for the particular problem that you have…

    • Jonathan:

      I think my real point is that it’s easy to take my procedure and generalize it and run regressions. With Wilcoxon, each new step in the modeling process requires a new formulation, new calculation of significance levels, etc. Yes, this can be done with modern computation, but it gets in the way of modeling. In my recommended framework, once you’ve bitten the bullet and replaced the data by ranks, you can jump straight to regression modeling and focus on the applied problem.

      • The regressions are easy to run, but doesn’t this transform make the regression coefficients very hard to interpret? Instead of a regression coefficient meaning, say, an expected difference in outcome (or similar – the early chapters of your book with Jennifer Hill are extremely clear about what coefficients can mean), with the transformation you’d be describing a mean difference in outcome-renormalized-by-a-complicated-function-of-all-the-outcomes. It’s not an invalid approach – it’s known to be somewhat useful for testing, and several regression methods using normal scores exist already – but I don’t see how it’s much help for regression modeling.

        If bounded influence is desired for robustness, the outlier-robust adaptations of regression tools due to Huber and others could be used. These don’t require transformations of the data.

        • George:

          I’m not saying you should transform the data. I’m saying that if you were going to do Wilcoxon, then you’re already doing a rank transformation, in which case I think it makes sense to do it in an expandable way, rather than getting hung up on some pre-tabulated test statistics.

        • Understood. But why expand it to regression? Isn’t it better to first go back to the original data and think, from there, about what a regression analysis might usefully do?

          I doubt that, unless the choice of analysis tools is very limited, modeling inverse-Normal-ranking of the outcomes would be the stand-alone analysis of choice. So doing this as a follow-on to preliminary Wilcoxon or Van der Waerden tests (neither of which appeal much on their own) rather than starting over, seems to sacrifice a lot just for compatibility.

          Apologies for going on about this – I agree with other comments that what’s recommended here will get used in practice, so we should discuss methods’ drawbacks and alternatives, as well as benefits (e.g. robustness to outliers, here).

        • George:

          I agree with you on the substance. Again, my point is that if you were going to do Wilcoxon, then you’ve already committed to doing ranks, at which point I think it makes sense to model those ranks rather than to frame the analysis as a hypothesis test which just happens to have been tabulated by somebody once. In 1960 it would’ve been a different story. But in 2015, if you want to replace the data by ranks, I say replace the data by ranks and then fit a model.

  2. Scott Halpern, Meeta Prasad Kerlin, Dylan Small, and I have a paper discussing some other issues with the Wilcoxon test (see section 3):

    gated: http://doi.org/10.1177/0962280214545121

    ungated: http://www.columbia.edu/~wl2513/icu-los.pdf

    We aren’t making original points there (the heavy lifting was done by several papers we cite, including a recent one by EunYi Chung and Joseph Romano), but we give some simulations that may help illustrate some of the issues. As we write in section 3.2, the Wilcoxon test “is valid for the strong null [the hypothesis that treatment has no effect on any patient], but it is sensitive to certain kinds of departures from the strong null and not others. For example, it is more likely to reject the null when treatment narrows the spread of the outcome distribution and there are more treated than control patients, or when treatment widens the spread and there are more control than treated patients. It is less likely to reject when the opposite is true. These properties complicate the test’s interpretation and are probably not well known to most of its users.”
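
    To give a rough sense of that behavior in code, here is a simulation sketch (the specific setup is made up for illustration: normal outcomes, both groups centered at zero, treatment narrowing the spread, and more treated than control patients):

    set.seed(1)
    sims <- 2000
    reject <- 0
    for (s in 1:sims) {
      control <- rnorm(25, 0, 1)      # fewer control patients, wider spread
      treated <- rnorm(100, 0, 0.5)   # more treated patients, narrower spread, same center
      if (wilcox.test(treated, control)$p.value < 0.05) reject <- reject + 1
    }
    reject/sims   # typically comes out noticeably above 0.05, even though both means and medians are 0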

    The rank-then-inverse-normal-transform test is also known as the van der Waerden or normal scores test:

    https://en.wikipedia.org/wiki/Van_der_Waerden_test

    John W. Pratt’s 1964 paper on the (non)robustness of the Wilcoxon, van der Waerden, and other tests is very good (but not easy reading):

    http://www.jstor.org/stable/2283092

    We’ve also cited this very good post by Thomas Lumley:

    http://notstatschat.tumblr.com/post/63237480043/rock-paper-scissors-wilcoxon-test

    As Cyrus Samii said when I shared Lumley’s post with him, “I see it as another reason to put stock in methods geared toward estimation, as it is in doing so that one operates on the scale necessary to make the kinds of tradeoffs that resolve such transitivity problems.”

    http://cyrussamii.com/?p=1622#comment-1339

  3. * Why is Wilcoxon popular?
    Probably because it’s simple to explain to newbies. If you want to convey the logic of reference distributions and statistical tests, with minimal overhead (no asymptotic justifications, no “complicated” formulas for the Normal distribution, etc.), this is a good place to start.

    * A concern about your proposed alternative:
    If you transform the ranks to z-scores… don’t you just lose the robustness that you were trying to gain by using ranks in the first place?

    * A different alternative:
    If you want nonparametric robustness, but want to do more than just compare 2 groups, why not just move to a general resampling or permutation approach instead?

  4. The transformation to ranks makes the data essentially uniform under the null hypothesis, and you only need around 5 or 6 uniform values for the mean to be approximately normal (the Central Limit Theorem does not need a very big sample size for uniform data). So for many cases (sample sizes where the smallest group is at least 5), applying the normal model to the ranks directly will also work.

    My argument for not doing the Wilcoxon is that with modern computing we can easily do a permutation test on the original data looking at the actual parameter(s) of interest rather than trying to understand/explain the parameter tested by the Wilcoxon.
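
    For example, something along these lines (a sketch with made-up data, taking the difference in means as the quantity of interest):

    set.seed(1)
    y1 <- rexp(30)         # hypothetical group 1
    y2 <- rexp(30) + 0.3   # hypothetical group 2
    obs <- mean(y2) - mean(y1)
    y <- c(y1, y2)
    n1 <- length(y1)
    perm <- replicate(10000, {
      shuffled <- sample(y)   # relabel the observations under the null of no group effect
      mean(shuffled[-(1:n1)]) - mean(shuffled[1:n1])
    })
    mean(abs(perm) >= abs(obs))   # two-sided permutation p-value for the difference in means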

  5. How does your approach of rank-then-inverse-normal-transformation compare to regression on order statistics:
    http://link.springer.com/article/10.1007%2Fs00362-009-0294-9 (paywall for those of us not at a university)

    I find myself holding my nose and doing the Wilcoxon or the slightly more general Gehan test quite a bit these days. I work in the world of environmental remediation and so find myself up against the EPA and their stubborn ways. There is this continual issue of non-detects/censored values (chemical concentrations that fall below some specified detection limit). Given that this is spatial data, and that chemical results are collected as a suite of measurements (e.g. there are 209 PCBs, but in any given sample somewhere between 0 and 209 will be censored), I try to argue that there is more information in the design and data than simply doing a Gehan test to compare site and reference areas or a Kaplan-Meier estimate for total chemical exposure. But the EPA doesn’t necessarily want to hear about some crazy multivariate multiple-imputation spatial model that I would love to pursue as a research project. (Here I would use spatial correlation as well as correlation among chemical results to impute the censored values and calculate means/totals, what have you. It’s all intuition and imagination at this point though. (Double-parenthetically: anybody looking for a PhD student? :) )) So instead I find myself using the Gehan test, because it’s what they know. There is more precedent for regression on order statistics than for some novel spatial model, but I haven’t explored it deeply enough to consider whether it is better than the simpler rank-based methods.

  6. What a weird notion. Interesting too. But no rationale? No evidence? No justification for the z-scoring? The whole point of non-parametric tests is to get rid of assumptions like the normality assumption. It is natural that by giving away these assumptions you lose some sensitivity, but in some cases, that’s the right trade-off.

    • Ahmed:

      No, the Wilcoxon test has a huge assumption which is that the distributions are identical in the two groups. This is a much more important assumption than normality. You are expressing a common misunderstanding, which is to overrate the importance of distributional assumptions and to understate the importance of structural assumptions such as additivity.

      • “assumptions not needed if it’s non-parametric! works even if your data aren’t normal! free lunches for all if you’ve learned the right test!”

        Same BS with the Fisher exact test and KS tests…

      • No, distributional assumptions are only necessary if one’s inquiry is about the medians. If one’s entire interest is the dominance of one sample over the other, there are no distributional assumptions involved in the test.

  7. At the risk of going all Erich Lehmann, the main justification for using the Wilcoxon test is that its worst case Pitman efficiency relative to the t-test is approximately 0.86 (assuming a common symmetric error distribution for the two groups). Of course, the assumptions underlying this calculation never hold but that can be said of practically any procedure.

    An estimator of the effect size based on the Wilcoxon test is the Hodges-Lehmann estimator, which takes the median of all possible differences between the two groups – a pretty sensible and fairly robust estimator.
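
    In R, with made-up data, that estimator is just the median of all pairwise differences (wilcox.test also reports a location estimate of this kind when asked for a confidence interval):

    set.seed(1)
    x <- rnorm(20, 1)   # treatment group (hypothetical)
    y <- rnorm(20, 0)   # control group (hypothetical)
    median(outer(x, y, "-"))                      # median of all pairwise differences
    wilcox.test(x, y, conf.int = TRUE)$estimate   # the "difference in location" estimate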

      • Corey: presumably it’s consistent for the median of all possible differences between the two groups in the population? Unless something very weird happens…

      • It would be the median of the distribution of X-Y where X and Y are independent with distributions F (treatment) and G (control), respectively.

        • It’s worth emphasizing that there isn’t any functional of the two distributions *separately* that corresponds to the test — the median difference isn’t (of course) the same as the difference in medians, nor is it the difference in any other one-sample statistic.

          If there were such a functional, the test would necessarily be transitive, and it’s not.

  8. I think you ought to be more explicit about the decision criteria:

    1. Ease of exposition. Easier for W if you are a newbie; easier for regression if you have taken some applied stats courses (pretty much all they teach is regression).

    2. Ease of use. With modern software I don’t see a difference.

    3. Substantive insight. I’d say W is useful as part of an analysis process that starts with hypothesis testing and then leads to estimation (on the original scale). (I can think of few instances where the effect on ranks is the parameter of interest, so your regression estimate is, in my view, as much of a dead end as you claim W is.) If you are going to estimate, estimate something useful. It’s like Bayes: if you are going to use it, you might as well use informative priors.

    4. Efficiency. In small samples I understand W will do better. But I have seen conflicting advice.

    So in small samples I appeal to point 4 and use W for efficiency, asking as little as possible from the data. Then I do 3: I might reject the sharp null but not find a statistically significant quantity of interest. If you are starved for evidence, this is better than nothing.

  9. Unless I’m misunderstanding you, your test seems to have very bad power compared to the Wilcoxon test

    rm(list=ls())
    sims <- 1000
    wilcoxrejections <- 0
    gelmanrejections <- 0

    delta <- 0.5
    for (s in 1:sims) {
    y1 <- rnorm(50,0,1)
    y2 <- rnorm(50,delta,1)
    if (wilcox.test(y1,y2)$p.value < 0.05) {wilcoxrejections <- wilcoxrejections+1}

    ranks <- order(c(y1,y2))
    y1 <- ranks[1:length(y1)] / (2*(length(y1)+length(y2)))
    y2 <- ranks[-(1:length(y1))] / (2*(length(y1)+length(y2)))
    if (t.test(y1,y2)$p.value < 0.05) {gelmanrejections <- gelmanrejections+1}

    }
    wilcoxrejections/sims
    [1] 0.686
    gelmanrejections/sims
    [1] 0.406

  10. whoops sorry, I’m half asleep and screwed that code up massively. I’ve fixed it and the power is identical. Sorry!

    rm(list=ls())
    sims <- 1000
    wilcoxrejections <- 0
    gelmanrejections <- 0

    delta <- 0.5
    for (s in 1:sims) {
    y1 <- rnorm(50,0,1)
    y2 <- rnorm(50,delta,1)
    if (wilcox.test(y1,y2)$p.value < 0.05) {wilcoxrejections <- wilcoxrejections+1}

    N <- length(y1)+length(y2)
    temp <- seq(1,2*N-1,by=2)/(2*N)
    ranks <- temp[rank(c(y1,y2))]
    y1 <- ranks[1:length(y1)]
    y2 <- ranks[-(1:length(y1))]
    if (t.test(y1,y2)$p.value < 0.05) {gelmanrejections <- gelmanrejections+1}

    }
    > wilcoxrejections/sims
    [1] 0.655
    > gelmanrejections/sims
    [1] 0.657

  11. (Actually it's still wrong, because

    ranks <- temp[rank(c(y1,y2))]

    should actually be

    ranks <- qnorm(temp[rank(c(y1,y2))])

    However, fixing this doesn't affect the power.)

  12. Unless I’m confused, which is possible, the whole idea here goes along these lines:

    You have some data which you assume comes from distribution A, and some data which you assume comes from distribution B. Both distributions are "weird" in some way: usually they have fat tails and/or are potentially multi-modal, etc.

    If you knew the formulas for distribution A, and B, you could do y_a <- CDF(A)(x_a) and get y_a which would be uniformly distributed on 0,1, this is sometimes called the "probability integral transform". Once you have a random variable which is uniform on (0,1) you can turn it into any kind of random variable you like, but lots of procedures are based on normal variables, so you can do z_a <- INVNORMCDF(y_a) and get z_a which is normally distributed.

    Now, you can do all the modeling you like based on normal data, and then if you have predictions on the normal scale you can take them INVCDF(A)(NORMCDF(predictions)) and get predictions on the original scale… yay.

    Obviously, the exact same logic applies to B, and if the CDF(A) and CDF(B) are known, then you'd be able to transform the x_a and x_b data to have the same exact distribution, namely unit normal.

    The only problem here is that you don't know CDF(A); all you have is a sample from A. So, you assume that the empirical CDF is close to the CDF of A, and you do ECDF(x_a)(x_a) instead of CDF(A)(x_a); however, it just so happens that ECDF(x_a)(x_a) is (proportional to) the rank of the data points x_a.

    So, the underlying assumption here is really that the empirical CDF of your sample is not too far off from the actual CDF of the random variable A, and ALSO that there IS a distribution A which models the data well (which is already a problem if the data is in a time series which is trending, etc).
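
    In R, the chain above might look something like this (a sketch, with a hypothetical fat-tailed sample standing in for A):

    set.seed(1)
    x_a <- rt(200, df = 2)                 # hypothetical fat-tailed sample standing in for A
    u_a <- (rank(x_a) - 0.5)/length(x_a)   # ECDF-style step via ranks (the 0.5 keeps values inside (0,1))
    z_a <- qnorm(u_a)                      # inverse-normal step: approximately normal scores
    # ... fit whatever normal-theory model you like to z_a, then map a prediction back:
    z_pred <- 1.5                          # some prediction on the normal scale
    quantile(x_a, pnorm(z_pred))           # empirical stand-in for INVCDF(A)(NORMCDF(z_pred))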

    Did I get that more or less correct?

      • Yes, that’s a useful result, especially since I think a lot of people assume that result intuitively: it makes sense, if you have a sample of data, to assume that the, say, 90th percentile point of the data is probably a good estimate of the 90th percentile point of the population, but it’s not actually obvious that it HAS to be true. I’d like to see that proof, actually, because I think the techniques used might be good to know.

        • I guess it’s fairly obvious that it has to be a consistent estimator: if the sample is enormous, it converges to the distribution of the population. That it is unbiased is less obvious.

        • For consistency of the quantiles you want the Glivenko-Cantelli lemma, which has a very straightforward proof (unlike the central-limit-theorem extensions). It says that the empirical CDF converges uniformly to the true CDF almost surely, which means the quantiles converge as long as the density is non-zero at the quantile.

          You can make the Glivenko-Cantelli lemma work for many sorts of dependent data as well as iid data, but for iid data there’s even a hard probabilistic bound: for any distribution

          Pr[ sup |ECDF(x) - CDF(x)| > t ] < 2 exp(-2nt^2)

          This bound (Massart's inequality) is used in testing the random number generators in R.
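
          A quick numerical sanity check of the bound (iid standard normal data; n and t chosen arbitrarily):

          set.seed(1)
          n <- 100; t <- 0.1
          sup_dist <- replicate(5000, {
            Fx <- pnorm(sort(rnorm(n)))
            max(pmax((1:n)/n - Fx, Fx - (0:(n - 1))/n))   # sup |ECDF - CDF| for a continuous CDF
          })
          mean(sup_dist > t)   # simulated Pr[ sup |ECDF - CDF| > t ]
          2*exp(-2*n*t^2)      # the bound; the exact probability is guaranteed to sit below this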

        • And it’s not unbiased, but it is asymptotically unbiased: the bias is of smaller order than the standard error, like most parametric estimates and unlike density estimates.

        • The denominator is (N+1), not N; this is what makes it unbiased.

          If a distribution has CDF F(x) and PDF f(x), then the pdf of the r’th ranked observation in an IID sample of size N is proportional to:

          (F(x)^(r-1))·((1 – F(x))^(N-r))·f(x)

          The normalizing constant can be computed by changing the variable of integration from x to F, yielding the Beta function B(r, N – r + 1). Computing the expectation of F(x) w.r.t. this distribution then gives a ratio of Beta functions:

          B(r + 1, N – r + 1)/B(r, N – r + 1) = r/(N + 1)
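
          A quick Monte Carlo check (standard normal data; N and r picked arbitrarily):

          set.seed(1)
          N <- 10; r <- 3
          Fr <- replicate(20000, pnorm(sort(rnorm(N))[r]))   # F evaluated at the r-th order statistic
          mean(Fr)   # comes out close to r/(N + 1) = 3/11, about 0.273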

  13. Doesn’t this generate a single normal distribution out of my data to which I can’t apply things like a regression because the residuals won’t be normally distributed?

    • Psyoskeptic:

      Again, the distribution of the residuals is typically the least important modeling assumption. Really I think the above procedure would work just fine if the regression were performed on the ranks directly. I just suggested the z-score transformation to make things a little smoother. Of course we would not expect linearity or additivity on the ranks or the z-scores—but that’s the inevitable price you pay for throwing away information by replacing the data by ranks.
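
      For example, a regression with a treatment indicator and a covariate on the transformed outcome is just an ordinary lm call (a sketch with made-up data):

      set.seed(1)
      n <- 100
      x <- rnorm(n)
      treat <- rbinom(n, 1, 0.5)
      y <- exp(1 + 0.5*treat + 0.3*x + rnorm(n))   # a skewed outcome, made up for illustration
      z <- qnorm((2*rank(y) - 1)/(2*n))            # rank, then inverse-normal transform
      summary(lm(z ~ treat + x))                   # ordinary regression on the transformed outcome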

      • Andrew, you are probably not aware of this so I want to point out (again) that people read your comments about residuals as a carte blanche for using models to pump out a p-value, ignoring all properties of the data. That’s what statistical models are used for in many areas, such as psychology and linguistics: plug data in, push p-value (or t-value) out. That’s it. So, if the model is generating absurd predictions that have no bearing with the data, that’s OK because nobody checks those things, and some psycholinguists quote you as justification.

        • Sharavan:

          I think that if someone was going to do the Wilcoxon test, the procedure I described in the above post would be better, one reason being that my procedure is not about pumping out p-values, etc.

  14. I was under the impression that you were pushing for the end of binary testing altogether. What is the benefit of saying “failed to reject” when you could give a whole distribution? Why do statistical testing at all?

    • Because a persistent finding of the machine learning literature is that if you are interested in a particular quantity (eg the difference between means) then the best result is usually obtained by estimating it directly using as few assumptions as possible, rather than trying to estimate a more complex model (e.g the full probability distribution of both samples) and obtaining your original quantity as a marginal. Check the (enormous) literature on generative vs discriminative classifiers, for example.

      Applying this principle to the difference in mean problem, it seems likely that if you just want to decide whether the means are equal, you should estimate the difference directly rather than (eg) fitting a full probability model to both samples (whether parametric, nonparametric, whatever) and marginalising. Frequentist nonparametric testing does this in a really nice way that avoids the need for distributional assumptions – rank testing really is one of the jewels in the crown of frequentist statistics, and (imo) one of the few areas where the whole edifice actually makes sense.

      Obtaining a full distribution for the difference in means would generally involve either making parametric assumptions, fitting a full semi/non-parametric model to the whole data (like a Dirichlet process), or using another frequentist nonparametric procedure such as permutation testing. It’s not clear that any of these approaches would be better than just estimating the thing you are actually interested in.

      • Anon:

        (1) If your goal is to estimate the difference between means in the population, you can use the difference between means in the sample. But if the underlying distribution has long tails, this won’t be such a good estimate.

        (2) Wilcoxon (or my improved alternative above, which of course makes no assumptions beyond those made by Wilcoxon) does not compute a difference between means. It does a rank transformation, so you’ve already left the grounds of “estimating it directly” and have instead gone over to “estimate something that’s easy to work with and hope for the best” territory. It’s pretty rare that “the thing you are actually interested in” is the population quantity estimated by a difference in ranks.

        (3) Regarding the more general benefits of generative models, I refer you to BDA.

        • Andrew:

          With respect to (2), you can always compute a location difference between the two groups by translating one group so that its mean rank score (e.g. mean rank or mean inverse-normal-normalized-rank) is (approximately) equal to the mean rank score of the other group. Now whether this is a sensible thing to do depends on the problem but it’s not a bad arrow to have in one’s quiver.

        • Keith:

          Sure, if you want to do this, you can do it. But I think the way to do so is to model the ranks directly, not to get tangled up in a permutation test.

  15. First you discard numeric information by transforming to ranks, but then you reassign numeric information, assuming that the distances between consecutive datapoints are equal. As such you treat the variable as if it is at the ordinal measurement level, but then you suggest performing analyses that assume the interval level. Isn’t that a problem?

  16. I seem to recall you arguing against using ranks and instead using ratings (Deming was quoted). Never clear to me how ratings were defined, but does this apply to this situation?

  17. Notwithstanding Andrew’s point that normal scores may be more useful for doing other things with the data than just plain testing, I think there was some work in the 1960s-80s on the power of rank tests with different scores in different situations, and I remember that whether the Wilcoxon or the normal scores test was preferable depended on the underlying distribution. I’m not sure whether the difference was large in any situation.

    Personally I think that there are ways to think about what transformation one would want in terms of what I’d call “direct interpretation”, trying to understand what the different methods do to the data. The difference between normal scores and raw ranks is that the effective distance between raw ranks is 1 between any two neighbouring ranks, whereas normal scores put less effective distance between the central ranks and more between the outer ranks. This means that the results of a normal scores test are dominated to some extent (although not as strongly as if untransformed data were used and the data came from a distribution with heavier tails than the normal) by the outer ranks. This is a bad thing if one reason for using ranks is that one believes that the more extreme observations are suspected of being erroneous, and a good thing if one thinks that the more extreme observations are actually more informative – more precisely, the *ranking* of the more extreme observations needs to be more informative, not their values, because otherwise one wouldn’t look at rank scores in the first place (transforming to ranks/rank scores, in my view, implies that we believe, in the given situation, that the information thrown away by the rank transformation will do more harm than good if used, which is certainly the case if there are gross outliers, but also if one has a hard time arguing that the information in the data is of a higher than ordinal scale type).

    The “direct interpretation” case for normal scores seems rather counterintuitive to me at first sight, although there may be applications in which such a case could be made. I’d be happy about examples.
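
    To make the spacing point concrete (N = 20 as an arbitrary example):

    N <- 20
    ns <- qnorm((2*(1:N) - 1)/(2*N))   # normal scores
    round(diff(ns), 2)   # gaps between neighbouring scores: small in the middle, large in the tails
    diff(1:N)            # raw ranks: every gap is exactly 1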

    • +1

      Lehmann (2009) gives a nice overview comparing the Wilcoxon, normal scores, classical t, and permutation t tests in different situations. He writes that “if the tails of F are believed to be as short as or shorter than those of the normal, the NS test offers the greatest advantage,” while “if the tails of F are believed to be as heavy as or heavier than those of the normal, the Wilcoxon test would be the first choice.” And then he gives a few caveats.

      http://dx.doi.org/10.1080/10485250902842727

      It’s a beautifully written, helpful expository piece, but it assumes a constant additive treatment effect, and I agree with Andrew that people often “overrate the importance of distributional assumptions” and “understate the importance of structural assumptions such as additivity.”

      Here are two helpful pieces on non-additivity (but not specifically about Wilcoxon vs. normal scores):

      Romano’s comment on Lehmann:
      http://dx.doi.org/10.1080/10485250902846900

      White and Thompson:
      http://dx.doi.org/10.1002/sim.1420

      Some other useful pieces on non-additivity are summarized and cited in my paper with Scott Halpern, Meeta Kerlin, and Dylan Small (linked in my comment above).

  18. I have been working on a very similar model, using just ranks rather than transforming into z-scores. Just wanted to share a small technical issue.

    Typically, we consider our data to be independent for most statistical tools such as simple linear regression. However, if we transform this to rank scores, they are clearly not independent: if you know the first n-1 ranks, you know the n-th rank as well. This essentially amounts to having the rank scores correlated with r = -1/(n-1). For smaller samples, this should be taken into account, although it may be reasonable to ignore with larger sample sizes; when just using the raw ranks, this would be the equivalent of losing 1 degree of freedom.

    It is unclear to me what the correlation of the transformed rank scores would be, but again I would imagine it to be a mild effect in larger samples.
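
    A quick simulation check of the raw-rank correlation mentioned above (n = 10 as an arbitrary example):

    set.seed(1)
    n <- 10
    R <- t(replicate(20000, rank(rnorm(n))[1:2]))   # ranks of the first two observations, many times over
    cor(R[, 1], R[, 2])   # comes out close to -1/(n - 1) = -1/9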

  19. My understanding is that the rank transformation might work for the case of two groups, and perhaps even three under limited circumstances, but would you claim that you could perform more complicated ANOVA-type testing this way for situations that don’t conform to the requirements of parametric tests?

    • Ariel:

      Yes, that’s the point. The more complicated the analysis, the more to be gained by doing modeling rather than testing. It’s fine to model the data as is, or, if you want a nonparametric approach, you can rank-transform the data before fitting the model.
