Questions about data transplanted in kidney study

Hey—check out the above title. It’s my attempt at a punny, Retraction-Watch-style headline!

OK, now on to the content. Dan Walter writes:

In order to gauge longevity of kidney donors, this paper [by Dorry Segev, Abimereki Muzaale, Brian Caffo, Shruti Mehta, Andrew Singer, Sarah Taranto, Maureen McBride, and Robert Montgomery] compares data collected on about 80,000 living kidney donors from a transplant registry with data on about 9,300 people from the National Health and Nutrition Examination Survey (NHANES III).

It appears to me that they sampled the same people from the NHANES study 7 times.

Can you do that?

I asked him where the “7 times” came from, and he replied:

I suppose I should have said 8, because “this cohort was one-eighth the size of the live donor cohort.”

And then there’s this response in a letter to the journal from some other transplant docs:

. . . for the data analysis, 9364 controls were matched (using replacement matching) with 96,127 donors. How many times each control was used was not described. Given that this technique could magnify any differences, we wonder what the effect of using controls more than once was on the results.

The thing that drives me [Walter] crazy about medical journal articles is that it becomes obvious over time that the data (such as it is) has been so pulverized and reconstituted that they can make it say whatever they want it to say.

OK, there are two issues here. The first is the treatment of the matched data; the second is the concern about repeated analysis of non-virgin data.

On the first question, it all depends on how the data are analyzed. I did not read the article in enough detail to really tell if the standard errors are done right. I agree with Walter that you don’t want to treat repeated uses of the same match as independent data, but it’s not clear to me that Segev et al. made that mistake.
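
To see why the reuse matters for the standard errors, here is a minimal simulation: made-up normal outcomes and the rough cohort sizes quoted in the post, nothing from the paper itself. Treating the duplicated controls as independent observations makes the naive standard error of the matched-cohort mean far too small:

```python
# Minimal simulation of the concern: a small control pool is resampled
# with replacement up to the donor-cohort size. Duplicated controls are
# not independent draws, so the naive SE understates the true variability.
import numpy as np

rng = np.random.default_rng(0)
n_pool, n_matched = 9_364, 80_000   # rough sizes quoted in the post
n_sims = 500

means, naive_se = [], []
for _ in range(n_sims):
    pool = rng.normal(0.0, 1.0, n_pool)                   # NHANES-like pool
    matched = rng.choice(pool, n_matched, replace=True)   # controls reused
    means.append(matched.mean())
    naive_se.append(matched.std(ddof=1) / np.sqrt(n_matched))

print("naive SE (duplicates as independent):", np.mean(naive_se))   # ~0.0035
print("actual sd of matched-cohort mean:    ", np.std(means))       # ~0.011
# The truth is about sqrt(1/n_pool + 1/n_matched), roughly three times
# the naive value: the reused controls add no new information.
```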

On the second question: yeah, that’s an interesting point, and it comes up in a lot of important political science work; see, for example, this fascinating back-and-forth from several years ago involving Jim Campbell, Larry Bartels, and Doug Hibbs regarding the much-analyzed correlations between American economic growth and the party of the president. So I could imagine that kidney transplant data too have this much-pored-over, overdetermined character. That said, you still have to do something, so again it doesn’t imply that Segev et al. did anything wrong.

In any case, both these points are interesting so I’m sharing them with you.

9 thoughts on “Questions about data transplanted in kidney study”

  1. The researchers say this: “Although NHANES III is a large, representative, and commonly studied population of potential comparison patients, this cohort was one-eighth the size of the live donor cohort after appropriate exclusions. As a result, in generating a matched cohort based on these patients, we had to sample with replacement (some patients were used more than once in the matched cohort). Although this accounted for confounding by making the matched cohort similar in demographics to the live donor cohort, the oversampling caused an artificially larger sample size for the purposes of standard error estimates. Of all statistical analyses performed in our study, the only one affected was the statistical comparison of the live donor survival with the matched cohort survival, where we found that live donors had a statistically significantly better survival than their NHANES III counterparts. Although it is unlikely that this substantial difference was driven by the artificial increase in sample size, we can still safely conclude that live donors did not have statistically significantly worse survival than their NHANES III counterparts.”

    Would it have been possible to calculate standard errors using the actual size of the comparison group? Or would that not be legitimate?
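
    One rough way to get at this question, assuming the reuse counts were known (the counts below are hypothetical): treat each control’s reuse count as a weight and compute Kish’s effective sample size, which is roughly the n the standard errors should be based on rather than the inflated matched-cohort count.

    ```python
    # Kish's effective sample size from hypothetical reuse counts:
    # 9,364 controls drawn with replacement ~80,000 times in total.
    import numpy as np

    rng = np.random.default_rng(1)
    counts = rng.multinomial(80_000, np.ones(9_364) / 9_364)

    n_eff = counts.sum() ** 2 / (counts ** 2).sum()   # Kish formula
    print(f"matched size: {counts.sum():,}  effective size: {n_eff:,.0f}")
    # n_eff comes out near the pool size (~8,400 here), nowhere near 80,000,
    # so basing the SEs on the actual comparison-group size looks like a
    # sensible first cut. Real matching weights vary with the covariates,
    # which would shrink n_eff further.
    ```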

  2. Along the same lines, was it necessary/appropriate to use a matched cohort here? I did not look into the statistical tests to see if these were actually treated as matched samples or if they were simply treated as two random samples from two populations that were designed to be roughly similar. If the latter, then I guess my question is moot. But if they actually relied on some type of matched sample tests, I am somewhat dubious. I think matched samples are always suspect – except in rare cases. This could be one such case, if we believe that the difference between donors and non-donors is due to some completely exogenous factor such as not being aware of donor possibilities, or just some gut-level aversion to being a donor. However, it is entirely possible that donors are in some way “different” from non-donors. The attempt to match demographic and health status is good and appropriate, but I would not then treat them as matched samples in the statistical sense. They would still be drawn from two different populations.

    Can someone tell me if this is a valid concern with the techniques used in this study?
    Thanks.

    • If you are not matching 1-to-1 and doing a matched statistical test, then the control group is just being weighted, not matched (i.e. control group person 1 gets a weight of 10, control group person 2 gets a weight of 3, etc.).

      I would refer to this as a weighted analysis rather than a matched analysis, even though to derive the weights one would try to do 1-to-1 matching. But that terminology may not be general.
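
      In code, the translation from a with-replacement matched cohort to weights is direct (toy numbers, not from the study). Note the point estimate is unchanged; only the standard error calculation should treat the data differently:

      ```python
      # A with-replacement matched cohort is just the control pool with
      # integer weights. Toy example: 4 controls filling 7 matched slots.
      import numpy as np

      matched_ids = np.array([3, 1, 3, 0, 3, 1, 2])   # control chosen per donor
      ids, weights = np.unique(matched_ids, return_counts=True)
      print(dict(zip(ids.tolist(), weights.tolist())))   # {0: 1, 1: 2, 2: 1, 3: 3}

      outcomes = np.array([70.0, 82.0, 75.0, 79.0])   # one outcome per control
      # weighted mean over 4 distinct people == unweighted mean over 7 rows
      print(np.average(outcomes[ids], weights=weights))   # 78.0
      print(outcomes[matched_ids].mean())                 # 78.0
      ```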

  3. Since the standard errors are so much smaller for true matched sample tests, we would want to know that there really was only one sample – in other words, that the “two” samples differed only in whether or not they were transplant donors. Since these are different people, that is an impossibility. So, I would think that these must really be two different samples. However, if you don’t try to match the most important characteristics (such as age, race, gender, health status), then you would unnecessarily inflate the variability. I guess I would think that you want two samples that look similar in all ways clearly important to the risks involved with transplants, and then view them as two independent samples.

    • It is an issue of generalizing the parameter estimates of the model being tested to the population. Matching is an attempt to control for potential confounds that could cause differences in outcomes unrelated to the predictors of interest. Without reading the article, my assumption is that the researchers wanted to use all of the data and thus repeated the use of the matched controls. Ideally the number of controls would match the number of donors, but that evidently was not the case. To properly estimate the parameters, they would need to estimate errors in a way that takes into account the dependency across repeated instances of matched controls.
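
      One standard way to handle that dependency (an illustration of the general idea, not necessarily what the authors did) is to cluster the standard errors on control identity, so repeated rows of the same NHANES person are not counted as independent:

      ```python
      # Cluster-robust SEs with repeated controls: each donor is their own
      # cluster; every reuse of a control shares that control's cluster id.
      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(2)
      n_pool, n_matched = 500, 4_000                    # toy sizes

      pool_y = rng.normal(75, 10, n_pool)               # control outcomes
      control_id = rng.integers(0, n_pool, n_matched)   # matches w/ replacement
      donor_y = rng.normal(76, 10, n_matched)

      y = np.concatenate([donor_y, pool_y[control_id]])
      is_donor = np.concatenate([np.ones(n_matched), np.zeros(n_matched)])
      groups = np.concatenate([np.arange(n_matched) + n_pool, control_id])

      X = sm.add_constant(is_donor)
      naive = sm.OLS(y, X).fit()
      clustered = sm.OLS(y, X).fit(cov_type="cluster",
                                   cov_kwds={"groups": groups})
      print("naive SE:    ", naive.bse[1])       # ~0.22, too optimistic
      print("clustered SE:", clustered.bse[1])   # ~0.47, reuse accounted for
      ```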

  4. “The thing that drives me [Walter] crazy about medical journal articles is that it becomes obvious over time that the data (such as it is) has been so pulverized and reconstituted that they can make it say whatever they want it to say.”

    This is called data abuse. I came up with an analogy for this a few years ago.

    Say you get a new job, meet new colleagues, and they all tell you about how they have these amazingly behaved dogs. One weekend, a colleague has a party at their home and you attend. The dog is brought out to show off, and it is great: all sorts of tricks, etc. Then, oddly, as soon as the performance is finished, the colleague quickly grabs up the dog and puts it back in the basement. Some people ask to see the dog again later and are assured this will be possible in a moment, but it never ends up happening. The party ends, everyone goes home.

    The next weekend a different colleague is having a party, and you attend that too. Like before, there is an amazing dog performance, but again the dog is oddly hidden away somewhere before and after the performance. This time you have had a bit to drink and decide to investigate. You open the last door you saw the dog taken through, and recoil in disgust. The room is filled with the worst dog-torture devices imaginable, some of them clearly recently used. It slowly dawns on you, in horror: how far does this practice extend? *Everyone* at work has a dog they brag about. Is this an entire community of dog torturers?

  5. I am not sure whether it makes sense, but here is what I think might be going on. To get statistical significance you have to compare the population of donors with a properly weighted (to match demographics, health status, etc.) population of non-donors. Suppose you’ve done that and found some small but statistically significant difference. Now the question is whether it has real-life significance. From a general public-health perspective, if the distribution of life expectancies is much wider than the observed difference, it probably does not matter much. For example, if the typical life expectancy is 83 +/- 10 years, then a few months’ difference does not seem very important. But on the individual level, if you tell a potential donor that her life may be shortened by a few months because of the donation, that might be something to think about. But! The statistical procedure must allow for such scaling, if someone wants it. Does that make sense?
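
    The comment’s back-of-envelope, in numbers (all values illustrative): with cohorts this large, a gap that is tiny against the spread of life expectancies can still clear the significance bar.

    ```python
    # A 3-month gap against an 83 +/- 10 year distribution: negligible at
    # the population level, yet "significant" with n ~ 80,000 per group.
    import math

    sd_life = 10.0                 # the comment's hypothetical spread
    diff_years = 3.0 / 12          # a hypothetical 3-month difference
    n = 80_000                     # per-group sample size

    print("effect in sd units:", diff_years / sd_life)   # 0.025 -- tiny
    se = sd_life * math.sqrt(2.0 / n)                     # two-sample SE
    print("z statistic:", diff_years / se)                # 5.0 -> significant
    ```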

  6. What should they have done differently? If they had matched 1:1, they would have thrown away much of the variation in the donor population. Instead of 1:N matching, would the error analysis be more straightforward if they did enough replications of 1:1 matching to include all 96K donors?

    Does anyone have a reference for properly calculating the errors with the N:1 matching?
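
    I don’t have a reference to offer, but one generic approach that sidesteps the N:1 weighting algebra is to bootstrap the whole pipeline: resample both cohorts, redo the with-replacement match inside each replicate, and take the spread of the re-estimated difference as the standard error. One caveat: the plain bootstrap is known to misbehave for nearest-neighbor matching estimators, so treat this sketch (with a toy stand-in for the real matching step) as a starting point, not a recommendation.

    ```python
    # Bootstrap over the full matching pipeline. All data and the matching
    # rule are toy stand-ins, not the study's actual procedure.
    import numpy as np

    rng = np.random.default_rng(3)

    def match_with_replacement(donor_x, pool_x, pool_y):
        """Toy 1-nearest-neighbor match on a single covariate."""
        idx = np.abs(pool_x[None, :] - donor_x[:, None]).argmin(axis=1)
        return pool_y[idx]

    n_pool, n_donor = 300, 2_000
    pool_x = rng.normal(0.0, 1.0, n_pool)
    donor_x = rng.normal(0.2, 1.0, n_donor)            # donors differ a bit
    pool_y = 75 + 2 * pool_x + rng.normal(0, 5, n_pool)
    donor_y = 76 + 2 * donor_x + rng.normal(0, 5, n_donor)

    diffs = []
    for _ in range(200):
        d = rng.integers(0, n_donor, n_donor)          # resample donors
        p = rng.integers(0, n_pool, n_pool)            # resample controls
        matched = match_with_replacement(donor_x[d], pool_x[p], pool_y[p])
        diffs.append(donor_y[d].mean() - matched.mean())

    print("bootstrap SE of the donor-minus-control gap:", np.std(diffs))
    ```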
