
No, they won’t share their data.

Jon Baron read the recent article, “Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area,” and sent the following message to one of the authors:

I read with interest your article in JAMA. I have been trying to follow this issue closely, if only because my wife and I are both over 70, but also because I’m interested in data and statistics.

What I was hoping for when I saw your article was a breakdown of hospital admissions (and perhaps other outcomes) by “secondary conditions” (not sure of the term – but things like diabetes and hypertension), age, and possibly sex. I know that “older people are more vulnerable”, but I, and possibly others, have been wondering whether this is because older people are more likely to have these pre-disposing conditions. Maybe I’m missing something, but it seems relevant to look at hospitalization as a function of both age and other conditions. If it turns out (as it has in two very small studies I’ve seen) that the age effect largely disappears when these other conditions are excluded, that seems interesting. I know we can’t draw firm conclusions, but neither can we draw them from anything else you report, since we don’t know how people get into the hospital.

If you are willing to send the raw data, I [Baron] can answer this for myself. I’m very good with data. I might want to put the results in my blog (if you are willing), but I’m not interested in boosting my h index. I’m retired, and I don’t even apply for grants anymore. . . .

He received a two-sentence reply stating that they would not share any data at this time beyond what is in the published article.

I was curious what JAMA’s policy was on this, so I checked on the website and I found this:

But this study is not a clinical trial, so there is no requirement to share data.

It’s disappointing that researchers are refusing to make their data available, even in a public health emergency.

I guess it’s possible that there are some data sharing restrictions because this is information from live patients. Clinical trials are different because they have consent forms, so you can ask patients at the time of entry into the study for permission to use their data. In this case the researchers are just using patients who come into the hospital. On the other hand, it’s not clear to me why the authors of the study should be able to work with the data but the rest of us can’t. They could prepare a de-identified dataset, right?

Anyway, my issue here is not with the authors of this particular study, it’s with the larger culture of data hoarding and the academic rat race.

P.S. Just to re-emphasize a point discussed in the comments: My problem here is not with the authors of the paper. I have no idea what rules they are operating under. My problem is with the general culture of data hoarding, the idea that not sharing data is the norm. It’s great that people can organize this sort of study and collect the data. It seems to me a mistake for the norm to be that the same group that collects the data does all the analyses. It limits what can get done and what can be learned.

65 Comments

  1. Yes, the broader academic rat race that has been the issuance of the special interests. Thus the data have the luxury of patent proprietorship & lobbies in the hood.

  2. Matt says:

    Isn’t this still a HIPAA and/or IRB issue? I’ve found the data sharing and use restrictions to vary wildly by health care organization. Even with just patient arrival data for process analysis I’ve been under severe restrictions on how the (even de-identified) data can be used and shared. I do agree on data hoarding, but I’m just offering a more generous justification that might absolve the authors a bit.

    • george says:

      +1

      It’d be really helpful to know what that two-sentence reply said. Important to distinguish between authors refusing to share versus their being prevented from sharing.

      • Clyde Schechter says:

        I completely agree. Over the last decade or so institutions I deal with have been making data access more and more restricted, and also adding new and more complicated layers of bureaucratic delays to the process when it is made available. Typically they say they are forced to do this because of confidentiality concerns (specifically HIPAA), but HIPAA has now been around for a long time and has not been amended–so it seems that this is either just an excuse or reflects changed perceptions of what HIPAA requires or how it is enforced. I’m not aware of any upsurge in prosecutions for HIPAA violations that might motivate this, but I might just not be in that loop.

        In any case, it really needs to change. But probably that would require an act of Congress, which means don’t expect it to happen any time soon.

  3. Witold Wiecek says:

    I don’t know if the researchers are making a decision on that. Can easily imagine (and I could be wrong) that the practical choice was between getting it past the institutional approval with minimum delay, or doing exactly what they wanted with data.

    > The Northwell Health institutional review board approved this case series as minimal-risk research using data collected for routine clinical practice and waived the requirement for informed consent.

    “They could prepare a de-identified dataset” — that, too, would have to be argued with data holder, because anyone can always claim that data were not really anonymised, so I guess (and may be wrong) that you end up explaining how you are going to do it to a lot of people who 1) don’t care, 2) don’t understand these issues, 3) would rather avoid any responsibility.

  4. Richard Nerland says:

    Which part of making the data public is most important?

    Do you want the data so that you can double check the results with your own statistical inference?

    Do you want the data to be public so that we can aggregate data across data sets?

    https://www.statnews.com/2020/04/22/people-are-dying-from-coronavirus-because-were-not-fast-enough-at-clinical-research/

    Articles like this are calling for changes but are still thinking the ultimate goal is giant RCTs (the gold standard, as everyone knows). I feel like you have a separate voice on this issue.

    Would we be able to draw strong enough inference in a time like this on hydroxychloroquine with a meta analysis with Mr P if all of the smaller studies simply had de-identified data to aggregate?

    • Andrew says:

      Richard:

      Baron explained in his quoted email why he wanted the data to be public. It was so he could perform some analyses that were not in the published paper.

      More generally, I favor a division of labor. It’s great that people can organize this sort of study and collect the data. It seems to me a mistake for the norm to be that the same group that collects the data does all the analyses. It limits what can get done and what can be learned. Also, yes, if you want to go to the next step to do meta-analysis, it would be better to have the raw data than to rely on published analyses, which we know are often flawed.

      • Jeff says:

        Yes, in almost every meta-analysis we are severely hampered in getting rich enough data from only aggregate, trial-level reporting. There is often interest, for example, in exploring differential effects by combinations of patient factors. Research and understanding could benefit if there were a feedback loop back into the design of future individual trials.

      • Josh Rushton says:

        The point about division of labor is a good one. Not only is the scientific advantage glaringly obvious, but it also points to a potential norm that could realistically navigate academic ego/careerism in a way that is much more productive than what we do now. I.e., it’s easy to imagine a norm where initial analytical findings are understood as preliminary — just an advertisement for the data — and where subsequent work that uses a data set always splits credit with the data collectors whose ingenuity/vision made it possible.

        …at least in the short run. I could also see this turning academic prestige on its head if too many big breakthroughs were made by un-pedigreed riff-raff. :)

  5. Andrew says:

    Matt, Witold

    As I wrote in the above post, I don’t think the issue is with the authors. Rather, it’s with the general culture of data hoarding, the idea that not sharing data is the norm. I understand that there are lots of rules, but these rules have the negative consequence of making data hoarding the norm, which slows down research, and advancing research is the goal of doing this work in the first place.

    • Ben says:

      > it’s with the general culture of data hoarding

      Sure data hoarding could be a culture thing, but Matt and Witold point out it could be a legal thing.

      If it is a legal thing, then it probably isn’t fair to blame the authors. Just the paper can’t be validated beyond trust, so that’s that (which you said).

      If it isn’t a legal thing, then you do have an issue with the authors, cause they’re refusing to release data.

      > It’s disappointing that researchers are refusing to make their data available, even in a public health emergency.

      If someone came to you and wanted to collaborate on some data analysis for coronavirus, would you refuse to do work without an agreement that you could release the de-identified data? How can we break the data hoarding mentality?

      Bob pretty much does this — he’ll just outright say he doesn’t want to hear it if it isn’t public. But that’s a tough policy.

      We could also just not draw any conclusions from the paper, but that isn’t ideal either.

      I might be drifting off topic though. There’s probably a follow-up e-mail to be sent to the authors (Is there a legal reason I can’t have the data? Where did you get the data; perhaps I can ask there for access?).

  6. Zhou Fang says:

    > The Northwell Health institutional review board approved this case series as minimal-risk research using data collected for routine clinical practice and waived the requirement for informed consent.

    This is probably the critical sentence for the data issue. I guess they ruled narrowly.

  7. Jonathan says:

    I’d much rather see them focus on getting the data for April analyzed. This is March 1 to April 4. Clinicians could really benefit from an April analysis. An example would be how ventilation practice changes affect mortality. It would be nice to see numbers associated with the passed-around or anecdotal material that has suggested avoiding mechanical ventilation. So rather than focus on the fact they appear unable to share the data – I think because they had to get a waiver of informed consent – I’d like to note that this kind of really early clinical analysis should not only be done regularly but updated regularly so it can feed into better clinical practice and thus better outcomes.

    • Martha (Smith) says:

      “this kind of really early clinical analysis should not only be done regularly but updated regularly so it can feed into better clinical practice and thus better outcomes.”

      Makes a lot of sense. It’s like quality assurance in manufacturing — you collect and analyze the data to improve practice.

  8. jim says:

    I support data sharing. But if the authors invested effort to get data, they have the right to first use. So if they used some of the data for an analysis and still want to do further analyses on the remaining data, then it’s not appropriate to claim they are “data hoarding”.

    • Umm… what is “data hoarding” if not keeping the data to themselves because they want to be the only ones doing further analysis?

      Here I think probably they’re keeping it to themselves because of privacy issues… but in the broader problem in academia, your point is that when you invest time and effort to get data collected and collated, you should get some benefit, some payment…. And I agree. PAY them with grants for putting data together. But keeping it to themselves? No way. That needs to be *disincentivized* not normalized.

      • jim says:

        “Umm… what is “data hoarding” if not keeping the data to themselves because they want to be the only ones doing further analysis?”

        Why on earth would anyone put out the effort to gather data if other people are going to take it before the people who gathered it even use it?

        If that’s what you expect then you’re a data freeloader.

        • jim says:

          It’s appropriate for people to provide data for published analyses insofar as that’s legally possible. No more, no less.

        • No, it’s entirely appropriate to get paid *just to collect and provide data*. You know, like polling companies, and the Census bureau.

          The vast majority of university health research is publicly funded. It’s *OUTRAGEOUS* that people take money and create a private little garden for themselves. Imagine if I took tax money designed for roads and paved a 3 mile long private driveway behind a gate to my mansion.

          • jim says:

            “Imagine if I took tax money designed for roads and paved a 3 mile long private driveway behind a gate to my mansion.”

            Huh??? Not even close.

            The apt analogy to asphalt is that you obtain the asphalt with public money; it can be withheld from public use until it’s used in a road, at which point it must be made available for public use. WTH good is the asphalt in a pile behind a fence? Chances are good it will go in a road sooner rather than later, at which point the public use stipulation becomes effective.

            Researchers have ample incentive to publish. The longer their data remains unused, the more likely they are to be scooped by someone else, at which point their effort to obtain the data may be wasted. Ask Darwin.

        • Eric says:

          This was the same (misguided) argument made back in the 1980’s regarding protein crystallography. Researchers would spend months and money purifying a protein and obtaining data to solve a protein structure. Then they’d try to milk it for as many publications as possible. Tools were even made (remember, in the 1970’s-1980’s) to “extract” the data from figures and graphs.

          Then in 1989 the IUCr published their policy on data deposition: if you don’t deposit your (processed) data, you don’t get to publish, no exceptions. Publications, protein structures, and patents exploded! From 71 structures released in 1989 to 9,664 in 2019. Yes, it’s been easier to do the science (and point out others’ mistakes) but importantly data sharing HAS NOT slowed down science. Sharing has helped the field.

          https://www.rcsb.org/pages/about-us/history

          I’d love medical sciences to catch up with standard practices and data sharing that has been common in some physical sciences for the past 30 years.

          • jim says:

            “This was the same (misguided) argument made back in the 1980’s regarding protein crystallography.”

            No, it’s not. You publish the data used in your paper. Nothing more, nothing less. The point of publishing data is to support the analysis for and allow challenges to a given paper.

            If you force people to release unpublished data what you’ll get is a lot less data. It’s tantamount to demanding that Bill Gates hand over his unused dollars. Ridiculous.

            • Dale Lehman says:

              I disagree with your position. If we assume that the system will not reward careful data collection and release (as is currently the practice), then you are right that forcing data release is likely to reduce the amount of data that is released. However, as I and others have repeatedly argued, the rewards for data collection should be dramatically increased – and, perhaps, decreased for analysis. If we change the rewards along these lines, I think data availability would increase and many faulty (either intentional or unintentional) analyses would decline in prestige. It is precisely the current reward structure that is the problem – the only way you can earn the rewards from the data you collect is by publishing analysis of it – and the trend is that the analysis must be statistically significant and the findings must be dramatic.

              Nothing in this position prevents the people collecting the data from having the first opportunity to publish their analysis. They can do that at the same time that they release the data. It is the protected time period between their publication and release of the data that causes many of the problems we have seen.

              • jim says:

                I’m not sure what you’re suggesting is useful or practical. Data acquisition in science is funded by research proposal for many good reasons:

                1) the data required is specific to a single research question and not directly useful for other questions;

                2) data collection is expensive and time consuming, so just collecting data that doesn’t have a specific use intention isn’t cost effective;

                3) in the case of medical data, just collecting a broad swath of data would complicate privacy matters even more and make the data even more difficult to share.

      • Chris Wilson says:

        I agree with the spirit behind what Daniel is saying. However, several real world factors are at play. First, given the existing incentive structure in academia the point of getting grants isn’t to do rigorous experiments, collect data and share it, it is to fund labs and generate publications that in turn lead to more grants and academic promotion. The quality of the science is only informally maintained by individual commitment to it and culture within certain sub fields and disciplines – as Andrew has documented on this blog over the years, there is plenty of cargo-cult junk passing for science that still apparently gets grants, publications and builds labs and careers. There is literally no line item in a tenure and promotion packet for ‘datasets collected, documented and shared’. And remember too that ‘novelty’ is still a criterion to publish in many top journals.
        So, all in all, given how much work (and luck) is involved in obtaining grants, it’s unreasonable to not let the researchers have first shot at publication with their data. Once you pull on that thread you have to basically reform all of academia. I would be in favor of that, mind you, but it’s gonna be hard work and damn near impossible to piecemeal it.

        • jim says:

          “Once you pull on that thread you have to basically reform all of academia.”

          Yeah, it’s not just academia. Healthcare research probably has billions of dollars of non-academic funding.

      • Ryan King says:

        They aren’t census or a polling company. Their future pay / jobs are dependent on getting publications, not generating datasets. While there are some grants for “generate this data, we’ll figure out what to do with it later” that isn’t most grants, and most researchers wouldn’t get a favorable eval for generating data and then getting scooped on most of the analysis. Any dataset only gives the author 1 publication with certainty if you’re required to share everything.

        The “share what supports the article” proposal fails on this example (the tables are all you need to support the claims, not individual level data).

        MDclone is a company that our institution works with to create synthetic “non-patient” data that closely simulates patient data to get around privacy rules. They aren’t open about how the simulation works, but it probably is fine for exploratory purposes. I think a standardized de-identifying synthetic pathway would be a nice way to share almost-real data. Your IRB has to accept that it works once, and then any proposal can use it even with a waiver. For “real” analyses of course you’d like to use the real data to make sure the simulation didn’t generate artifacts.

  9. Craig K says:

    Andrew,

    Regarding your quotes in the Undark article about Ioannidis. It was very hard to believe I was reading them as coming from you because I read them as having the opposite message of hundreds if not thousands of posts on your blog.

    • Andrew says:

      Craig:

      What quote are you talking about?

      • Craig Kaplan says:

        Andrew: this section:

        “I think they have reasons for believing what they believe beyond what’s in this paper. Like they have their own scientific understanding of the world, and so basically they’re coming into it saying, ‘Hey, we believe this. We think that this disease, this virus has a very low infection fatality rate,’ and then they gather data and the data are consistent with that.

        “If you have reason to believe that story, these data support it,” Gelman added. “If you don’t have such a good reason to believe that story, you can see that the data are kind of ambiguous.”

        >What I don’t understand perhaps relates to the opacity of what “good reason” means: if the reason is not based on evidence, is it a “good” reason? You have posted quite a bit about people fooling themselves, for example on the garden of forking paths etc. This passage comes across as “views can differ on shape of the world and that is OK”. There does seem to be a strong political component to the potential reasoning of these authors, and I felt your quote could be read as an argument for motivated reasoning. It just felt incongruous to the philosophy behind your discussions here.

        • Andrew says:

          Craig:

          It seems clear to me that the authors of that study had reasons for believing their claims, even before the data came in. They viewed their study as confirmation of their existing beliefs. They had good reasons, from their perspective. Their reasons are based on their larger understanding of what’s happening with the coronavirus. They have priors, and what I’m saying is that the data from their recent surveys is consistent with their priors. I think that’s why they came on so strong.

          But, as I said, if you don’t have such a good reason to believe that story, you can see that the data are kind of ambiguous. I don’t know enough about the epidemic to have strong priors. So, to me, these surveys are consistent with the authors’ priors, but they’re also consistent with other priors.

          I’m not arguing for motivated reasoning. I’m arguing for understanding other people’s reasoning.

          This is kind of a tricky point; maybe I should post on it.

          • Craig Kaplan says:

            I think a post would be welcome. But here isn’t it important to ask questions about their priors? I can’t imagine the argument from Bendavid and Bhattacharya about NBA player infection rate (in Wall Street Journal a month ago) could be considered a strong foundation for their prior. Again, is it question begging to assume a “good reason” is behind their beliefs? Even if there were, to what extent would this reason support better their interpretation and press avalanche vs a more careful (and sophisticated) statistical approach? There are other issues here that I think are unwise to divorce from examining how they interpret the data. This is really quite a huge deal. The reasoning of these authors has been examined due to their going on record as supporting this (widespread infection) as a possibility, which of course it is, because if it is not examined it remains possible – but it is a model based on possibility not evidence (fine to test). It is an alternate hypothesis to a high fatality rate. But this discussion misses the point for an average person: a lower fatality rate (perhaps similar to influenza) but a high attack rate (much worse than influenza) still means many deaths. Ignoring that many are making the argument that COVID-19 is “really just the flu” when examining what these specific results mean is not a great idea. There is a lot going on surrounding the context of this particular study. Their further study (only existing in a leaked preprint posted on RedState and then deleted) and further questions about methodology make this a fraught situation.

    • Mendel says:

      Saving the curious a google search:

      https://undark.org/2020/04/24/john-ioannidis-covid-19-death-rate-critics/

      The issue, suggested Andrew Gelman, a statistician at Columbia University who criticized the research team for statistical errors in a post on his popular blog, is that the team is getting out in front of what the data are showing. “It’s not like I’m saying they’re wrong, and someone else is saying they’re right. It kind of comes out that way, but it’s like, yeah, I think their conclusions were a bit strong,” Gelman said. “I think they have reasons for believing what they believe beyond what’s in this paper. Like they have their own scientific understanding of the world, and so basically they’re coming into it saying, ‘Hey, we believe this. We think that this disease, this virus has a very low infection fatality rate,’ and then they gather data and the data are consistent with that.

      “If you have reason to believe that story, these data support it,” Gelman added. “If you don’t have such a good reason to believe that story, you can see that the data are kind of ambiguous.”

      • Andrew says:

        Hi, yes, that’s completely consistent with what I wrote on the blog earlier! So I’m confused about what the question is.

        • Craig Kaplan says:

          Adding for what it is worth- many people I have discussed this with did not read that comment as congruent with your blog post at all- in fact they found it sort of striking, as did I, which is why I raised it.

          • Kaiser says:

            I think many of your readers (including me) think you toned down the criticism a bit. I know you said it is possible if we assume close to 100% specificity. But you also convinced many people that 100% specificity is not a plausible assumption. Many people now believe the study is flawed. The ground shifted; it now looks like you are on the other side saying it may be flawed.

            I don’t understand why they can’t do their reference experiment with 300-1000 samples of pre-covid blood, then the issue of specificity is settled.
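A rough illustration of what such a reference experiment could and could not settle: the sketch below computes an exact one-sided (Clopper-Pearson) bound on the false-positive rate from a set of known-negative samples. The sample sizes 300 and 1000 come from the comment above; the assumption of zero false positives, the 95% level, and the function names are hypothetical choices made for the example, not results from any study.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def false_positive_upper_bound(false_pos, n_neg, alpha=0.05):
    """One-sided exact (Clopper-Pearson) upper bound on the false-positive rate.

    false_pos: positives observed among known-negative samples (hypothetical input)
    n_neg:     number of known-negative (e.g. pre-COVID) samples tested
    """
    if false_pos >= n_neg:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: binom_cdf decreases as p grows
        mid = (lo + hi) / 2
        if binom_cdf(false_pos, n_neg, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical outcome: no false positives at all in the reference samples.
for n_neg in (300, 1000):
    ub = false_positive_upper_bound(0, n_neg)
    print(f"{n_neg} negatives, 0 false positives: "
          f"false-positive rate <= {ub:.4f}, specificity >= {1 - ub:.4f}")
```

Whether a bound like this settles the question depends on how it compares with the raw positive rate in the survey: the bound on the false-positive rate has to sit well below that rate before the survey positives can be read as mostly real.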

  10. Zad says:

    Well ok, will they at least share their scripts? Wait, I forgot… these sorts of studies are usually done on Excel or point and click software and as we saw with the surgeons and post-hoc power fiasco, no one remembers what they clicked anyway so they couldn’t possibly share anything

  11. Thomas says:

    A bit of dissent here.
    Baron is not interested in replicating anything, nor in pooling evidence from several studies. He’s not interested in science either, since he doesn’t plan to publish the results of his analysis in a journal of record (this he justifies by not caring about his h-index, which is telling). He’s a guy with a blog who wants to play – not that there’s anything wrong with that.

    So if he has a worthwhile scientific idea, why not just suggest to the authors that they do the analysis? They know the data, they know what’s problematic (eg, how do you deal with patients who show up with high blood pressure at the ER but no diagnosis? is it undiagnosed hypertension or just a temporary spike?), their scripts are ready.

    But maybe his idea is not so useful. He says “it seems relevant to look at hospitalization as a function of both age and other conditions.” But this is a case-series, 100% of the study participants are hospitalized, so you cannot do that. One could use population-based data to create a denominator, and this is doable for age, and probably for each comorbidity taken separately, but I doubt there is a joint distribution available somewhere for all these variables.

    So I can understand the authors just saying politely No.
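The denominator point can be made concrete with a small sketch. A case series supplies only the numerators (hospitalized patients per cell); to get hospitalization rates by age and comorbidity together, one would also need the population count in each joint cell, which is the joint distribution that may not exist. Every number below is a made-up placeholder chosen for the illustration, not data from the JAMA article or any census.

```python
# Hypothetical case-series counts by (age group, comorbidity status).
hospitalized = {
    ("under 65", "no comorbidity"): 200,
    ("under 65", "comorbidity"): 300,
    ("65+", "no comorbidity"): 50,
    ("65+", "comorbidity"): 450,
}
# Hypothetical population denominators for the same joint cells.
population = {
    ("under 65", "no comorbidity"): 800_000,
    ("under 65", "comorbidity"): 200_000,
    ("65+", "no comorbidity"): 100_000,
    ("65+", "comorbidity"): 150_000,
}

def rate_per_100k(cases, people):
    return 100_000 * cases / people

def age_rate_ratio(status=None):
    """Ratio of 65+ to under-65 hospitalization rates.

    With status=None the strata are pooled (crude ratio); otherwise the
    ratio is computed within the given comorbidity stratum.
    """
    def total(table, age):
        return sum(v for (a, s), v in table.items()
                   if a == age and status in (None, s))
    old = rate_per_100k(total(hospitalized, "65+"), total(population, "65+"))
    young = rate_per_100k(total(hospitalized, "under 65"),
                          total(population, "under 65"))
    return old / young

crude_ratio = age_rate_ratio()
stratum_ratios = {s: age_rate_ratio(s)
                  for s in ("no comorbidity", "comorbidity")}
print("crude 65+ vs under-65 rate ratio:", crude_ratio)
print("within-stratum rate ratios:", stratum_ratios)
```

In this invented example the pooled age ratio is larger than the ratio within either comorbidity stratum, which is the pattern Baron was asking about. Without the joint denominators, neither number can be computed from the case series alone.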

  12. OB says:

    Data should always be available for checking inference, but please do not lose sight of the fact that data collection is among the hardest works in science.

    You have no more right to another’s work than you do to their words.

    If you have an idea, approach a research group humbly, and make a pitch to collaborate with them. And don’t be surprised if some grad student who is toiling over data collection hasn’t already thought of your idea. PIs need to protect the vulnerable on their team who are doing the heavy lifting.

    I love this blog, but the sense of entitlement here, among those who I suspect never lift a finger for data collection, is shocking.

    • Andrew says:

      OB:

      You write, “data collection is among the hardest works in science.” I agree! That’s why I think it’s incredibly wasteful for this valuable and expensive resource not to be used efficiently. We’re in the middle of a goddamn pandemic. People are dying and many more people are losing their livelihoods all over the world. We as a society should be making best use of our valuable data.

      These are data collected from our tax money and our health insurance payments. It’s great that people put in effort to collect these data—also, it’s their job. The idea that we should have to “make a pitch” . . . that just makes me want to scream. Meanwhile, the economy is falling apart all around us.

      As noted in the above post, I’m not blaming this particular research team, which well may be operating under legal restrictions for data sharing. My problem is more with the larger system, which seems focused on career promotion more than scientific understanding. It’s also illogical to think that a group that’s particularly good at data collection should also have analytical skills. Or that a group that’s particularly good at measurement should have any data collection skills. Recall the recent Santa Clara study disaster, and consider the principle of division of labor.

      Also consider the incentives: If data sharing becomes more of a norm, I’d expect to see cleaner data sets, fewer errors, and better analyses.

      • Keith O’Rourke says:

        > I’d expect to see cleaner data sets, fewer errors, and better analyses.
        Much better!

      • Ben says:

        > As noted in the above post, I’m not blaming this particular research team, which well may be operating under legal restrictions for data sharing. My problem is more with the larger system

        > We as a society should be making best use of our valuable data.

        Did NY state ever release the info behind their coronavirus numbers: https://statmodeling.stat.columbia.edu/2020/04/23/new-york-coronavirus-antibody-study-why-i-had-nothing-to-say-to-the-press-on-this-one/ ?

        Can we badger them?

      • OB says:

        I get it. The village is smarter than any chief. We need a smart village. What arrangements will work?

        Among those who post here, there looks to be a greater density of arm-chair statisticians/ bloggers than of research scientists. What is the experience of research scientists? Consider the other side.

        (1) You are already part of a collaborative team of experts that includes data scientists and statisticians, as well as other highly-trained specialists. Your long experience with other specialists humbles you. You are suspicious of wizards who think they can unlock the deep mysteries, save the economy, save lives… and this makes you wary.

        (2) You’re not an evil self-aggrandising cyborg. You love science, and you don’t want people to die or lose work. You’re open to a good idea. Your time is cut thin, but you read every email. You discuss ideas with your team.

        (3) Some good ideas come your way, and you extend the team. You are frequently collaborating. Data sharing within your team is your norm. You grow this village.

        (4) You are regularly approached by nutters. You care about science too much to let the creeping messiahs shape opinion.

        (5) You are regularly approached by sincere armchair statisticians and bloggers who have ideas that already occurred to your postdoc last month, and she’s been working on it. You are not going to throw her career under the bus — she’s doing the heavy lifting in data collection — but you’re open to accelerating her work through a productive collaboration. You need a reason to discuss this with her. “Hey, can I please have your data (to do the obvious work, without the benefit of internal checks from the experts on your team…)” is not a compelling reason. But occasionally, you extend the team to make it more efficient, even for work that is already ongoing, because the collaboration made sense.

        (6) You believe that every published study should be scrutinized. Anonymised data should be available on request for the purposes of checking the inference. You agree the Santa Clara study is a good illustration of the need for checks. You have experienced the shortcomings of peer review, but you aren’t willing to give it up: rather, you want its checks to be more vigorous. You don’t confuse respect for data-collection with deference. You’re a fan of this blog, and you’d be OK to retract when errors are pointed out….etc.

        (7) You read novel coronavirus reports, and you are appalled by the condition of the datasets and the inferences. You want repeated measures on individuals, ordered within time and a nesting of locations, and you want consistent indicators, standards, quality checks, attention to measurement problems … what you find instead is a mess, your heart sinks… However, you are puzzled by the suggestions that these problems will be solved by data sharing, when what you need for every group is Rahul ;)

        (8) Your experience makes you sceptical that handing over data on demand is the best we can do to accelerate the pace of scientific inquiry. Bloggers still gonna blog, haters still gonna hate, and the village isn’t any smarter for it. You are suspicious of all imagined utopias. Given that people require incentives, how will the incentives work? What would be the role of the statisticians already on the team? How does a PI protect their postdocs? You’re already inhabiting a very smart village.

        For now, I’d suggest that armchair statisticians and bloggers approach research groups with an appreciation that the hardest part of science is data collection. Vigorously check every published result. You should be able to obtain anonymised data on request for the purposes of checking inferences in the public domain. However, you have no right to use the data for other purposes. If you have an idea, offer to help in case you might be useful. Consider offering to do so without credit. Model the better angels you hope to see.

    • Dale Lehman says:

      OB:
      “I love this blog, but the sense of entitlement here, among those who I suspect never lift a finger for data collection, is shocking.”

      I can’t help but point out that this is but one step (short or long is debatable) away from “methodological terrorists.”

  13. Ryan King says:

    I’ve run into all of the above from both directions.
    – If you have put a lot of effort into generating a dataset, the analysis is frankly the easy part. Nobody has an incentive to generate a dataset and have other people publish a (possibly bad, you have no control) fast analysis and take the credit for any novel findings. The more money and effort that went into creating a dataset, the more need there is for the creators to get value from it.
    – There is an opportunity cost disincentive. There are some high-profile public or semi-public data sets (e.g. MIMIC), but these show how much effort is needed to deidentify and meet adequate safeguards. It’s hard for me to put in months of work to make data useful to other people when I have my own experiments / papers / grants to be working on. I have no guarantee anyone will care! Do I become responsible in some way for the interpretability / accuracy of the dataset? For example, I may have known certain values are likely data entry errors, do I have to painstakingly annotate all that? Am I responsible for you understanding my preprocessing code?
    – Most big data projects are done with a waiver of consent, and the IRB has to rightfully consider if people would have agreed had they been asked. Given the lawsuits against UChicago + Google for sharing data to develop early warning systems, it’s apparent that some nontrivial number wouldn’t give blanket authorizations. Polling data and the movement for personal data ownership backs that up.

    For EHR projects, I’ve generally stated that if you, the curious party, file the IRB paperwork and get approval for access to the same data, I would happily share the raw data. It took probably a year FTE to adequately process and make sense of our EHR’s output. If you want that year of effort, then some kind of relationship on the work is necessary.

    Clinical trial data is a little different. The expectation of sharing is often just a few columns, and de-identification (not including e.g. detailed path reports) is typically easy. Clinical trials are already expensive (usually funded) projects with a data coordinating center and safety monitor.

    • Ben says:

      > If you have put a lot of effort into generating a dataset, the analysis is frankly the easy part

      This.

      But also I think it’s unfair to put out the analysis without showing the data. They go hand in hand. Like, ostensibly they can be separated but it’s really not like that.

      I agree with Dale that whatever conclusion we come to it can’t be “that the people that collected the data are entitled to withhold it (at least for a period of time) so that they can reap the rewards of their efforts”. Maybe that’s how it is now but I believe we should change that.

  14. Dale Lehman says:

    So many ways to point out how unrealistic it is to ask for data sharing in the current world….

    I agree with the portrayals of how impractical it would be to change things. But change is what is needed. It is well within human control to change the rewards structure in academia and research. Currently, we give little credit for curating and documenting a high quality data set that people find useful – instead, we overemphasize (in my opinion) the analysis and headline-grabbing results of an analysis of that data. And, increasingly this analysis is highly technical, too long for anybody to actually check, and by the time someone finds errors or misrepresentations, the first analysis has already had an impact (Andrew has labeled this phenomenon, but I forget what he called it).

    I happen to believe things are getting worse, and doing so quickly. As data gets larger (I’m sure I’ve mismatched my grammar there) and the analysis tools get more advanced, we are seeing rampant analysis and claims being generated far more quickly than the ability to check or examine the claims. This fits well into a twitter-driven, attract people’s attention world that involves many of the problems expounded upon on this blog. Data availability won’t cure these problems on its own, but I think it is a necessary ingredient in addressing them.

    There are important differences between fields. In science, perhaps much of the data is associated with grants, and these may entail fairly specific purposes (although the data may still be useful for other purposes than the original intent). In social science (where I live), things are a bit different. Much of the analysis is aimed at influencing policy – whether grant-funded or not, I think any analysis that purports to affect policy MUST release the data. This happens in regulatory proceedings (often subject to NDAs), but not regularly in publications (I acknowledge that journals increasingly have data release policies, but I’ve found they are uneven in application, and too often the data release involves a one page pdf explaining that the data is “proprietary”).

    In medicine, it is more complicated due to privacy concerns and ethical and legal requirements. I don’t think there is an easy answer, though I think we have gone far too much in the direction of preventing data release. Recall the NEJM competition on releasing clinical trial data that was held 2 years ago. The issues are complex and there are many points of view. I don’t pretend the solutions are simple or clear-cut. But, there is one defense of data protection that I reject – that the people that collected the data are entitled to withhold it (at least for a period of time) so that they can reap the rewards of their efforts. The reason I reject this argument is that it is totally under our collective control to change the reward structure. This does not make it easy to do so – but my position is that it is essential that we start doing so.

    • Sing it friend. For example, retract funding from a grant unless the data is deposited in a publicly accessible data repository. The End. This will be the new normal in a couple weeks and everyone will sing about how wonderful it is that we’re all sharing data. Press releases from Universities will tout the benefits and how well they’re all doing at “enabling the next generation of synergistic cross disciplinary research in the cloud” or whatever.

      The NIH required all NIH funded research to be published in a form without a paywall in maybe 2009. Voila.

  15. Klaas van Dijk says:

    I am working together with others to get a fraudulent study on the breeding biology of the Basra Reed-warbler retracted; for background, see https://osf.io/5pnk7/ and https://www.pepijnvanerp.nl/2020/03/the-basra-reed-warbler-case/

    Below are some quotes about our efforts to get (and to communicate about our requests for) the raw data of the fraudulent Basra Reed-warbler study. These quotes are from the manuscript “Publisher Taylor & Francis refuses to retract a fraudulent study on the endangered Basra Reed-warbler”. This manuscript is at the moment under consideration at ‘Research Ethics’ https://uk.sagepub.com/en-gb/eur/journal/research-ethics

    “The first author soon refused to release the primary research data. His motives (‘this would formally be a one-sided review’ and ‘the EiC already acted as independent and impartial referee’) are listed in an e-mail of 15 June 2015.”

    “Last author Filippo Barbanera and his affiliation, the University of Pisa in Italy, have never responded on requests to release them.”

    “no university in Saudi Arabia had endorsed this study and the affiliation of the middle author (‘University of King Abd Al-Aziz, Riyadh, Saudi Arabia’) does not exist and had never existed.”

    “EiC Max Kasparek has never responded on a request from 21 June 2015 to get access to the data.”

    “TF responded on 16 June 2016, almost one year after my initial request and after I had sent them several (daily) reminders. TF told me that they were not willing to provide us access to the raw research data. Their motive (‘Like the majority of scientific journals, this one does not compel the author to provide the raw data of the research to anyone. We will not be responding to your request to provide you with this.’) is in strong contrast with proposals for an improvement of transparency in the field of ecology and evolution in Clark et al. (2016).”

    “A new request for access to the raw research data was sent to TF on 20 May 2019. A response was received on 13 September 2019. This response does not contain (parts of) the raw research data. Motives about its unavailability are not mentioned.”

    “Correspondence in April-May 2019 about an earlier version of this manuscript with Alan Lee, EiC of the TF journal Ostrich https://www.tandfonline.com/loi/tost20 , ended with a statement in which Alan Lee declared that he had no scientific opinion about any of the topics in the manuscript because, (a) he had never visited Iraq, (b) he had never visited Iran, (c) he had, towards the best of his knowledge, never observed a Basra Reed-warbler, and, (d) he did not know anyone connected to or associated with the journal ZME and with the Basra Reed-warbler study. This statement is dated 31 May 2019. It does not refer to our repeated requests to get access to the raw research data.”

    “Extensive correspondence between October 2019 and March 2020 about an earlier version of this manuscript with Pippa Smart, EiC of the Wiley journal Learned Publishing https://onlinelibrary.wiley.com/journal/17414857 , ended on 4 March 2020 with a 995 times repeated decline to communicate about the existence of the raw research data.”

    “It needs to be underlined that no response on queries for access to the raw research data, even if they are repeated 995 times, does not automatically imply that it has been proven that the data do not exist.”

  16. Kaiser says:

    “But this study is not a clinical trial, so there is no requirement to share data.”

    This is very disturbing. For observational data (such as data collected from online forums, apps, sourced from businesses), I’d think the default should be that data must be shared. With an RCT, you have some assurance of quality because of the design protocols. With “big data”, there is nothing, no standards, nothing at all. It’s much more important to see the data to know what is being analyzed.
