“We received only one of ten raw data sets requested”

Chris Wiggins sent me a link to this article by Caroline Savage and Andrew Vickers, which, as he puts it, “takes an empirical approach to revealing the community’s publishing practices.” Here’s the abstract:

Many journals now require authors to share their data with other investigators, either by depositing the data in a public repository or making it freely available upon request. These policies are explicit, but remain largely untested. We sought to determine how well authors comply with such policies by requesting data from authors who had published in one of two journals with clear data sharing policies. . . .

We received only one of ten raw data sets requested. This suggests that journal policies requiring data sharing do not lead to authors making their data sets available to independent investigators.

Not good. Personally, I hate it when people don’t share their data. I’ve found researchers in biomedical sciences to be particularly bad about this, possibly because (a) these are big-money fields where the investigators are just too damn busy to reply to requests, and (b) pain-in-the-butt Institutional Review Boards make it difficult to share data. Bad stuff all around, and maybe Savage and Vickers’s paper will be a valuable wake-up call.

12 thoughts on ““We received only one of ten raw data sets requested””

  1. Point (b) about IRBs is an interesting one. At some point soon, scientific fields that collect human data are going to have to revise their data-sharing norms because of new techniques for re-identifying cases in anonymized data. For what it's worth, I doubt most investigators' reluctance really stems from that — people have been complaining about poor data sharing for years. But it's going to add a new wrinkle.

  2. Another issue is that requests for raw data are very often followed up by requests for assistance in analyzing said data. The concern is that with the raw data comes a responsibility to provide enough support for the requester to replicate the original results. This is often nontrivial, particularly for data such as medical imaging. So, you get a request for raw data from someone you don't know from Adam, and the first-blush response is to delete the message. I don't do this, but I do google the person to find out who they are, where they are, and what they are doing. Minimally, it helps me know what to tell them in addition to making the data available, and how much in the way of disclaimers to put in the message.

    The upshot is that sharing raw data can be extremely time consuming, often with very little consideration/gratitude on the part of the data recipient. It often is not simply dropping the data into an email and hitting send. I'm not surprised the response rate was so low.

  3. If the journals were really serious, wouldn't they require you to set up the data archive prior to publication? And wouldn't they have some explicit policies in place on the conditions under which the data should / need not be shared? (e.g. model NDAs)

  4. There are sometimes good reasons for not making data available; they may contain sensitive information, for example, or may have been collected under a law that forbids their use except in specific circumstances. But in most cases there's no real reason for not making them available, except that it puts more work on the data "owner" and may let someone else publish something that stops the owner from doing likewise.

    To me, the best solution is for journals to require that the data be made available via the journal webpage from the time of online publication. This would bring data availability more clearly into the minds of authors when selecting a journal in which to publish (Don't want to make your data available? Don't submit to Science! etc.). It would also limit the work the authors then have to do in replying to, or ignoring, requests for data. Knowing that other people will use the data would encourage authors to ensure it is tidy and well documented, which would limit, if not remove entirely, the problems Dan refers to.

    As a completely unrelated bonus, it would also make finding nice, modern, illustrative examples for stats classes much easier!

  5. It's great that this issue is getting additional exposure.

    Re: Sanjay's link on re-identification: A lot of people have been talking about how cases can be re-identified based on demographic and other information that is meant to be anonymous. This does have implications for how data is shared. For example, combinations of variables like zip codes, age, gender, and date of birth are clearly problematic. However, for the vast majority of researchers who do research that does not depend on re-identifiable demographics, I can't see this being a major barrier for data sharing.

    Consideration of the risk of re-identification is an important component of any data sharing process. Thinking of my own research in psychology, I can see this process leading to exclusion of certain variables from the shared dataset. The data that remains, however, would still be useful to others.

    Re: Dan's comment about time demands and requests for hand-holding
    I agree that such demands at present may take an excessive time commitment given the busy schedules of researchers. However, it would be even better if researchers did not need to ask the original researchers for the data or information about its collection. Rather, data, documentation of the data, and analyses would be kept in an accessible form. Preparation of such data, documentation, and code for analyses could be part of the journal article submission process. Thus, there would be additional demands at the time of preparing a journal submission, but few demands afterwards.

    For this to be successful, there need to be greater rewards for effective data sharing from all parties involved in giving out those rewards: i.e., journal editors, academics on hiring and promotion boards, grant-giving bodies, etc.

    Additional training might also be required for scientists on how to write up reproducible research using concepts like literate programming and technologies such as Sweave.

    I speak a little more about this issue on my blog:
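
    A concrete way to see the re-identification risk discussed above: fields that are individually harmless can be jointly unique. Here is a minimal sketch on synthetic records (the field names are hypothetical), counting how many records are unique on a quasi-identifier combination:

    ```python
    from collections import Counter

    # Synthetic records: (zip code, gender, birth year) -- hypothetical fields.
    records = [
        ("10027", "F", 1970),
        ("10027", "F", 1970),
        ("10027", "M", 1958),
        ("90210", "F", 1985),
        ("60615", "M", 1962),
    ]

    # Count how often each quasi-identifier combination occurs.
    counts = Counter(records)

    # A record is re-identifiable within this sample if its combination is unique.
    unique = [r for r in records if counts[r] == 1]
    print(f"{len(unique)} of {len(records)} records are unique on (zip, gender, birth year)")
    ```

    This is the intuition behind k-anonymity: generalize or suppress quasi-identifiers until no combination occurs fewer than k times in the released data.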

  6. I work in epidemiology, where it is rather common to follow a group of selected individuals through a couple of decades, and see whether they develop certain diseases. The same data can give you 10+ articles, so making the data available after the first publication would amount to shooting your own leg. In addition, anonymizing the data well, as required by legislation, is not simple; it often takes considerable effort, or requires omitting or scrambling key variables in the shared dataset. Yes, it would ultimately be nice to see more data shared, but forcing researchers to do it might not always be the most productive way to get it done.

  7. Data alone is insufficient to permit replication; the code has to be archived, too. If the data and code together are archived, then there is no need for handholding.

    See, e.g.,

    Richard Anderson, William H. Greene, B. D. McCullough, and H. D. Vinod,
    "The Role of Data/Code Archives in the Future of Economic Research,"
    Journal of Economic Methodology 15(1), 99-119, 2008.

    B. D. McCullough, Kerry Anne McGeary, and Teresa D. Harrison,
    "Do Economics Journal Archives Promote Replicable Research?"
    Canadian Journal of Economics 41(4), 1406-1420, 2008.

    B. D. McCullough,
    "Got Replicability? The Journal of Money, Credit and Banking Archive,"
    Econ Journal Watch 4(3), 326-337, 2007.

    B. D. McCullough, Kerry Anne McGeary, and Teresa D. Harrison,
    "Lessons from the JMCB Archive,"
    Journal of Money, Credit and Banking 38(4), 1093-1107, 2006.

  8. If someone has a serious disagreement with the claims, they are, in effect, locked out of getting the data until the last person dies or the grant runs out. Everyone is left with "trust me" which is not science at all. Policy and science both suffer.

  9. "I work in epidemiology, where it is rather common to follow a group of selected individuals through a couple of decades, and see whether they develop certain diseases. The same data can give you 10+ articles, so making the data available after the first publication would amount to shooting your own leg."

    I think this is exactly it. People have an interest in collecting and milking datasets, rather than sharing them.

    The other issue is something Andrew was talking about in another post: statisticians have enormous power to rubbish other people's research. The worry is that someone else might do a better analysis than you, showing something different. This places people whose careers have been devoted to data collection, but who aren't that familiar with statistics, in a vulnerable position with regard to their publications.

    I think (and it might be heretical to post this on a stats blog) that fundamentally the problem might be that the analysis of data is given too much credit. It's the analysis that gets published ('there's a link between X and Y'), but often the hard part is the data collection ('I got far better data than anyone else'). Perhaps it should be more legitimate to simply publish 'I have collected data X and made it accessible to others', justifying why it's better than other data out there.

    Even more radically, if you do substantial analysis on a data set generated by someone else, perhaps you should be required to give them some sort of publication credit, rather than just a cite. If you co-authored with someone on an 'I collect the data, you do the analysis' basis, it's not outrageous that they'd get credit. It's strange that functionally the same division of labour produces nothing for them when you download the data from a repository instead.

    "Everyone is left with "trust me" which is not science at all. Policy and science both suffer."

    That's true. I've got to say the worst I've seen is in MBAish disciplines. You get papers saying 'I've got data from company Y, which I can't identify and which I can't share with anyone because of confidentiality.' I get the impression that some communities – like bioinformatics, astronomy, climate science, and economic history – are very good, though. I suppose it's because, due to the nature of the field, there's only a limited number of main datasets which everyone has to study, or the data is collected systematically by large agencies whose job it is to generate data, rather than by individual research groups.

  10. I participated in an AAAS panel with Stan a few years back about the lack of replicability of epidemiology studies. I ran a survey of journal editors, and the editors of epidemiology journals were significantly worse about data sharing policies than those in other scientific fields (medicine, pharmacology/toxicology, and general science were the categories, iirc).

    A few years back the Proceedings of the Royal Society B, which requires data sharing, published an article claiming that women who eat more cereal in the first trimester of pregnancy are more likely to have boys. It took a threat from the journal editors to get the authors to share the data with a friend of mine…who showed that the findings were a result of not correcting for multiple testing.

    I agree with Alex that there should be a good way to credit whoever collected the data; either a "data from" line among the authors (e.g., Title, by First, A., and Last, M., with data from Source, I., and Helper, J.), or the option of including one or two co-authors from among the data collectors (i.e., they see the paper about to be submitted and have the option of including their names or not).
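
    The multiple-testing problem in that cereal study is easy to quantify. A back-of-the-envelope sketch (assuming independent tests under the null, which real data need not satisfy):

    ```python
    # Chance of at least one false positive among m independent null tests
    # at significance level alpha.
    alpha, m = 0.05, 20
    p_any = 1 - (1 - alpha) ** m
    print(f"P(at least one 'significant' result): {p_any:.2f}")

    # Bonferroni correction: test each hypothesis at alpha / m instead.
    p_any_corrected = 1 - (1 - alpha / m) ** m
    print(f"After Bonferroni: {p_any_corrected:.3f}")
    ```

    With 20 comparisons, pure noise produces at least one 'significant' result about 64% of the time; the correction brings the family-wise error rate back under 5%.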

Comments are closed.