An actual quote from a paper published in a medical journal: “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure.”

Someone writes:

So the NYT yesterday has a story about this study. I am directed to it and am immediately concerned about all the things that make this study somewhat dubious. Forking paths in the definition of the independent variable, sample selection in who wore the accelerometers, ignorance of the undoubtedly huge importance of interactions in the controls, etc., etc., blah blah blah. But I am astonished at the bald statement at the start of the study: “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure.” Why shouldn’t everyone, including the NYT, stop reading right there? How does a journal accept the article? The dataset itself is public and they didn’t create it! They’re just saying Fuck You.

I was, like, Really? So I followed the link. And, indeed, here it is:

The Journal of the American Heart Association published this? And the New York Times promoted it?

As a heart patient myself, I’m annoyed. I’d give it a subliminal frowny face, but I don’t want to go affecting your views on immigration.

P.S. My correspondent adds:

By the way, I started Who is Rich? this week and it’s great.


P.P.S. The above all happened half a year ago. Today my post appeared, and then I received a note from Joseph Hilgard informing me that this statement, “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure,” apparently is a technical requirement of TOP when the data are already publicly available, not a defiant statement from the authors. Hilgard also informed me that TOP is “Transparency and Openness Promotion guidelines. Journal-level standards for how firm a journal wants to be about requesting data sharing, code, etc.”

I remain baffled as to why, if the data are already publicly available, you couldn’t just say, “The data are already publicly available,” and also why you have to say that the analytic methods and study materials will not be made available. I can believe that this is a requirement of the journal. Various organizations have various screwy requirements, there are millions of forms to be filled out and hoops to be jumped through, etc. And the end result—no details on how the data were processed, no code, etc.—that’s not good in any case. It should be easy to have reproducible research when the data are public.

It’s amazing how fast standards have changed. Back when we published our Red State Blue State book, ten years ago, we didn’t even think of posting all our data and code. Partly because it was such a mess, with five coauthors doing different things at different times, but also because this was not something that was usually done. I felt we were ahead of the game by including careful descriptions of our methods in the notes section at the end of the book. But there’s a big gap between my written descriptions and all the details of what we did. When it comes to scientific communication, things have changed for the better.

Let’s just hope that the Center for Open Science and the Journal of the American Heart Association can fix this particular bug, which seems at least in this case to have encouraged researchers to not make their methods and study materials available.

55 thoughts on “An actual quote from a paper published in a medical journal: “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure.”

  1. Meh, at least they’re being honest. Most researchers check the box on the publishing consent form to say “Sure, we understand that we have to share our data with anyone who asks”, and wait to say “Fuck you” until anyone actually does ask. Of course, since they’re terribly polite — unlike those shameless little bullies and methodological terrorists — they don’t actually say that. They just consistently fail to answer the request, or they come up with some lame excuse that typically makes the requestor wish that they *had* failed to answer.

    A halfway step is to say “Data are available upon reasonable request”. Reasonableness is, of course, to be determined by the authors themselves.

    • Sorry, that is your particular agency. I am a Fed. We make petabytes of data easily and publicly available, see for instance:

      https://upwell.pfeg.noaa.gov/erddap

      This is the joint effort of some 20-30 disparate offices making a distributed data system look like one centralized repository.

      And don’t be “fooled” by the web pages. The web pages are nice for certain types of users, but everything in it, and I mean everything, is a web service that can be easily programmed in almost any language or script, and can return subsets of the data in a variety of formats. Search, subsetting, graphics, etc., etc. are all web services – everything is defined by a URL, and anything that can send a URL and receive a file will work with it. All the web pages do is provide an interface to generate the URL. And for economic data there is FRED.

      One of the problems is that everyone bashes the Feds when they can’t get data (see the comments here), but few praise them when it is made available, and I mean not just in blogs like this but to the people who make decisions where it counts. We almost got closed 5 years ago.
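To make the "everything is defined by a URL" point concrete, here is a minimal sketch of how a script might assemble an ERDDAP tabledap request against the server linked above. The dataset ID `exampleDataset` and the variable names are hypothetical placeholders, not real datasets on that server, and the helper `tabledap_url` is an illustration only; the URL shape (dataset ID, file-type extension, comma-separated variables, then `&`-separated constraints) follows ERDDAP's general tabledap pattern.

```python
from urllib.parse import quote

# Server named in the comment above; everything else below is illustrative.
ERDDAP_BASE = "https://upwell.pfeg.noaa.gov/erddap"


def tabledap_url(dataset_id, variables, constraints=(), file_type="csv",
                 base=ERDDAP_BASE):
    """Build <base>/tabledap/<dataset>.<type>?<vars>&<constraint>&..."""
    query = ",".join(variables)
    for constraint in constraints:
        # Percent-encode characters such as '>' and ':' but keep '=' readable.
        query += "&" + quote(constraint, safe="=")
    return f"{base}/tabledap/{dataset_id}.{file_type}?{query}"


# Hypothetical dataset ID and variables, assumed for the example.
url = tabledap_url(
    "exampleDataset",
    ["time", "latitude", "longitude"],
    ["time>=2018-01-01T00:00:00Z"],
)
print(url)
```

Changing `file_type` to another supported extension (for example `"json"`) is how the same subset would be requested in a different format; anything that can fetch a URL could then download the result.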

      • I love NOAA, especially as a sailor. Weather forecasts, all the charts of all the waterways in the US, in scanned and electronic vector formats. Thank you guys for what you do. My impression is that there are agencies who from the beginning were all about collecting and disseminating information: NOAA, Census, NBER, whatever. Those agencies seem to understand dissemination of information and they work hard to make it happen.

        Agencies whose primary function isn’t disseminating information, like HHS, EPA, FDA, IRS and DOD or whatever, are usually impossible to pry information out of.

        • Daniel:

          Regarding the EPA, there’s this story from 30 years ago. But I agree with Roy that we shouldn’t automatically blame the Feds; they’re just a convenient target in part because of their openness. Lots of private researchers don’t share their data but we don’t even think about it because we wouldn’t even expect them to. (And, yes, government agencies have a special duty to share because we’ve already paid from our taxes; but lots of private research is being paid for in some indirect way from taxes or subsidies too.)

        • @Daniel:

          If DoD is difficult to get information out of, can one really complain?

          Is transparency being a good thing always axiomatic?

        • I’ll also note that using the FOIA to get the csv of a publicly available pdf is not really what it was intended for (Daniel’s example), so getting push back there is not horrible either. People have jobs to do.

        • Not true. FOIA states that the government must make an effort to provide any materials asked for provided they are not within a certain restricted set of information related to things like national security or containing personal information (like medical records of a service member for example). Since the info was publicly available from a DOD website in the form of a scanned PDF it couldn’t qualify under any kind of national security restriction etc. And by the way the Navy has *specific people* whose job it is to respond to FOIA requests so that *is* the job they are supposed to be doing.

        • Not to mention that the data has already been compiled in computer form and is sitting on someone’s desktop computer already, and has been handed out to other “buddy” researchers in the past, and the govt is allowed to charge a reasonable fee for the time taken to put it together, which I was willing to pay.

          The point is, because you have to sue the federal govt to get them to pay attention, and doing so is expensive, FOIA only benefits deep pocketed people like the Washington Post.

          I’ll also note that there are SCATHING remarks in case law from judges about how blatantly the govt violates the FOIA law, so it’s not just me complaining, federal judges think it’s atrocious.

        • Maybe I don’t understand your situation correctly; was it that there was a pdf of summary statistics and you wanted the raw data? My understanding from what you wrote above is that there was raw data on a pdf and you just wanted a csv file instead. If it’s the second, just use a pdf scraper. That’ll take less than an hour of your time. Going through all the hoops to find the owner of the raw data, get it reviewed for export, etc., will take at least 10 hours total work time, paid for by the taxpayer. You may be paying a fee, but trust me it does not cover the costs. Not a good use of government funds.

        • Damn right you can complain. The law *requires* them to respond, and within a short time window. There are specific people hired solely for the purpose of responding to FOIA requests, and the failure to respond in this case specifically prevented me from doing research that would have benefited Navy personnel but would have stepped on the academic toes of certain Navy researchers who make a living entirely off producing and analyzing this dataset. The information HAS been given out to their buddies in the past for example.

      • I have tried on multiple occasions to get some datasets from NOAA to no avail. (The first time I got a response of ‘the data owner is on a boat, ask in a couple months’, and nothing else since then, at which point I gave up).

        • Hi Jake:

          Please describe exactly what data you tried to get, and to whom you made the request. Or if you want, email Andrew and he can give you my email address and correspond with me privately. There is an executive order for open data access as well as a NOAA procedural directive – the data owner in all cases is the government, as it should be. So instead of giving up, let’s try contacting the head of that lab.
          I can guess which part of NOAA this was about, but cases like this are a good way to force the issue, and some of us are willing to do so.

        • Hi Roy,

          I’m not comfortable calling people out by name in a public forum. I’ll just say re: your guess, it was under the “O” in “NOAA”, and not climate-related.
          I’ve got a shim email at : [email protected] and can give you Division-level details.

  2. AHA journals actually now require authors to make a disclosure along those lines:

    “Authors must, at the beginning of the Methods section, indicate if they will or will not make their data, analytic methods, and study materials available to other researchers.”
    https://www.ahajournals.org/TOP-guidelines/ExampleDisclosures

    As Nick said, at least this saves potential requesters from wasting their time pursuing data that will never arrive.

    Also, I suppose another benefit is that it makes the lack of transparency a potential target for reviewers. It’s a little strange, though, that the burden is then basically put on reviewers to decide whether lack of transparency is a significant concern for a particular study. What do you consider? The reputation of the authors? How unexpected the results are? Whether similar studies have had replication problems in the past?

  3. Well, this is simply not science. Science is not a Revealed Truth: it is evidence-based and this evidence can (and must) be discussed, debated, amended… A result which rests on nothing is not a result. It is a proofless statement. What is stated without proof can be refuted without proof.

  4. I’m glad you included the Who is Rich? link in the PS. I remembered you had recommended a book in a way that made me really want to read it, but I couldn’t remember what the book was or figure out how to search for the post.

    • That sounds like it just covers the data, right?

      The wording is horrendous if the intent is meant to be ‘we can’t give out this data ourselves, here’s the source’, but that’s not the author’s fault.

      But why can’t they share analytic methods?

      • “But why can’t they share analytic methods?”

        I think this might be (due to) another possible “quirk”, and/or sub-optimal wording, of the TOP-guidelines.

        1) I can imagine that the 1st issue concerning the statement about the availability of the data could stem from the following TOP-guidelines wording (page 4, https://osf.io/ud578/):

        “authors must, in acknowledgements or in the first footnote, indicate if they will or will not make their data, analytic methods, and study materials available to other researchers”.

        I can understand that the words “make their data (…) available (…)” could result in problems when taken literally in the case of already available data (i.e., technically, the authors will not make the data available because they already are).

        Side note: a few sentences before the possibly unclear sentence, things are described differently:

        “For level 1, the published article states whether or not the data, code, materials are available, and if available, how to access”.

        Perhaps if the journal and/or authors would have read, and focused on, that sentence things would be different.

        2) As for the 2nd issue: on page 3/4 of the TOP-guidelines the following is stated:

        “Transparency guidelines for data, analytic methods, and research materials are conceptually distinct. They are presented together as the process principles are similar for each”

        Perhaps it could be that journals in general, or this journal in particular, simply decided to lump all these 3 things together in a single statement. How i interpret the sentence above from the TOP-guidelines is that they can be depicted separately for things like data, analytic methods, and materials. I just skimmed through the guidelines and pre-registration is yet another (separate from these 3) thing the authors could have made clear, if i am not mistaken.

        Altogether it seems to me that the TOP-guidelines could be re-written much more clearly. In the present version, 1) different words/sentences for the same things, and 2) sub-optimal wording and explanation across sections of the guidelines might easily result in the types of issues mentioned in this blog post and comments.

        • Anon:

          I remain confused. If the rule is that “authors must, in acknowledgements or in the first footnote, indicate if they will or will not make their data, analytic methods, and study materials available to other researchers,” then why can’t the authors, in acknowledgements or in the first footnote, write something like this: “Data are available from organization XX at the url YY. We have posted our analytic methods and study materials at url ZZ.”

          It could just be that the authors checked a box and then the journal put in the boilerplate language. I’ve filled out tons of forms, and I usually feel that I have better things to do than read all the fine print. I check boxes all the time; who knows what I’m signing off on.

        • “I remain confused. If the rule is that “authors must, in acknowledgements or in the first footnote, indicate if they will or will not make their data, analytic methods, and study materials available to other researchers,” then why can’t the authors, in acknowledgements or in the first footnote, write something like this: “Data are available from organization XX at the url YY. We have posted our analytic methods and study materials at url ZZ.””

          To me, that would only be common sense (but this is academia!), and i think would even be in line with the (goal of) the TOP-guidelines. I think this could possibly have to do with the exact wording:

          I reason that perhaps in this case where (if i understood things correctly) the data are already publicly available/open, the authors of the paper in question can’t (technically) “(…) make their data (…) available to other researchers” (because the data already are/someone else already made them available).

          Note that this only possibly holds for a statement about the data, as i reason even if the possible scenario i described is what happened in this case, they still could have written different statements concerning the code (and possibly the materials and/or other things like pre-registration). This is because (if i understood things correctly) these are all separate things that could be stated independently of each other according to the TOP-guidelines, and things like code/materials/pre-registration will (unlike the data in this case) not have already been made available by others.

          Like i stated above, i think this case makes clear that sections of the TOP-guidelines could possibly be re-written much more clearly. The “Center for Open Science” gets millions to “improve” (or further totally “screw up”, depending on who you ask) science, so i guess they could pay someone to have a closer look, really think about things, and possibly re-write some stuff…

    • “How did Lindsay (the editor of the journal) let that go by?”

      Perhaps he was too busy counting exactly how many papers with pre-registrations were submitted to his journal. That would also be in line with him simply stopping a discussion and not answering simple questions about practices at his journal here:

      http://statmodeling.stat.columbia.edu/2018/04/15/fixing-reproducibility-crisis-openness-increasing-sample-size-preregistration-not-enuf/#comment-712159

      Side note 1:

      Can’t wait to find out what Lindsay means when he stated that “standards for what is accepted as ‘preregistration’ will gradually increase”. Come to think of it, this might also directly relate to the TOP-guidelines, where (if i understood things correctly) the ultimate level 3 involves some sort of “verification” by the journal (or even a different party?).

      Perhaps at the ultimate level 3 of the TOP-guidelines they will not even make things like the pre-registration information available to the reader anymore, because they will have been “verified” by some party! It wouldn’t surprise me at all after the recent “Registered Reports” debacle (e.g. see https://osf.io/preprints/bitss/fzpcy/)

      Side note 2:

      When reading some more about the levels in the TOP-guidelines, similar possible problematic issues concerning using different words for the same things, and other writing that could be perceived as being unclear caught my eye.

      For instance, page 8 of the guidelines (https://osf.io/ud578/) mentions “certification of the preregistration” which then links to the “open practices badges”. Are these things the same? If yes, why use different terms which can only result in confusion? Regardless of that, if i understood things correctly, journals can hand out badges without “verifying” them (https://osf.io/tvyxz/wiki/2.%20Awarding%20Badges/). If this is correct, how can the badges be used for both “verified” and “unverified” practices?

      Also, in one part/source of the guidelines (https://osf.io/4kdbm/) i can read that level 3 involves verification by a 3rd party, but in the text on page 8 and 9 concerning “pre-registration” i can read “verification” belongs to level 2.

      Also, why isn’t the format of level 1 (disclosure), level 2 (requirement), and level 3 (verification) used with pre-registration? Surely a level 1 disclosure statement concerning pre-registration is as useful as one pertaining to, for instance, data…

      I will repeat my earlier statement in this comment here as well that i think the TOP-guidelines could be re-written much more clearly, and better. Assuming the “Center for Open Science” still receives millions to “improve” (or further “mess up” depending on who you ask) science, i think they should be able to hire someone to really read things thoroughly, and really think about matters some more, and possibly re-write the guidelines…

  5. Apparently, (a) a data disclosure statement is required by the Journal; (b) most papers in the same issue (and I assume in other issues) use the same form of the statement, verbatim. The road to hell is paved with good intentions.

  6. It looks like JAHA is now requiring all authors to include a data sharing statement in the methods section. A bunch of other papers in the same issue have similar statements, though it looks like most other studies that decline to provide data, etc, give a reason. For this study, it makes no sense — the data are from NHANES and thus publicly available — what legitimate excuse could there be for not sharing the statistical code?

  7. Saw some discussion of this on Twitter. It appears this is a case of the data being publicly available but on the precondition that anyone who downloads it doesn’t share it, similar to the ANES. Of course, when I’ve been in this situation, I’ve just shared the R files that will reproduce the analyses as long as you have the raw data files that you can get directly from the data collector.

  8. In 9 out of the 12 categories, the subjects on average engaged in 57 minutes or more of moderate to vigorous physical activity per day. What were they even measuring?

  9. I am now very confused. I read the TOP guidelines and they have several levels for journals to adopt – the lowest level just requires that the authors state whether or not they will provide the data and methods. Hilgard seems to be suggesting that TOP requires the statement in the paper in question – that it will not be shared. It does not seem to me that TOP requires that at all – it only requires that the authors state whether or not data will be made available. Then, Jacob says that using ANES data has a precondition that the data not be shared – but, as he says, this does not preclude releasing the code used to obtain and process the publicly available data from ANES. So, if I am reading all this correctly, the facts are:
    1. This particular journal has adopted a low level of open data – namely that the authors only need say whether or not they will provide the data, code, etc.
    2. The authors are using ANES data, the use of which requires that they not re-release the data.
    3. Nothing in the use of ANES data precludes releasing the details of the “methods” used.

    So, why shouldn’t we tarnish the reputations of both the authors and the journal? It seems that both are guilty of poor – and unacceptable in my view – research practices. The only thing I would absolve them of is their statement that the actual raw data will not be provided, since the ANES data requires that. Even then, wouldn’t they want to say the reason why they won’t/can’t provide the data?

    Did I understand the facts correctly?

    • Thanks Dale,
      I just had a look at the TOP Guidelines and could not see anything like the statement that Andrew quotes.

      “Did I understand the facts correctly?”
      Unless we have both missed something in reading the Guidelines, I think you understood the facts correctly.

    • To be clear, their data are not the ANES, it’s just my understanding that the access protocol is similar. I know I’ve seen the survey data they are using on ICPSR, but I’ve not ever looked closely at whether there are other ways to get it (data archived on ICPSR is generally only available to member institutions).

    • I’m sorry but I do not see what you are saying at all. When I go to that site, it says that you either make data and analytics available, or, if not, make a statement about what, if anything, will be available. This is not the same thing as saying that the AHA demands the statement that is the subject of Andrew’s post. So, the AHA is not mandating that the authors make nothing available, and if the data will not be available, it only says you must make a statement. In other words, the requirement is that you make it clear what you are willing to do and what you will not be willing to do. I don’t see how this removes any of the responsibility from either the authors or the journal for what Andrew has cited. Similar to Hilgard’s statement above, I am puzzled by what seems like a defense of their position on the basis that it is somehow out of their control.

  10. –snip–

    I remain baffled as to why, if the data are already publicly available, you couldn’t just say, “The data are already publicly available,”

    Aren’t NHANES public use data? In which case, it seems to me that their statement implies an assumption that readers know that NHANES data are publicly available.

    Unfortunate that this post has been used over at WUWT to reinforce rightwing hyperbole about academic conspiracies.
    –snip–

    • Sorry, that was supposed to read like this (using “-snip-” to offset a quotation).

      -snip-

      “I remain baffled as to why, if the data are already publicly available, you couldn’t just say, “The data are already publicly available,””

      -snip-

      Aren’t NHANES public use data? In which case, it seems to me that their statement implies an assumption that readers know that NHANES data are publicly available.

      Unfortunate that this post has been used over at WUWT to reinforce rightwing hyperbole about academic conspiracies.

      • > Unfortunate that this post has been used over at WUWT to reinforce rightwing hyperbole about academic conspiracies.

        Link?

        Not sure about the context, but a refusal to share data and code has been a recurring theme in the debates over climate change . . . which sure seems directly relevant to this post.

        • https://wattsupwiththat.com/2018/10/19/nytimes-promotes-an-eff-you-level-of-irreproducible-science/

          An example:

          -snip-
          ….we’re in the new Ford/Kavanaugh era
          -snip-

          As I mentioned below, this is one of a series of issues that are recurring themes across a number of (IMO) proxy battles for a large culture war. Climate change is one of those proxy battles – where people often leverage (and demagogue) the issue of data availability to serve larger political agendas (for example, I have often read complaints about “skeptics” failing to make their data available for analysis).

          IMO, it’s kind of tricky because data availability is a legitimate issue, and has some legitimate implications/connections to political issues…but I think there is a problem in that there comes a point of diminishing and then ultimate negative returns – where the legitimate questions become buried beneath tribal, identity-protective cognition. As an example, I think that there are some legitimate concerns from those who have some reluctance to make data available (even if I think that overall, more people making more data available is a clearly positive trend) – concerns which cannot be effectively addressed when the issue gets polarized tribally.

        • Perhaps my special role here is expert on analogies!

          > to put it on a par with attempted rape

          You misunderstand the comparison. Folks at places like WUWT are very upset about the data sharing (or not sharing) practices of the scientific community involved with climate research, especially folks like Michael Mann and Phil Jones. (Side note: Andrew would, I think, 100% agree with them on this.) This is why they found your post to be of interest. They especially hate the argument that we — non climate scientists — should accept the judgments of the climate scientists without having the option of closely examining their data/code. “Take nothing on faith” is a version of their battle cry.

          So, they are not comparing a refusal to share data with Kavanaugh’s (alleged!) teenage pawing. They are comparing the demand that people believe Ford (regardless of the (understandable!) lack of corroborating evidence) with the demand that people believe climate scientists, regardless of their refusal to share their data/code.

          Hope this is helpful!

        • Folks at places like WUWT are very upset about the data sharing (or not sharing) practices of the scientific community involved with climate research,

          And yet, they aren’t “very upset” about the data sharing practices (or more accurately, the lack thereof) in the climate research of “skeptics.”

          Interesting, isn’t it?

      • Joshua:

        Sure, there’s no need to say anything at all. Most papers that I’ve seen don’t have data available and don’t say anything about it. Other papers use publicly available data but don’t give details of exactly how they analyzed the data—there are typically lots of choices, even with public data, of what measurements to use, how to deal with missing data, etc. It would be difficult to reproduce most of my own published research just based on the descriptions in our papers. As I wrote above, expectations are changing, and I’ll try to do better in the future. Making analysis replicable takes some work on the part of the researcher. However, if done right it should reduce the effort of people who are trying to understand what’s been done.

        Anyway, if the data, code, etc., of the article discussed above are all publicly available, that’s great. In that case, the following statement in the journal is just misleading: “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure.” The whole thing sounds more like a mess than a conspiracy, and it doesn’t sound particularly leftwing or rightwing.

        • Andrew –

          I agree that the statement is kind of odd. But just from reading it, I had the impression that they were assuming the reader knew that NHANES is public use.

          -snip-
          As I wrote above, expectations are changing,…
          -snip-

          Just a few days ago, a researcher using NHIS survey data told me that the American Journal of Preventive Medicine recently required her to upload data, even though it is publicly available. It seems that this was not a requirement until relatively recently.

          -snip-
          The whole thing sounds more like a mess than a conspiracy, and it doesn’t sound particularly leftwing or rightwing.
          -snip-

          In general, I think that the trend is in a positive direction, although I am also concerned about the demagoguing of the issue for political expediency. In and of itself, in isolation, there is nothing particularly right- or left-wing about this particular issue. However, in a larger context I think that it certainly is another in a long line of ideological battles where, it seems to me, quite a few people are interested in holding the issue of data availability hostage for service in a larger culture war.
