“It’s not reproducible if it only runs on your laptop”: Jon Zelner’s tips for a reproducible workflow in R and Stan

Jon Zelner writes:

Reproducibility is becoming more and more a part of the conversation when it comes to public health and social science research. . . .

But comparatively little has been said about another dimension of the reproducibility crisis: the difficulty of re-generating already-complete analyses using the exact same input data. As far as I can tell, the ability to do this is a necessary precondition for replication with new data. . . .

But I think the ease with which we can re-generate complete analyses, on our own computers and those of others, plays directly into the bigger questions of openness and integrity that underlie some of the challenges to reproducibility.

Zelner continues:

Many of you will have experienced the shiver of fear that comes from reviewer comments suggesting that a group of cases should or should not have been dropped, or that a variable should have been coded in a different way.

My first reaction in such situations has historically involved a slightly queasy feeling as I imagine laboriously stepping through each of the downstream things that have to happen (re-run models, re-generate figures, re-construct tables!), all as a result of a small modification to the way the input data were cleaned or transformed.

The friction involved in making these changes increases the incentive to cut corners and to not take potentially useful feedback seriously. It also makes it difficult to incorporate new data as it becomes available, perform sensitivity analyses by re-running the analysis on perturbed datasets, etc. . . . we end up with finished papers backed by a morass of spaghetti code that we hope never to have to run again.

That is soooo true.

Zelner then gets into details:

So, all of the elements of reproducibility I will discuss over the next series of posts are collected here, in a git repository demonstrating a toy example of an R and Stan project that can be fully replicated.

Specifically, it’s focused on fitting a Gaussian finite mixture model to simulated data. Here’s what the output should look like.

I picked R and Stan for this because they are what I live and breathe in my day-to-day research. . . .

Yesssss!
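
For readers who haven’t seen this kind of model before, here is a minimal sketch (not Zelner’s actual code, which lives in his repository) of what simulating and fitting a two-component Gaussian mixture in R and Stan might look like:

    library(rstan)   # assumes rstan is installed

    # Simulate data from a two-component Gaussian mixture
    set.seed(42)
    N <- 500
    z <- rbinom(N, 1, 0.3)                               # latent component labels
    y <- rnorm(N, mean = ifelse(z == 1, 3, -1), sd = 1)

    mix_code <- "
    data {
      int<lower=1> N;
      vector[N] y;
    }
    parameters {
      real<lower=0, upper=1> theta;   // mixing proportion
      ordered[2] mu;                  // component means, ordered for identifiability
      vector<lower=0>[2] sigma;       // component sds
    }
    model {
      mu ~ normal(0, 5);
      sigma ~ normal(0, 2);
      theta ~ beta(2, 2);
      for (n in 1:N)
        target += log_mix(theta,
                          normal_lpdf(y[n] | mu[2], sigma[2]),
                          normal_lpdf(y[n] | mu[1], sigma[1]));
    }
    "

    fit <- stan(model_code = mix_code, data = list(N = N, y = y),
                chains = 4, iter = 2000)
    print(fit)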

And this is what Zelner has so far:

Part 1: Your R script is a program!

Part 2: Makefiles for fun and profit

Part 3: Knotes on Knitr

Part 4: It’s not reproducible if it only runs on your laptop.

Part 5: Bringing it all together with Gitlab CI

Enjoy.

P.S. As indicated in the title above, I like the term “workflow” for this sort of thing.

18 thoughts on ““It’s not reproducible if it only runs on your laptop”: Jon Zelner’s tips for a reproducible workflow in R and Stan”

  1. My bible during my master’s program was J. Scott Long’s “The Workflow of Data Analysis Using Stata” (2009). It really was an under-appreciated book, as it promoted reproducibility principles a few years ahead of the main movement.

  2. Just a minute or two ago I read a paper about good practices in scientific computing. Here’s the link: https://arxiv.org/pdf/1609.00037v2.pdf (19 pages)

    They outline some relatively easy practices to start with, covering data management, software practices, collaboration, project organization, tracking changes, and manuscripts.

    Hopefully at some point these things will be covered in classes at the undergraduate level. They are all matters of habit and are not difficult in and of themselves. All that’s required to get started is a nudge from your instructor.

  3. My data analysis practices changed quite a bit after doing Hadley Wickham’s course, Part 1 of which you can see here. I now write a package every time I do an analysis, and then push it up to GitHub once the paper is published. I also try to go back and generalize the functions so that I can write high-level commands for producing paper-ready plots. I still don’t do the kind of rigorous testing Hadley advocates, though.
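
    In case it helps anyone, the basic pattern looks roughly like this (a sketch; the package and file names are placeholders, not from Hadley’s course):

      # install.packages(c("usethis", "devtools"))

      usethis::create_package("mypaper")   # skeleton: DESCRIPTION, NAMESPACE, R/

      # Analysis steps live in R/ as documented functions, e.g.
      #   R/clean_data.R   with  clean_data <- function(raw) { ... }
      #   R/figure_1.R     with  figure_1   <- function(fits) { ... }

      devtools::document("mypaper")        # generate roxygen2 documentation
      devtools::check("mypaper")           # run R CMD check (and any tests)
      devtools::install("mypaper")         # install locally; push to GitHub when published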

  4. Does anyone know of a real-world example of a published paper following this strategy? I think that would serve as a great educational example. Small examples like these are very helpful, but when I try this approach with a real-world project, I always end up getting lost in an ocean of complexity and cut corners (hardcoding variables, manually fixing things along the way, etc.).

    • Eric,

      This is an example that followed that strategy (and actually tested it) – Canada can compete! 1985. https://books.google.ca/books?id=X8N3FCjMM1kC&pg=PA166&lpg=PA166&dq=%22Canada+can+compete%22+Fleck&source=bl&ots=ku4dFASv07&sig=L0dxs9VVoMq69ryBpV00yYPWx6A&hl=en&sa=X&ved=0ahUKEwjhz_n614XQAhXL6CYKHXAoCzoQ6AEIJzAB#v=onepage&q=%22Canada%20can%20compete%22%20Fleck&f=false

      It worked for a simple reason: Kevin Chang and I were told it was our first priority by Jim Fleck (who had adequate resources that allowed us to do it). It was done in Lotus 1-2-3 with fully documented macros.

      It was tested a year after Kevin and I left, when a new research assistant was hired and all were instructed not to talk with them until they had regenerated all the results we had obtained. Then, after they had been successful, I met with them for a debriefing to provide further context.

      I believe you will find success stories going back to the 1800s or earlier – it’s not rocket science, it’s good management of scientific work – period. (Yes, better technology makes it easier, less expensive, and more likely to succeed.)

      • Keith,

        Your last sentence got me wondering whether Darwin’s field notebooks would be an example. I found the following, which suggests that Darwin at least recognized the problem:

        “Darwin wrote later in Journal of researches about making notes in the field: ‘Let the collector’s motto be, “Trust nothing to the memory;” for the memory becomes a fickle guardian when one interesting object is succeeded by another still more interesting.’ (p. 598.) And in 1849 he wrote in his contribution to the Admiralty Manual:

        [A naturalist] ought to acquire the habit of writing very copious notes, not all for publication, but as a guide for himself. He ought to remember Bacon’s aphorism, that Reading maketh a full man, conference a ready man, and writing an exact man; and no follower of science has greater need of taking precautions to attain accuracy; for the imagination is apt to run riot when dealing with masses of vast dimensions and with time during almost infinity. (p.163)”

        http://darwin-online.org.uk/EditorialIntroductions/Chancellor_fieldNotebooks.html

        • Martha–
          Joseph Grinnell is perhaps a better example from biology. The Grinnell Method of note taking (https://en.wikipedia.org/wiki/Joseph_Grinnell#Grinnell_Method_of_note_taking) is the standard method that several generations of field biologists have been trained in.

          The rigorous field notes let any other scientist repeat the field observations at the same sites, or repeat the same method at different sites. One can quibble whether field natural history observations are “reproducible”, or whether the goal is more to record all of the additional information that might help explain (as in “find patterns to”) different results the next time or place. The second key is that Grinnell knew his data would grow in value over time, hence he employed the best archival technology of his time: archival high-quality paper & ink and well-prepared specimens, in addition to the comprehensive field notes.

          The proof of the value is the “Grinnell Transects” across the Sierra Nevada in Yosemite, Lassen, and Sequoia, plus deserts and the North Coast in California early last century. The Grinnell Resurvey Project http://mvz.berkeley.edu/Grinnell/ replicated the work ~100 years later. Almost all specific locations along those transects were successfully relocated from the field notes. The flashy science is how many species ranges have shifted up in elevation.

          I work for the National Park Service Inventory & Monitoring Program. The Grinnell Resurvey is how I explain to colleagues the importance of discoverable, archived data with that level of reproducibility/usage documentation as the “way-back machine” for future park managers & scientists. [Current technology is databases and metadata and reproducible workflows in archival hardware & formats, and images & DNA samples rather than stuffed birds.] I came across this post because I’m currently evangelizing reproducible workflows on the back end to field scientists who were trained in the Grinnell field notebook system on the front end of their research. This pitch for end-to-end reproducible science is less than successful. Pitching reproducible workflows as automating repeated reporting gets a much better reception.

        • Thanks. Interesting.

          But it also reminds me of a story a botanist told me about a student who methodically made observations: starting at the bottom of the mountain and observing all the plants (of the type of interest) at that level, then proceeding up the mountain, starting each day where she left off the previous day. So in the end, altitude and time were confounded. (But at least the methodical recording made that clear.)

  5. It’s a great workflow, but it may not be for everyone. I’ve run into trouble with the standard software workflow when a cluster is required, or when datasets are very large or confidential.

    While I would like to do the whole docker/CI thing, I just don’t see it working for me. I have to run lots of analyses on a cluster. Currently I document my job scripts and the cluster environment, but I don’t see a way of dockerifying everything that follows from running these analyses, especially since the stanfit objects are often >1GB, so I don’t want to commit them to git.

    Does anyone know of a similar guide to the “RStudio” approach to a reproducible workflow?
    For me, dependency management isn’t that important right now (my pipeline is fairly linear, so it doesn’t seem worth the significant extra effort), but it is important to me that the whole process is well documented, because I think it’s much more likely that someone will read my documentation/study report than that someone will actually try to reproduce it.
    So, I rely heavily on rmarkdown to turn all my scripts (.R and .Rmd) into a website, documenting the whole process on different pages. I might try to write a guide about this workflow at some point as well.
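
    In case it’s useful to anyone, the core of that approach is roughly the following (file names are placeholders):

      # rmarkdown::render() accepts plain .R scripts as well as .Rmd files
      # (.R files are converted via knitr::spin), so every step of the
      # pipeline can become an HTML page.
      library(rmarkdown)

      steps <- c("01_import.R", "02_clean.R", "03_models.Rmd", "04_figures.Rmd")

      for (f in steps) {
        render(f, output_format = "html_document", output_dir = "docs")
      }

      # The pages in docs/ can then be linked from an index page and served as
      # a site, so readers can follow the whole process without re-running it.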

    • Ruben,

      Have a look at our presentation on ownR – a product suite focused on enterprise R deployment and management, including integration with continuous-integration tools, private package sharing (not via GitHub), etc. In our internal setup we use Jenkins to automate not only deployment but also roxygen-based documentation, running unit tests, etc., and reproducibility is handled by making previous versions available transparently (if the DESCRIPTION file specifies a version, our open-source container management tool roveR installs the right versions into the project container).

      The presentation itself is available here: http://www.slideshare.net/DavidKunFF/ownr-presentation-erum-2016. We presented it at eRum a while ago; here is the video shot at the conference: https://www.youtube.com/watch?v=p84Up9EWLG8

      It does not solve the documentation of the whole workflow, only the workflow of creating an R package, making it available to other R users via install.packages, and making chosen functions available via a REST API for non-R users. If your workflow includes multiple steps outside R, you can either hand the functionality over to the next person with documentation of the R part, or use a Business Process Management tool that is capable of making calls via a REST API.

      If you need any further info, please drop an e-mail to the address at the end of the presentation.

      Best,
      David

    • Ruben,

      Presumably you would not commit the stanfit objects to git but rather generate them from data and scripts.

      Nevertheless, there are still logistical issues with managing large data sets and the large stanfits they generate. I followed Zelner’s tips on my last two projects, but I ran into resource issues when I got to the docker step. My server does not have enough resources to let me work and have docker run my analysis at the same time. Soon I will make time to set up a separate server to run docker, and then I will work on CI.

      Even without docker/CI, I have found the workflow very useful. At first, forcing everything to be reproducible felt onerous, but eventually I felt liberated by being able to reproduce the whole analysis non-interactively. Making the process reproducible is its own kind of documentation.
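
      For the big stanfits specifically, a simple caching pattern helps (a sketch; the paths and the stan_data object are placeholders): the fit is written to a directory listed in .gitignore, so only data and scripts are committed and the stanfit is regenerated on demand.

        library(rstan)

        fit_path <- "output/fit.rds"   # output/ is in .gitignore

        if (file.exists(fit_path)) {
          fit <- readRDS(fit_path)
        } else {
          fit <- stan("models/model.stan", data = stan_data,   # stan_data: placeholder list
                      chains = 4, iter = 2000)
          dir.create("output", showWarnings = FALSE)
          saveRDS(fit, fit_path)
        }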

      • I hadn’t condensed my point very well:
        I think there is use for a less complete software approach to reproducibility in science, which separates the data from the pipeline, but makes sure that the pipeline code is stored _with results_ (knit documents).

        The reasons why this would be necessary:
        – the data are too big or too confidential to upload anywhere outside a trusted environment (no GitHub, Travis, etc.)
        – access to a cluster is needed, but one obviously can’t give GitHub, Travis, etc. access to it, and reproducibility on the cluster is its own problem

        The reasons why it could be better:
        – I think it’s far more likely that people read my analysis reports than that they reproduce it all
        – they can glean details about the process that weren’t clear in the text
        – they can glean small results (such as results of model checking or extensive descriptives) that don’t make it into the average journal article.
        – most readers (in my field anyway) won’t know how to reproduce something using Docker (not even where to start), so you’d limit yourself to the most technically minded

        So, @Eric, I can store neither my data nor the fits in the repo. The data are confidential (and the fits contain data), and the models cannot be fit without a cluster.

        @David: Do you solve the problem that R doesn’t have proper dependency management? Is this the website? I can’t see any details… http://ownr.io/ Anyway, it’s not free, right?

        That said, I don’t even know how to manage my Rmd files in an R package. Is that possible?

        • Hi Ruben,

          Yes, we solve the dependency problem – again, see lair.ownr.io. Indeed, it is not free; it’s a licensed product.

          As for the management of Rmd files, I can only recommend using roxygen-style in-line documentation and using roxygen2 to actually generate the documentation. That way we don’t even check the Rmd files into Git – instead, Jenkins (our CI tool) is used to generate the documentation during deployment.

          Cheers,
          David

        • You do have to check in vignettes, since they aren’t generated. The same goes for templates, if you have them. A good place is inst/rmarkdown/templates for templates and inst/doc for vignettes.
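
          A rough sketch of the setup (the vignette name is a placeholder):

            # Creates vignettes/full-analysis.Rmd plus the required DESCRIPTION
            # fields; the built copies land in inst/doc when the package is built.
            usethis::use_vignette("full-analysis")

            # R Markdown templates go in the package source under
            #   inst/rmarkdown/templates/<template-name>/
            # so they ship with the package.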

        • Ruben,

          Re dependency management, I think I have a better guess now about what you meant. I have just been helping someone on LinkedIn (https://www.linkedin.com/groups/77616/77616-6200716635842580484?trk=hp-feed-group-discussion) who had trouble installing a package. I looked into it and found that one of the dependencies could not be installed. A closer look revealed that it depended on dplyr, but dplyr got updated, this package is not (yet) updated, and some functionality it depends on is now gone from dplyr. This breaks the install, since the author was imprecise in specifying the version of dplyr they depend on (just >= 0.4).

          For this we do have a solution, but it’s actually two tools. One is open source and free of charge: roveR. It creates a container for each project separately and changes your R profile to install all packages into that folder. This way you have a version of dplyr and of the dependent package that actually work together, and you can test whether everything still works after an update of any or all of the packages. The command to install roveR is in the presentation I linked earlier.

          Together with laiR, the licensed pillar in our toolset, roveR can install specific (older) versions of dependent packages, which CRAN doesn’t support. So in this case, if the author had been more precise and stated that the dependency on dplyr is >= 0.4 AND <= 0.9 (the latter is just an example), then roveR would install the latest version of dplyr that still satisfies the constraint. This needs laiR, or you need to do it by hand.
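
          By hand, one generic way to pin a version (independent of our tooling) is to install a specific archived release from CRAN, for example:

            # install.packages("remotes")
            # Installs an older dplyr release from the CRAN archive so that code
            # written against the 0.4/0.5 API keeps working.
            remotes::install_version("dplyr", version = "0.5.0",
                                     repos = "https://cran.r-project.org")

            # Recording the versions actually used (e.g. via sessionInfo(), or a
            # lockfile from a tool like packrat) makes the library rebuildable later.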

          Does this answer your question on dependency management?

        • I think in this situation you could do what is done when unit testing software that takes inputs: create minimal data sets that reproduce the structure, with the smallest amount of data needed to ensure that your scripts run without error and produce the same results. Then you can store the confidential/large data elsewhere and provide access to it in a different way.
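
          A rough sketch of that idea (the function and column names are hypothetical):

            library(testthat)

            # Tiny synthetic data set with the same structure as the real
            # (confidential) data, enough to run the pipeline end to end.
            make_test_data <- function(n = 20) {
              data.frame(
                id      = seq_len(n),
                age     = sample(18:90, n, replace = TRUE),
                outcome = rbinom(n, 1, 0.5)
              )
            }

            test_that("pipeline runs on minimal data", {
              dat <- make_test_data()
              res <- run_analysis(dat)   # run_analysis(): hypothetical pipeline entry point
              expect_s3_class(res, "data.frame")
            })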
