Different challenges in replication in biomedical vs. social sciences

Summary

– In biological sciences, it might be reasonable to expect real effects to replicate, but carrying out the measurement required to study this replication is difficult for technical reasons.

– In social sciences, it might be straightforward to replicate the data collection, but effects of interest could vary so much by context that replication could be difficult.

This is all interesting because it has nothing to do with p-values, forking paths, or statistical analysis. It’s all about the difficulty of measurement and variation in underlying effects: two topics that are typically ignored entirely in statistics textbooks and courses.

Background

And here’s how this came up:

Kleber Neves, who works with the Brazilian Reproducibility Initiative, writes:

We briefly discussed how the technical expertise required to perform the experiments is an aspect that differentiates biomedical and social sciences.

On that note, I thought you might be interested in this paper I just stumbled upon by David Peterson. It is a comparative ethnography of a molecular biology lab and a psych lab, whose main point is about the very distinction we discussed.

I replied that I wonder whether things will change now that biologists can demonstrate their procedures on YouTube. On the other hand, there must be some biologists who view their lab techniques as trade secrets and don’t want their work easily replicated…

And Neves responded:

Yes, there’s actually a journal for recorded experiments, which works as a mild incentive for sharing your trade secrets. I’d say YouTube is common when the technique is new and nobody is physically available with the expertise. That said, I think most of this motor learning still happens through an apprenticeship of sorts (someone more experienced goes to the bench with you to teach, closely following you until you’re good at it).

What I find curious is that it ends up adding a new forking path. It goes something like this: “well, this experiment did not give a significant result, but maybe you don’t have a good ‘hand’ for the technique, it’s probably why your controls were not as they should be. Do it again until you get it right”.

21 thoughts on “Different challenges in replication in biomedical vs. social sciences”

  1. This is the most important, and yet the most neglected, subject in statistical training. It’s an awfully long time since I worked in a university (44 years), but I doubt things have changed much. In my day the statistics courses given to biologists, sociologists, psychologists, etc., were presented by the most junior member of the Statistics Department. The course was the same as the one given to first-year mathematical statisticians. But what was needed was a basic course on the practicalities of running experiments, to ensure that a design worked out on paper could deliver reliable results in the field.

    The first two sentences in your summary don’t make much sense to me. Experimental design procedures can overcome all of the difficulties you could imagine.

  2. Somehow this reminded me of the Brian Kernighan quote about writing code –

    “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?”

    Might this apply to both statistical analyses and experimental design?

    • I take the point of the quote to be, “if you put all your effort into writing your code before debugging it, you’ll have a hell of a time figuring out why it doesn’t work”.

      I do think there’s a parallel here to both experimental design and stats. The parallel with stats is obvious since stats is applying a set of computational procedures to a set of inputs (data), which has the same logical structure as functional programming. And just like debugging a program, you want to make absolutely sure each step of your analysis is doing what you think it’s doing before you try plugging them together.

      But I think there’s also a parallel with research design, in that there’s a school of thought, not exclusive to any particular science, that is basically, “well, this is a cool question so let’s just collect some data and figure out what it means afterwards.” (This is the same school of thought that inspired the Fisher quote about stats after the fact only being able to tell you “what the experiment died of.”) I see this as the same kind of cavalier attitude behind poor debugging and poor analyses: someone is so convinced that they will get it right the first time that they don’t bother to check their work until it’s too late.

      The problem is that, unlike in programming or computation, you can’t get immediate feedback on elements of research design in most cases. It costs money and time to collect data* and you typically can’t treat elements of an experimental design as functionally independent in the way you (hope you) can in programming and analysis. This is where I think simulation can play a crucial role: as you design your experiment, try to simulate expected results under different scenarios to make sure that your experimental design is capable of answering the questions you want (e.g., that there is sufficient power to detect effects/interactions of interest and/or to measure quantities with sufficient precision). A minimal sketch of this kind of check follows at the end of this comment.

      * Computation is cheap these days, but used to cost a lot of time and money too (and still does for big problems). I was recently reading an article from 1972 that used a lot of simulation modeling and the author explicitly said, “to save money, we did not simulate conditions XXX from this experiment.”

      PS: There is a lot of work from people like Mark Pitt and Jay Myung on optimizing experimental designs “on the fly” that relies on getting instant feedback during an experiment, but so far these are limited to scenarios where the experiments are very simple and the models are reasonably tractable.
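
      To make the simulation suggestion concrete, here is a minimal sketch in Python of a design-stage power check. Everything in it (effect size, noise level, alpha, the candidate sample sizes) is an illustrative assumption, not a value from any real experiment:

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(42)

      def simulated_power(effect=0.5, n_per_group=30, sd=1.0, alpha=0.05, n_sims=2000):
          # Fraction of simulated two-group experiments in which a t-test
          # detects the assumed effect at the chosen alpha level.
          rejections = 0
          for _ in range(n_sims):
              control = rng.normal(0.0, sd, n_per_group)
              treated = rng.normal(effect, sd, n_per_group)
              if stats.ttest_ind(treated, control).pvalue < alpha:
                  rejections += 1
          return rejections / n_sims

      # Sweep candidate sample sizes before committing to a design.
      for n in (20, 50, 100):
          print(f"n per group = {n:3d}: simulated power ~ {simulated_power(n_per_group=n):.2f}")

      The same scaffolding extends to interactions, measurement error, or dropout: simulate the messy data you actually expect to collect, then run the planned analysis on it and see whether it can answer your question.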

  3. Interesting points. I (relatively) recently worked on a five-year biomedical research project that involved wet lab scientists, bioinformaticians, and manufacturing and process development engineers.

    One of the striking things that came out of the study was the total disparity in operational tolerances and performance of identical lab equipment purchased and installed across multiple EU sites, even compared to the manufacturers’ published technical specifications. Moreover, the protocols followed by the wet lab scientists were deemed far too imprecise (compared to the SOPs of the manufacturing world) and the physical activities involved in experimental procedures too cryptic to ever be replicated in exactly the same manner by other groups.

    In light of this, it’s a wonder that biomedical studies replicate at all!

    • Basic science medical research is indeed problematic in the way you describe.

      Clinical research faces different issues. In typical clinical studies, the measurements themselves are well standardized, often things that are widely used in commercial laboratories and subject to regulatory standards that require frequent recalibration of equipment, etc. And in large studies carried out at multiple clinical centers, most measurements would be done by a single reference laboratory to eliminate the remaining variance from lab effects. Measures that require substantial rater skill or rater judgment, or that rely on equipment that cannot be reliably calibrated, are seldom used in large clinical trials these days.

      The challenges in clinical research are different. There can be a great deal of person-level variation in response to interventions based on any number of behavioral or environmental factors that are impossible to control and frequently are difficult or impossible to even identify, let alone measure. Hence the need for very large sample sizes to enable randomization, hopefully, to distribute these evenly across study arms within the trial. But these very same factors may well differ when a trial is “replicated” in a different population or setting, and the randomizations, of course, cannot distribute these equally across different trials.
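
      To illustrate that last point, here is a minimal sketch in Python with made-up numbers: the treatment helps only an unmeasured “responder” subgroup, so randomization balances responders across arms within any one trial, but the estimated average effect still shifts when the “same” trial is run in a population where responders are rarer:

      import numpy as np

      rng = np.random.default_rng(1)

      def run_trial(n=2000, p_responder=0.5, effect_in_responders=1.0):
          # One randomized trial; 'responder' is an unmeasured patient factor.
          responder = rng.random(n) < p_responder
          treated = rng.random(n) < 0.5  # 1:1 randomization
          outcome = rng.normal(0.0, 1.0, n)
          outcome = outcome + np.where(treated & responder, effect_in_responders, 0.0)
          return outcome[treated].mean() - outcome[~treated].mean()

      # The "same" trial replicated in two settings with different
      # (assumed) prevalences of the unmeasured factor:
      print("Setting A (70% responders):", round(run_trial(p_responder=0.7), 2))
      print("Setting B (20% responders):", round(run_trial(p_responder=0.2), 2))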

      • Yes, often the pressure to replicate within the same research group (let alone between other groups and across countries) results in the “keep replicating until you see approximately the same effect” approach.

        Still, given all the confounding factors that can inhibit reproducibility, perhaps our results are more robust than we might think?

  4. I worked on a hearing aid clinical trial when I began my grad program 25 years ago that was illuminating with regard to things that can screw up replication. The audiology clinic, replicating a trial done at another med school to finalize device approval, used audiology students instead of audiologists to fit the hearing aids on patients, ostensibly to save money plus give students training. The results, of course, didn’t replicate due to poor fitting of the aids, but I do remember we could statistically show an ear effect, meaning the left-ear aid performed better than the right-ear aid (or vice versa; it was 25 years ago). It turned out that the left-ear fitting was done by a student getting ready to graduate and the right-ear fitting by a first-year student.

    Hence the lesson for me that variation in the replication itself can be as important as other sources. It’s a lesson that social science rarely acknowledges.
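
    For what it’s worth, the confounding in that anecdote takes only a few lines of Python to sketch (the skill numbers are invented): because each ear was always fit by the same student, a difference in fitter skill is indistinguishable from an “ear effect.”

    import numpy as np

    rng = np.random.default_rng(7)

    n_patients = 100
    fitter_effect = {"senior": 0.8, "first_year": 0.0}  # assumed, not measured

    # Every left ear was fit by the senior student and every right ear
    # by the first-year student, so fitter and ear are perfectly confounded.
    left = rng.normal(0.0, 1.0, n_patients) + fitter_effect["senior"]
    right = rng.normal(0.0, 1.0, n_patients) + fitter_effect["first_year"]

    print("mean left-minus-right score:", round((left - right).mean(), 2))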

      • This study was also my first experience with “the file drawer effect.” All concerned parties thought it was best to bury something conducted by first-year audiology and statistics grad students, for everyone’s reputational sake.

        I probably violated some non-disclosure agreement just by mentioning it 25 years later.

        Wait a minute, aren’t you part of the methodological hit squad known as Statistical Murder, Inc.? Then it’s also a good thing my nom de plume was different back then…

  5. In life science research, blaming non-reproducibility on technical reasons is called the “golden hand” excuse: “only my technician or post-doc could do this kind of experiment.” There is no experiment that could not be repeated by the general scientific community. For example, when Shinya Yamanaka published an article describing how to induce adult cells into stem cells by introducing four genes, a tricky technique, many labs were able to reproduce his findings within months. He was awarded the Nobel prize. Another group in 2014 claimed to generate stem cells by another method (stimulus-triggered acquisition of pluripotency), and no one could reproduce their results. At first they claimed that their method was too demanding, but later it was found to be a fraud. I don’t think claiming technical reasons for non-replicability is acceptable.

    • Well said. Every method in published research should be reproducible. Researchers should be afraid to go to press with useless methods descriptions. It’s just not that hard or time-consuming, and as the case you describe shows, allowing the community to confirm one’s results is an obvious plus for any researcher.

      contrast with:

      “It’s just wasteful for everyone to scrutinise methods and data (which requires substantial training and expertise). Most people are instead going to have to trust others who have done that at least to some extent.”

      https://statmodeling.stat.columbia.edu/2020/02/24/vaping-statistics-controversy-update/#comment-1248471

    • Imaging Guy: “I don’t think claiming technical reasons for non-replicability is acceptable.”

      I agree, it shouldn’t be acceptable, but the fact is that the golden-hand “excuse” is a real and persistent phenomenon, which should drive more careful reporting of methodology and standardisation of practices.

      I know someone who spent their whole PhD pursuing a method that their supervisor dismissed based on the (flawed) work of a previous post-doc. Fortunately, they were dogged enough to believe in their work, but the pressure to follow the path of least resistance must have been very great.

  6. In materials chemistry, an area that is neither biomedical science nor social science, my group recently tried to quantify how often a newly synthesized material gets synthesized again by anyone. This gives at least partial insight into how often replicate experiments could be reported in the literature, since a replication is simply not possible if the original material is never made again. Our results were reported recently here:
    https://www.pnas.org/content/117/2/877.short

    Having a material not repeated in the literature doesn’t imply that the original synthesis was faulty or non-repeatable. A more usual situation is likely to be that the research community “decided” in some way that the material wasn’t “interesting”.

  7. Well, I’m not here to bash my basic-scientist friends. But it seems to me that there is group-think going on, just like in the social sciences. Maybe I am incapable of understanding this because I don’t have sufficient biology background. But there is hardware and software from Bio-Rad Labs that does quantitative PCR, which many experiments use. It might be totally legit. However, all of the articles that I have read explaining what is going on are written by non-statisticians, which maybe is understandable since we just don’t get it. Yet it seems to me that there is far more variability than is controlled for in these idealized examples. And the articles don’t really explain the outputs from the hardware/software (at least not to my satisfaction; but again, I don’t understand the biology either).

    • I haven’t thought about PCR in years, but my recollection is that one of the problems with using it is that the principal components are typically “interpreted” and given names accordingly, which can be very iffy. For example (if I remember correctly), in using PCR with data from “intelligence” tests, one typically comes up with a first component that is designated as “general intelligence” and denoted by G. Then the other components might be interpreted as more specific “intelligences”. Then sometimes these names become “reified”, thought of as things in themselves, “out there” rather than as fuzzy concepts that are just labels given to the principal components.

      • I think “PCR” in Scaredy Cat’s comment refers to polymerase chain reaction, a method for amplifying DNA. You are probably confusing it with principal component analysis. As for principal component analysis, here is what statistician Donald Mainland said: “If you don’t know what you’re doing, factor analysis is a great way to do it.” (The quote is from “Principles of Medical Statistics,” Alvan Feinstein, page 623.)

    • Quantitative PCR can indeed be a tricky system to get up and running. It’s one of those methods that until recently (with the advent of standardised reagents and automation) invoked the phrase “black magic” fairly often.
