“Repeating the experiment” as general advice on data collection

Izzy Kates points to the above excerpt from Introductory Statistics, by Neil Weiss, 9th edition, and writes:

Nowhere is repeating the experiment mentioned. This isn’t the only time this mistake is made.

Good point! We don’t mention replication as a statistical method in our books either! Even when we talk about the replication crisis, and the concern that certain inferences won’t replicate on new data, we don’t really present replication as a data-collection strategy. Part of this is that in social sciences such as economics and political science, it’s rarely possible to do a direct replication. The closest example would be a time series of polls, but there we’re typically interested in changes over time, so the polls replicate the method while the underlying quantity being measured may itself be changing.

I agree with Kates that if you’re going to give advice in a statistics book about data collection, random sampling, random assignment of treatments, etc., you should also talk about repeating the entire experiment. The problem is not specific to Weiss’s book. I don’t know that I’ve ever seen a statistics textbook recommend repeating the experiment as a general method, in the same way they recommend random sampling, random assignment, etc.

Remember the 50 Shades of Gray story, in which a team of researchers had a seemingly strong experimental finding but then decided to perform a replication, which gave a null result, making them realize how much they were able to fake themselves out with forking paths.

Or, for a cautionary tale, the 64 Shades of Gray story, in which a different research team didn’t check their experimental work with a replication, thus resulting in the publication of some pretty ridiculous claims.

So, my advice to researchers is: If you can replicate your study, do so. Better to find the mistakes yourself than to waste everybody else’s time.

P.S. We’re still making final edits on Regression and Other Stories. So I guess I should add something there.

19 thoughts on ““Repeating the experiment” as general advice on data collection”

  1. One measurement I repeated after several years had elapsed uncovered a real but previously unknown (to me) change over time. This turned out to be important for my company to know about.

  2. There is a related aspect of meta-analysis that is affected by this. In a meta-analysis, several studies aiming at estimating a common effect are considered and combined.

    These studies have a temporal character that is not accounted for in the meta-analysis.

    Some are actual repeats, or semi-repeats.

    So, meta-analysis is an investigation of repeated experiments without considering the sequential aspect of it…

  3. What do you mean by replication in this context? Weiss’ third point states that you should have sufficient observations to answer your research question. A lab experiment in economics will often be conducted over multiple sessions (due to space constraints in the lab) with different participants in each session, and this makes the definition of a single experiment somewhat arbitrary. Let’s say I need to run X sessions to get enough participants given my design, expected effect and desired power. If I replicated the entire experiment using the exact same materials I would basically be multiplying my initial sessions by some replication number K. So if I decide to save time and do this right away, running K*X sessions and analyzing them jointly, is that 1 experiment with a lot of power or K replications of a single experiment? Is the point that I need to run these sessions in different places using students from different faculties or birth cohorts or countries and with slightly different designs because of effect heterogeneity or to ensure robustness in some way? If that is the concern, it seems like it should be OK to take that into account during the design phase by running the experiment with sessions conducted at different times, with different types of participants etc.
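
    To make the trade-off concrete, here is a minimal simulation sketch (all numbers hypothetical, plain numpy): K*X sessions with a session-level shift and session-varying treatment effects, analyzed once by pooling all subjects and once with the session as the unit. If sessions genuinely differ, the naive pooled standard error understates the uncertainty about the average effect; that is one way of saying that “K*X sessions analyzed jointly” only counts as K replications if the analysis respects the session structure.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    S, n = 20, 24                   # hypothetical: K*X = 20 sessions, 24 subjects per arm per session
    tau, sd_tau = 0.30, 0.30        # average effect and between-session effect heterogeneity
    sd_sess, sd_subj = 0.50, 1.00   # session shift and subject-level noise

    diffs, treated_all, control_all = [], [], []
    for _ in range(S):
        shift = rng.normal(0, sd_sess)        # session-level "error term"
        tau_s = tau + rng.normal(0, sd_tau)   # session-specific treatment effect
        c = rng.normal(shift, sd_subj, n)
        t = rng.normal(shift + tau_s, sd_subj, n)
        diffs.append(t.mean() - c.mean())
        treated_all.extend(t)
        control_all.extend(c)

    diffs = np.array(diffs)
    t_all, c_all = np.array(treated_all), np.array(control_all)

    # Naive analysis: pool all S*n subjects per arm as one big experiment.
    se_naive = np.sqrt(t_all.var(ddof=1) / t_all.size + c_all.var(ddof=1) / c_all.size)
    # Session-aware analysis: treat each session's difference as one observation.
    se_session = diffs.std(ddof=1) / np.sqrt(S)

    print(f"estimate {diffs.mean():.3f}, naive SE {se_naive:.3f}, session-level SE {se_session:.3f}")
    ```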

  4. Isn’t “repeating the experiment” implied in “replication” (item 3)?

    I really don’t understand the objection that “Nowhere is repeating the experiment mentioned. This isn’t the only time this mistake is made.” How is this a mistake? If I “replicate” my experiment by watching a bunch of cells grown in different batches on different days, thereby “[increasing] the chances of detecting any differences among the treatments,” am I not “repeating” it?

    • Raghu:

      I think that “repeating the experiment” implies doing the whole thing again, not just collecting more data. From a statistical modeling standpoint, repeating the entire experiment can be seen as taking a new sample of the “error terms” corresponding to various aspects of data collection, not just the variation seen within a single study. And this is all in addition to the advantages of a fresh perspective and avoidance of forking paths.

      This is a big deal. Increasing N does give some internal replication, but it is not in general the same as external replication. Again, see the 50 Shades of Gray story for an example.
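
      In symbols (a sketch, lumping all the study-level “error terms” into one term η with standard deviation σ_η):

      ```latex
      % One study of size n: y_i = theta + eta + eps_i, where eta is a single
      % draw of the study-level error, shared by every observation.
      \mathrm{sd}\!\left(\hat\theta_{\text{one study}}\right)
        = \sqrt{\sigma_\eta^2 + \sigma_\epsilon^2/n}
        \;\longrightarrow\; \sigma_\eta \quad (n \to \infty)
      % Only K independent replications resample eta and shrink that floor:
      \mathrm{sd}\!\left(\hat\theta_{K\ \text{replications}}\right)
        = \sqrt{\left(\sigma_\eta^2 + \sigma_\epsilon^2/n\right)/K}
      ```

      So taking more data within one study buys precision only down to the σ_η floor; repeating the whole experiment is what shrinks that floor.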

      • I see your point — that in many cases we don’t know the “error terms” underlying the data collection process, and repeating the whole thing helps deal with this. I like the 50 shades of gray story example.

        However: I still don’t think this should be elevated to the status of a “rule” equivalent in magnitude to #3 on the list. Continuing my cell analogy, I *could* repeat the experiment and ask completely different people to prepare the cells’ growth media, use a different incubator, buy different lenses and lasers for my microscope, etc., to average over potential differences in all these things. It would, however, be a waste of time — we understand these aspects well enough. But there may be other experiments with cells for which it would be worthwhile — many 3D cell culture experiments rely on poorly understood, near magical components of growth media, with not only particular suppliers but particular batches “working” or not. Whether the repetition is worthwhile depends on the context, and isn’t, I would argue, a general rule.

        One could argue that in the social sciences, nothing is well understood, so everything should be repeated; that’s fine.

        There’s another related point that increasingly hits me, that the scale required to do these studies properly (i.e. repetitions, large groups, many years) is incompatible with the way we structure academic science (short-term funding, publishing constantly, etc.). But this would take longer to elaborate, and I’ve already wasted too many hours this week on a blog post, so it will have to wait…

    • I think the difference between “replicated experiment” and “more data” is more nuanced. Stanley Lazic’s “Experimental Design for Laboratory Biologists” has some discussion along these lines, though I cannot find an explicit “replicate the experiment” recommendation. But his “fundamental equation of experimental design” is useful for thinking about “replicated experiment” vs. “more data”. The outcome includes a component due to “technical effects,” including Technician, Batch, Plate, Cage, Array, Day, Order, and Source (of animals, of chemicals, etc.). I’d say the more of these that differ between data-collection bouts, the more the experiment is “replicated” rather than simply “data added”.

      Examples:
      1. If “more data” is collected by another research lab after the original experimental results are published, then many of the technical effects will differ, one possible exception being “source”. I don’t think anyone would argue that this isn’t a replicated experiment.
      2. If more data is collected by the same research lab one year later, after the original data are published, I don’t think anyone would argue that this isn’t replication, even if the technician is the same (more often this would be done by a different Ph.D. student who is replicating prior work to kickstart a new direction).
      3. If more data is collected the next day by the same technician but using say a newly prepared batch of culture medium, I think many wet-bench biologists would consider this a replicated experiment (the whole experiment was replicated) – at least this is the language used in many papers.
      4. If more data is collected the next day by the same technician but using the same preparation of culture medium, is this a replication or more data? Certainly there is a day random effect.

      Another paper that clarifies something like #3 above is Vaux, D. L., Fidler, F., and Cumming, G. (2012), “Replicates and repeats—what is the difference and is it significant?”, EMBO Reports 13(4), pp. 291–296, but these authors use “replicates” to mean the multiple measures within a cluster.
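
      To see why the distinction matters numerically, here is a minimal sketch (hypothetical numbers, plain numpy) of the worry behind #4: within-batch “replicates” share a batch effect, so treating them as independent repeats understates the standard error.

      ```python
      import numpy as np

      rng = np.random.default_rng(7)
      batches, per_batch = 6, 10      # hypothetical: 6 independent batches, 10 wells each
      sd_batch, sd_well = 1.0, 0.5    # batch effect dominates well-to-well noise

      # Each batch gets its own shift (Day/Batch/Technician rolled into one term).
      y = np.concatenate([rng.normal(rng.normal(0, sd_batch), sd_well, per_batch)
                          for _ in range(batches)])

      se_pretend_iid = y.std(ddof=1) / np.sqrt(y.size)   # treats 60 wells as 60 repeats
      batch_means = y.reshape(batches, per_batch).mean(axis=1)
      se_batch_level = batch_means.std(ddof=1) / np.sqrt(batches)  # batches as the unit

      print(f"SE pretending wells are independent: {se_pretend_iid:.3f}")
      print(f"SE using batches as the unit:        {se_batch_level:.3f}")
      ```

      With these made-up settings the batch-level SE typically comes out around three times the pretend-iid one, which is the pseudoreplication penalty in miniature.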

  5. The textbook author wrote “Replication” where they should have written “Replicability.”

    Replication refers specifically to doing the experiment again. Increasing sample size does not increase replication–although it may increase replicability. You could argue, in theoretical terms, that if the effect shows up n times in a group of size n then you have replicated the effect n times. You *could* argue that, but you’d be wrong.

    In theoretical terms, an effect is the difference between the treatment outcome and the control outcome for a hypothetical individual who has been assigned to both groups. That’s impossible to achieve, so, as Rubin explains, we assign different people to the two conditions, with the long-run expectation that the difference in group means will converge to the average of those hypothetical individual effects. No matter how big your sample, you are estimating only a single effect.
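
    In potential-outcomes notation (a sketch of the standard Rubin setup, nothing beyond what the paragraph above says):

    ```latex
    % Unit-level effect, never observable for any single unit i:
    \tau_i = Y_i(1) - Y_i(0)
    % Under randomization, the difference in group means is unbiased for the
    % average treatment effect and converges to it as n grows:
    \hat\tau = \bar Y_{\text{treated}} - \bar Y_{\text{control}}
      \;\longrightarrow\; \mathbb{E}\left[\,Y(1) - Y(0)\,\right] = \text{ATE}
    % A bigger n sharpens this one estimate; it does not create new replications.
    ```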

    On the more general point: my opinion is that replication is not a principle of experimental design, it’s a stage in the scientific method. It is awesome if you can proceed to this stage prior to publication, but if you don’t, you haven’t violated a design principle.

  6. Is it possible to achieve some of the beneficial effects of repeating the experiment by simply using a held-out test/validation set?

    A researcher could split their dataset in half, do the initial analysis (with all its forking-path and other issues) on the training half, and perform the final statistical test only on the held-out half, in theory only once, to avoid multiple-testing issues; something like the sketch below.
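
    A minimal sketch of that workflow (hypothetical data; scipy’s two-sample t-test standing in for whatever the pre-committed final test would be):

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 400
    group = rng.integers(0, 2, n)             # hypothetical: 0 = control, 1 = treated
    y = 0.2 * group + rng.normal(0, 1, n)     # hypothetical outcome with a small true effect

    # Split once, up front; all exploration (forking paths) stays on the first half.
    half = np.arange(n) < n // 2

    # ... exploratory analysis on y[half], group[half] goes here ...

    # One pre-committed test on the held-out half, run exactly once.
    t, p = stats.ttest_ind(y[~half & (group == 1)], y[~half & (group == 0)])
    print(f"confirmatory t = {t:.2f}, p = {p:.3f}")
    ```

    The catch, as the replies below note, is that both halves still share every study-level quirk (same lab, same protocol, same season), so this protects against forking paths but not against the error terms that a full repeat would resample.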

    • I think the idea behind repeating an experiment is to set it up and tear it down entirely… this is perhaps more intuitive to lab guys… For example in a bio-lab get an entirely different tech to follow the written protocol and carry out all the steps, maybe even at a different location… You learn how much of what happened had to do with the fact that you always had Fred doing the experiment, it was in the winter, the light was mostly artificial, the temperatures were hot and dry indoors due to the heating, vs Marcia doing it in Florida where it’s hot and humid, there’s lots of sunlight… or whatever.

      • Before I entered university, I assumed this was how (lab) science was done. It was later that I found out it didn’t matter as long as the results were published.

  7. IMO repeating the experiment makes just about all criticisms of single p-values vanish too (or BFs, posterior probs, or any favorite statistic of choice). As Mayo writes (paraphrasing here), a single p-value is just an ‘indication’, whether large or small. We need to repeat experiments and look at them over time, or summarized, to get an idea of what is going on for a phenomenon. A-priori-zing your way out of it can work if the a prioris are based on data from experiments in the first place.

    Justin

  8. > So, my advice to researchers is: If you can replicate your study, do so. Better to find the mistakes yourself than to waste everybody else’s time.

    Eh.

    If you are going to talk about ‘repeating the experiment’, isn’t it better for *someone else to repeat your experiment*? The original team repeating the experiment has a strong risk of repeating the mistakes found in the initial experiment. The so-called replication crisis, for instance, is in many cases actually a crisis of generalisation, and re-running some study done on a bunch of students at Untitled University on another cohort of students is highly likely to miss that sort of issue.

    • Zhou:

      Sure, it’s better for others to do the replication. If you can find someone else to replicate your experiment, go for it! But, in the meantime, I suggest replicating the experiment first yourself if you can do so.
