
Benefits and limitations of randomized controlled trials: I agree with Deaton and Cartwright

My discussion of “Understanding and misunderstanding randomized controlled trials,” by Angus Deaton and Nancy Cartwright, for Social Science & Medicine:

I agree with Deaton and Cartwright that randomized trials are often overrated. There is a strange form of reasoning we often see in science, which is the idea that a chain of reasoning is as strong as its strongest link. The social science and medical research literature is full of papers in which a randomized experiment is performed, a statistically significant comparison is found, and then story time begins, and continues, and continues—as if the rigor from the randomized experiment somehow suffuses through the entire analysis.

Here are some reasons why the results of a randomized trial cannot be taken as representing a general discovery:

1. Measurement. A causal effect on a surrogate endpoint does not necessarily map to an effect on the outcome of interest. . . .

2. Missing data. . . .

3. Extrapolation. The participants in a controlled trial are typically not representative of the larger population of interest. This causes no problem if the treatment effect is constant, but it can lead to bias to the extent that treatment effects are nonlinear or involve interactions. . . .

4. Researcher degrees of freedom. . . .

5. Type M (magnitude) errors. . . .

Each of these threats to validity is well known, but they often seem to be forgotten, or to be treated as minor irritants to be handled with some reassuring words or a robustness study, rather than as fundamental limitations on what can be learned from a particular dataset.
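To make point 5 concrete, here is a small simulation of my own (the numbers are made up for illustration, not taken from the article): in a noisy study, only large estimates clear the significance threshold, so the estimates that get published as "significant" systematically overstate the true effect.

```python
import random
import statistics

random.seed(1)

true_effect = 0.2  # hypothetical true effect
se = 0.5           # standard error of each study's estimate (a noisy design)

# Simulate 10,000 replications of the same noisy study
estimates = [random.gauss(true_effect, se) for _ in range(10_000)]

# Keep only the "statistically significant" results (|z| > 1.96)
significant = [est for est in estimates if abs(est) > 1.96 * se]

power = len(significant) / len(estimates)
exaggeration = statistics.mean(abs(est) for est in significant) / true_effect

print(f"power is roughly {power:.2f}")
print(f"significant estimates overstate the true effect by roughly {exaggeration:.1f}x")
```

With these assumed numbers the power is well under 10 percent, and the statistically significant estimates overstate the true effect several-fold: the significance filter itself manufactures the exaggeration.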

One way to get a sense of the limitations of controlled trials is to consider the conditions under which they can yield meaningful, repeatable inferences. . . .

Where does this all leave us? Randomized controlled trials have problems, but the problem is not with the randomization and the control—which do give us causal identification, albeit subject to sampling variation and relative to a particular local treatment effect. So really we’re saying that all empirical trials have problems, a point which has arisen many times in discussions of experiments and causal reasoning in political science; see Teele (2014). I agree with Deaton and Cartwright that the best way forward is to integrate subject-matter information into design, data collection, and data analysis . . .

Once we recognize the importance of diverse sources of data, statistics can be helpful in making decisions and quantifying uncertainty. . . .

Again, here’s my discussion, and here’s the original article. I assume other discussions are coming soon.


  1. sentinel chicken says:

    Exactly how many RCTs have you designed, conducted, analyzed, interpreted and published in your career? Just curious.

    • Andrew says:


      Here’s what I wrote a few years ago:

      As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978): “To find out what happens when you change something, it is necessary to change it.”

      At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data.

      In the present article, I’ll address the following questions:

      1. Why do I agree with the consensus characterization of randomized experimentation as a gold standard?

      2. Given point 1 above, why does almost all my research use observational data?

      I no longer like the “gold standard” phrase, but it’s still true that nearly all my research uses observational data.

    • Martha (Smith) says:

      Sentinel chicken:

      Andrew’s sentence (in his post), “There is a strange form of reasoning we often see in science, which is the idea that a chain of reasoning is as strong as its strongest link,” is the crux of the matter.

      Also, in many fields (e.g., evolutionary biology), RCTs are (with rare exceptions) impossible.

  2. Carlos Ungil says:

    “So really we’re saying at all empirical trials have problems”

    Should the “at” be “that”?

  3. Anon says:

    Thank you and I agree very much–especially about the story telling.

    But, I do wonder how many of these critiques are unique to RCTs. Put differently, they seem to plague observational studies just as much (sometimes more so, e.g., researchers dof). The one difference between the two, then, is randomization, which, to my mind, is strictly preferable (if possible). Hence, the interest in RCTs.

  4. Thomas B says:

    In all too many cases, the demand for ‘confirmatory’ statistical evidence is a red herring. Consider the tactics employed for decades by the pro-smoking lobby in successfully blocking anti-smoking legislation. By underscoring the lack of statistically significant results from an RCT (the gold standard of proof for many 20th-century hard-core empiricists such as R. A. Fisher), the lobby derailed anti-smoking legislation until more commonsense standards prevailed. Similar tactics are in use today by the anti-global-warming lobby.

    The point is that many of the most important discoveries in the history of science have not relied on either RCTs or tests of significance. For instance, astronomy is a foundational scientific endeavor whose discoveries, by definition, are not and cannot be based on tests of significance. It is impossible to conduct an RCT with the cosmos! (At least, to date. It may be the case that some brilliant astrophysicist will yet figure out how to conduct such an experiment).

    Next, consider John Snow’s map of mid-19th-century London’s cholera epidemics. Nowhere does it contain a test of significance. Regardless, he conclusively demonstrated the loci of contagion, resulting in the elimination of cholera as a threat.

    Then, too, much of Louis Pasteur’s work in bacteriology and hygiene had nothing to do with significance tests (e.g., Bruno Latour’s book, The Pasteurization of France).

    Other examples abound but these few suffice.

    • zbicyclist says:

      I remember one college social science class in which the instructor explained that astronomy wasn’t a science because you couldn’t do experiments. Psychology, of course, was a science because you could do experiments. Even at the time (and as a social science major) this triggered my BS meter (which is why I remember the episode decades later).

    • Bob says:


      for a discussion of significance-test-like statistics in astronomy. Of course, a p-value of something like 10^-6 is somewhat more compelling than one of 0.04—even if one does not believe in significance tests.
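      As a rough check on those orders of magnitude (my own sketch, not part of Bob's comment): the two-sided normal-tail p-value for a “5-sigma” detection, versus a garden-variety z of about 2, can be computed directly from the complementary error function.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

p_astro = two_sided_p(5.0)   # a "5-sigma" detection, as in particle physics/astronomy
p_social = two_sided_p(2.0)  # a typical just-significant social-science result

print(f"5-sigma: p ~ {p_astro:.1e}")   # on the order of 10^-7
print(f"2-sigma: p ~ {p_social:.3f}")  # roughly 0.046
```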


      • Martha (Smith) says:

        Also, if I’m not mistaken, observational astronomy was one of the first fields (possibly the first?) to use random factors — since observations were influenced not just by what the astronomical body of interest was doing, but also by other things such as atmospheric conditions.

        • zbicyclist says:

          And it’s not by accident that this Big Book of Outliers (Barnett and Lewis) has a galaxy on its cover.

          To someone who isn’t an astronomer, accurate measurement seems to be the essence of the science.
          No measurement, no Kepler. No Kepler, maybe no Newton — and so on.

        • Bob says:

          Well, it also appears to have been an inspiration for LMS regression. Wikipedia says

          An early demonstration of the strength of Gauss’ method came when it was used to predict the future location of the newly discovered asteroid Ceres. On 1 January 1801, the Italian astronomer Giuseppe Piazzi discovered Ceres and was able to track its path for 40 days before it was lost in the glare of the sun. Based on these data, astronomers desired to determine the location of Ceres after it emerged from behind the sun without solving Kepler’s complicated nonlinear equations of planetary motion. The only predictions that successfully allowed Hungarian astronomer Franz Xaver von Zach to relocate Ceres were those performed by the 24-year-old Gauss using least-squares analysis.
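          The method in that story is just ordinary least squares. As an illustration (with made-up observations and a made-up linear path — nothing to do with Piazzi’s actual data), one can fit a straight line to 40 noisy tracking points and extrapolate well past the observed window:

```python
import random

random.seed(0)

# Made-up "observations": a body drifting at a constant rate, plus noise.
# (Hypothetical numbers for illustration only.)
def true_path(t):
    return 1.5 + 0.30 * t

t_obs = list(range(40))  # 40 days of tracking
y_obs = [true_path(t) + random.gauss(0, 0.5) for t in t_obs]

# Ordinary least squares for y = a + b*t, in closed form
n = len(t_obs)
t_bar = sum(t_obs) / n
y_bar = sum(y_obs) / n
b = sum((t - t_bar) * (y - y_bar) for t, y in zip(t_obs, y_obs)) / \
    sum((t - t_bar) ** 2 for t in t_obs)
a = y_bar - b * t_bar

# Extrapolate to day 100, after the body is "lost in the glare of the sun"
prediction = a + b * 100
print(f"fit: y = {a:.2f} + {b:.3f} t; predicted position at t = 100: {prediction:.1f}")
```

          The fitted slope lands very close to the assumed 0.30, so the day-100 extrapolation is accurate — the same leverage Gauss got from 40 days of Piazzi’s data, though his problem involved nonlinear orbital dynamics rather than a straight line.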



  5. Paul Alper says:

    This site is an incredibly valuable resource when it comes to medical matters. Its reviewers dig deep to ensure that readers are aware of such matters as:

    1. Was the study done only on mice and not on humans?
    2. Was the risk relative or absolute?
    3. Was the study observational or experimental?
    4. Was the reporting from a PR blurb only?
    5. Were the authors connected in some financial way to the outcome?
    6. The cost and the harms associated with the study
    7. What was not reported

    Two items posted on the website today illustrate this. In one, we find:

    In other words, MigraineX was hardly “clinically proven” to work. And none of the news stories we saw pointed out any of these caveats to readers. After looking at the study, Adam Cifu, MD, didn’t mince words. Via email, he described it as “utterly worthless.” Cifu is a professor of medicine at the University of Chicago and a contributor to the site.

    and in the other, we read:

    The push for ‘early detection’ leads to more scrutiny-dependent cancers being found which, in turn, gives the false impression of an increased incidence of some cancers.
    Aggressive screening of the family members of someone with cancer means more cancer will be found. This could give the impression of family history being more of a risk factor than it may actually be.

  6. Mark Palko says:

    I think “Randomized controlled trials have problems, but the problem is not with the randomization and the control—which do give us causal identification, albeit subject to sampling variation and relative to a particular local treatment effect. So really we’re saying that all empirical trials have problems” badly frames the central question. Randomized controlled trials are not just “overrated”; they are often seen as the final word in scientific research. Their status as a “gold standard” makes their potential flaws especially dangerous.

    It’s true that a good, well-designed RCT beats pretty much everything else, but we need to get across to the general public and the press the idea that a weak RCT might be less convincing than strong observational studies (and that a really bad RCT can be worth less than anecdotal data and common sense).

  7. Mark Palko says:

    We also need to throw in something about peer/placebo effects and non-blinded RCTs. I’ve seen numerous writers explicitly equate studies where the subjects interact and know who’s in the treatment group with double blinded drug trials. Worse yet, I’ve seen researchers either ignore potential peer and placebo effects or wave them away.

    One common technique for the latter is to come up with an unrealistically narrow definition, then say “we couldn’t find it, so therefore it does not exist.” (Not finding a relationship you don’t want to find is remarkably easy.)

  8. Uri says:

    One more reason why the results of an RCT cannot be taken as representing a general discovery, at least in medicine: the way treatment is administered in an RCT might differ from everyday practice. In a surgical procedure, more care might be taken to do it “just right.” In medication studies, dosage might be more strictly controlled. For medical devices, technical assistance from the manufacturer might be more readily available.

  9. Sander Greenland says:

    Great discussion; sorry for the tardy posting here. However, this guest post by Senn (who has done many RCTs) takes up some important technical issues with Deaton & Cartwright that I didn’t see discussed above:
