
Field Experiments and Their Critics

Seven years ago I was contacted by Dawn Teele, who was then a graduate student and is now a professor of political science, and asked for my comments on an edited book she was preparing on social science experiments and their critics.

I responded as follows:

This is a great idea for a project. My impression was that Angus Deaton is in favor of observational rather than experimental analysis; is this not so? If you want someone technical, you could ask Ed Vytlacil; he’s at Yale, isn’t he? I think the strongest arguments in favor of observational rather than experimental data are:

(a) Realism in causal inference. Experiments–even natural experiments–are necessarily artificial, and there are problems in generalizing beyond them to the real world. This is a point that James Heckman has made.

(b) Realism in research practice. Experimental data are relatively rare, and in the meantime we have to learn with what data we have, which are typically observational. This is the point made by Paul Rosenbaum, Don Rubin, and others who love experiments, see experiments as the gold standard, but want to make the most of their observational data. You could perhaps get Paul Rosenbaum or Rajeev Dehejia to write a good paper making this point–not saying that obs data are better than experimental data, but saying that much that is useful can be learned from obs data.

(c) The “our brains can do causal inference, so why can’t social scientists?” argument. Sort of an analogy to the argument that the traveling salesman problem can’t be so hard as all that, given that thousands of traveling salesmen solve the problem every day. The idea is that humans do (model-based) everyday causal inference all the time (every day, as it were), and we rarely use experimental data, certainly not the double-blind stuff. I have some sympathy for this argument but also some skepticism (see attached article), but if you wanted someone who could make that argument, you could ask Niall Bolger or David Kenny or some other social psychologist or sociologist who is familiar with path analysis. Again, I doubt they’d say that observational data are better than the equivalent experiment, but they might point out that, realistically, “the equivalent experiment” isn’t always out there, and the observational data are.

(d) This issue also arises in evidence-based medicine. As far as I can tell, there are three main strands of evidence-based medicine: (i) using randomized controlled trials to compare treatments, (ii) data-based cost-benefit analyses (QALYs and the like), (iii) systematic collection and analysis of what’s actually done (i.e., observational data), thus moving medicine into a total quality control environment. You could perhaps get someone like Chris Schmid (a statistician at New England Medical Center who’s a big name in this field) to write an article about this (giving him my sentence above to give you a sense of what you’re looking for).

(e) An argument from a completely different direction is that _experimentation_ is great, but formal _randomized trials_ are overrated. The idea is that these formal experiments (in the style of NIH or, more recently, the MIT poverty lab) would be fine in and of themselves except that they (i) suck up resources and, even more importantly, (ii) dissuade people from doing everyday experimentation that they might learn from. The #1 proponent of this view is Seth Roberts, an experimental psychologist who’s written on self-experimentation.

I’d be happy to write something expanding (briefly) on the above points. I don’t feel competent enough in the area to take any strong positions, but I’d be glad to lay out what I consider to be some important issues that often get lost in the debate.

A few months later I sent in my chapter, which begins:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978): “To find out what happens when you change something, it is necessary to change it.”

At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data.

In the present article, I’ll address the following questions:

1. Why do I agree with the consensus characterization of randomized experimentation as a gold standard?

2. Given point 1 above, why does almost all my research use observational data?

In confronting these issues, we must consider some general issues in the strategy of social science research. We also take from the psychology methods literature a more nuanced perspective that considers several different aspects of research design and goes beyond the simple division into randomized experiments, observational studies, and formal theory.

A few years later the book came out.

I’ve blogged on this all before, but just recently the journal Perspectives on Politics published a symposium with several reviews of the book (from Henry Brady, Yanna Krupnikov, Jessica Robinson Preece, Peregrine Schwartz-Shea, and Betsy Sinclair), and I thought it might interest some of you.

In her review, Sinclair writes, “The arguments in the book are slightly dated . . . Seven years later, there is more consensus within the experimental community about the role experiments play in addressing a research question.” I don’t quite agree with that; I think the issues under discussion remain highly relevant. I hope that soon we shall reach a point of consensus, but we’re not there yet.

I certainly would not want to join in any consensus that includes some of the more controversial TED-talk-style experimental claims involving all the supposedly irrational influences on voting, for example. The key role of experimentation in such work is, I think, not scientific so much as meta-scientific: when a study is encased in an experimental or quasi-experimental framework, it can seem more like science, and then the people at PPNAS, NPR, etc., can take it more seriously. My recommendation is for experimentation, quasi-experimentation, and identification strategies more generally to be subsumed within larger issues of statistical measurement.


  1. Tim Ogden says:

    About the same time I was starting a similar project. It took a few years to get off the ground; I was focused on development economics and wanted to do it as interviews rather than essays, and that book is finally coming out now, actually on Friday. It’s here, from MIT Press:
    Interviewees include the obvious names from development economics, but also Deaton, Pritchett, Judy Gueron and other users and funders of RCTs and field experiments.

  2. Shravan says:

    Great point. I was also trained to think of experiments as the gold standard and (implicitly) the only thing worth doing. This might also relate to the supposed distinction between confirmatory and exploratory analysis. Psych*ists often think that they are doing pure confirmatory hypothesis testing and that has a special status. This is not to say that RCTs and planned experiments should be abandoned of course (in fact, that’s all I will ever do in the next 15 years). Consistent replicability across methodologies and across planned and observational studies seems like the best thing to aim for.

  3. Rahul says:

    I think it’s the wrong framing:

    It’s not about whether experiments are better or worse but about whether you are doing an experiment that mimics with high fidelity the crucial parts of the phenomenon & whether you have the ability to identify and control for the external variables that may impact the results.

    If an experiment means 100 students just because that’s convenient, obviously you get crap.

    • shravan says:

      This is also true. But it is not just about the number of subjects. At least in psych* there are always possible sources of unnaturalness. It’s important to think about ecological validity. I usually ignore this issue when I show subjects ungrammatical sentences, hoping to learn how their parsing system works. In linguistics things can get bizarrely extreme in this respect.

      • Rahul says:

        Indeed, it’s not just about numbers.

        e.g. Extrapolating from whether online survey subjects cheat on a silly $10 task to national honesty indices etc.

        • Shravan says:

          Yeah, that’s a good example of where the “scaling up” from an experimental setting to a real life scenario is very dicey. I hope behavioral economics is paying attention to this problem.

          • Another case in point (though a separate issue) is “action research.” If I do something differently in the classroom and see results, this experiment may inform my practice but is not generalizable to a larger context (without additional research). Action research often doesn’t qualify as observation, since the researcher is participating in and influencing the environment. It isn’t useless–we probably perform some informal variant of it every day–but it needs appropriate (strong) qualification.

            • Elin says:

              I’m always thinking about what my students take away from courses, and I don’t want them to take away the idea that it’s not worth systematically collecting and looking at data on what happens when they try different approaches in their professional contexts after graduation. For example, since some will become elementary school teachers: instead of just feeling like something works (e.g., that having kids write their spelling words every day improves results on the Friday spelling test), systematically record what you did over many weeks and what the results were, and put it on a graph. You’re not going to publish a research article, but that’s not the point of most data collection that people do.

              • Elin,

                I wasn’t suggesting in the least that teachers shouldn’t systematically examine their practices and results! But there can be blind spots even in systematic examination. It’s essential to keep the uncertainties and questions open. It may seem, for instance, from systematic examination that students remember vocabulary words better when they study them in context. But it may also be that students remember (and understand) them even better when they encounter them in context and on a list. So while gathering and analyzing data, one should also analyze the uncertainties.

                I have seen action research projects that basically confirmed what the researcher wanted to find out in the first place. Most of us are susceptible to that, especially when we’re invested in the environment. So it’s important to be as rigorous with the questioning as with the rest.

              • Rahul says:

                I think the key point is to realize that just because *your* data shows something does not automatically mean it is generalizable to all situations.

                Wisdom lies in getting a feel for when results may be generalizable and when not.

              • Shravan says:

                Rahul wrote: “Wisdom lies in getting a feel for when results may be generalizable and when not.”

                But in psych* settings, what does it mean if a result is not generalizable? This is what I think people mean when they write, “the effect was reliable, p less than 0.05”. In the limiting case where every experiment is a unique result, never to be replicated, we have little to talk about.

              • Elin says:


                I know, I actually was mainly agreeing with you. Teachers (and social workers and coffee shop managers) should collect the best data they can in their context, but they should not overgeneralize or think they can decisively show something. They can also do this in better or worse ways, such as doing within-subject multiple measurements over time before and after they try the strategy. Again, not “big science,” but doing the best you can.
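                A minimal sketch of the kind of record-keeping described here, with entirely made-up weekly scores (the numbers and the six-week windows are hypothetical, just for illustration):

```python
# Hypothetical weekly test averages a teacher might record before and
# after trying a new strategy (all numbers made up for illustration).
before = [72, 75, 70, 74, 73, 71]  # six baseline weeks
after = [78, 80, 76, 79, 81, 77]   # six weeks with the new approach

mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)
difference = mean_after - mean_before
print(f"before: {mean_before:.1f}, after: {mean_after:.1f}, "
      f"difference: {difference:.1f}")
```

                Even this simple comparison is a step up from “feeling like something works,” though, as noted above, a before/after difference in one classroom does not by itself rule out rival explanations.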

            • Andrew says:


              I’d categorize action research (as you describe it) as experimentation. As I discuss in my article, experimentation does not have to include randomization. And, as I’ve written elsewhere, a key step in making this work is to have good measurement. In college teaching that I’ve seen, we rarely have good post-treatment measures—final exams are too unstable from year to year and are typically not comparable between classes—and I’ve almost never seen pre-treatment measures. It would not be hard to give a pre-test at the beginning of the semester, but it is rarely done.

              • Rahul says:

                >>>experimentation does not have to include randomization<<<

                When I think about it, most experiments we did in the hard sciences, e.g. physics or chemistry, did not have any randomization at all, right?

                e.g. measuring the Young's modulus, or titrating an unknown solution, or finding the breakdown voltage of a diode, etc.

                There's sometimes a blank experiment to subtract out background, or a calibration run to make sure everything is ok. And almost always there are repeat runs to evaluate internal repeatability.

                But very rarely randomization.
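                The blank-subtraction and repeat-run routine described above can be sketched as follows (all instrument readings are made up for illustration):

```python
# Hypothetical repeat runs of a measurement, plus a blank run used to
# subtract out background (all readings made up for illustration).
import statistics

blank = 0.42                       # background reading with no sample
runs = [3.61, 3.58, 3.64, 3.60]    # repeat runs on the same sample

corrected = [r - blank for r in runs]      # subtract the blank
mean = statistics.mean(corrected)          # best estimate
spread = statistics.stdev(corrected)       # internal repeatability
print(f"corrected mean: {mean:.3f}, repeatability (sd): {spread:.3f}")
```

                Note that none of this involves randomization: the repeat runs check repeatability and the blank removes a systematic background, which is the point being made here about hard-science experiments.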

              • Elin says:

                Final exams are generally not designed to measure teaching or pedagogical effectiveness. Probably most of the time they mainly reward amount of time spent studying. Could there be a teaching -> time -> exam grade connection? Maybe, but it would require some effort to measure.

                Rahul: I think that in physics and chemistry they assume all molecules of a particular type are identical and they do not need to worry about the impact of for example prior life history of a molecule on the results of the experiments. So there is no need to randomize since the point of randomization is to eliminate rival causal factors. This is why physicists hate it when people talk about the chances that you are breathing the same oxygen molecule as Aristotle. They just don’t think you can differentiate them in that way.

              • Martha (Smith) says:

                @ Elin:

                There’s terminology for the distinction you are describing: Different molecules of the same chemical can be considered “fungible,” whereas different humans (or different rats) cannot.

              • Rahul says:


                That’s the crux I think.

                In many Soc. Sci. experiments researchers generalize results without sufficient evidence as to why they would hold to entities not fungible with their experimental subjects.

          • Lisheng says:

            Shravan, I totally agree with you. This issue is particularly important for experimental economics, where models are built on data from mostly unrepresentative sets of stimuli. There is a small voice of “representative design” though (e.g.,

    • Shravan says:

      BTW 100 students is a yuge number in my field! Usually we are very satisfied with 20-30 subjects. Thank god for Type S and M errors or nothing would come out significant with this number.
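      The Type S (wrong sign) and Type M (exaggerated magnitude) errors mentioned here can be illustrated with a quick simulation; the true effect size (0.1 sd) and sample size (n = 25) below are hypothetical choices, picked to resemble the 20-30 subject studies being discussed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a small true effect (0.1 sd) studied with n = 25
# subjects. Among estimates that reach p < 0.05, how often is the sign
# wrong (Type S), and how exaggerated is the magnitude (Type M)?
true_effect, n, sims = 0.1, 25, 50_000
se = 1 / np.sqrt(n)
estimates = rng.normal(true_effect, se, size=sims)

significant = np.abs(estimates) > 1.96 * se
type_s = np.mean(estimates[significant] < 0)  # wrong-sign rate among sig.
exaggeration = np.mean(np.abs(estimates[significant])) / true_effect
print(f"power: {significant.mean():.2f}, Type S: {type_s:.2f}, "
      f"exaggeration: {exaggeration:.1f}x")
```

      With numbers like these, only a small fraction of simulated studies reach significance, and those that do overestimate the true effect several-fold, which is exactly why small-n significant results can be so misleading.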

      • Joe says:

        I think a related problem is that it’s not always clear what is the larger population from which the sample is said to be drawn. I work primarily in sociolinguistics, and while the problem there is one of measurement (social factors are probably continuous, but most work treats them as categorical), it’s at least somewhat clear what the sample is supposed to be a sample of (say, Liverpool English). However, I’m never sure when I read psycholinguistic papers what the sample is supposed to generalise to. I might be wrong about that, but, if I’m not, how do we draw inferences from such samples?

        Any thoughts on that?

  4. Julian says:

    I stumbled across the statement that “most causal questions cannot be addressed through random experimentation” (!) in the new Pearl/Glymour/Jewell primer on Causal Inference. I’m not entirely sure about the “most”, but certainly important questions on mediation/causal mechanisms and quantities like the ATT are not automatically identified even in ideal experiments. It’s irritating that no one in the symposium points this out.

    • Andrew says:


      Yes, for example we can’t design an experiment to directly assess the effects of cigarette smoking because it would be unethical to force people to smoke. And we can’t design an experiment to assess historical questions of the “What would’ve happened if” variety. You find it irritating that no one in the symposium pointed this out, but maybe they didn’t point it out because the point is so well known. For example in my article I wrote, “I think experiments really are a better choice when we can do them.” The phrase “when we can do them” implies that there are settings where we can’t!

    • shravan says:

      Can one buy this book as a pdf, like O'Reilly books? The Kindle version has poorly displayed equations. Wiley sells an ebook, but it's not clear if this is a standalone pdf.

  5. Richard McElreath says:

    We’re discussing one of Angus Deaton’s papers in my department journal club tomorrow. Would have been a good idea to pair it with Andrew’s chapter as well. If readers have other suggestions for similar papers, would love to see them.

  6. BenK says:

    In evolutionary biology, naturally, the challenge of ‘experiment’ vs ‘observation’ comes up pretty regularly. Rich Lenski, not to diminish his contributions, has made a career of imagining an experiment that nobody thought practical to conduct: decades of daily samples.

  7. Keith O'Rourke says:

    > Sinclair writes, “The arguments in the book are slightly dated

    This does sound a bit like “a youth over old age sort of idea. Here’s a[re] new tool[s and collaborations], we can rethink the world with it.” from Ogden’s interview with Deaton

  8. Brian says:

    “(a) Realism in causal inference. Experiments–even natural experiments–are necessarily artificial, and there are problems in generalizing beyond them to the real world. This is a point that James Heckman has made.”

    Where has Heckman made this point? His 2009 Science paper made quite a different argument, e.g.

    “There is also a widespread view that the lab produces unrealistic data, which lacks relevance for understanding the “real world.” This notion has its basis in an implicit hierarchy in terms of generating relevant data, with field data being superior to lab data. We argue that this view, despite its intuitive appeal, has its basis in a misunderstanding of the nature of evidence in science and of the kind of data collected in the lab.”

    • Elin says:

      Reading a pdf version

      “We conclude by restating our argument. Causal knowledge requires controlled variation. In recent years, social scientists have hotly debated which form of controlled variation is most informative. This discussion is fruitful and will continue. In this context it is important to acknowledge that empirical methods and data sources are complements, not substitutes. Field data, survey data, and experiments, both lab and field, as well as standard econometric methods can all improve the state of knowledge in the social sciences. There is no hierarchy among these methods and the issue of generalizability of results is universal to all of them.”

  9. Jonathan Gilligan says:

    Another very insightful and well-written book on experiments is Nancy Cartwright and Jeremy Hardie, “Evidence-Based Policy: A Practical Guide to Doing it Better” (Oxford, 2012). The approach is qualitative, rather than quantitative, but offers (in my view) excellent advice about how to think critically about experiments and inferences drawn from them, especially the challenge of how much one can generalize from certain experiments in certain times and places to guide policy design in a different time and place.
