
I think that science is mostly “Brezhnevs.” It’s rare to see a “Gorbachev” who will abandon a paradigm just because it doesn’t do the job. Also, moving beyond naive falsificationism

Sandro Ambuehl writes:

I’ve been following your blog and the discussion of replications and replicability across different fields daily, for years. I’m an experimental economist. The following question arose from a discussion I recently had with Anna Dreber, George Loewenstein, and others.

You’ve previously written about the importance of sound theories (and the dangers of anything-goes theories), and I was wondering whether there’s any formal treatment of that, or any empirical evidence on whether empirical investigations based on precise theories that simultaneously test multiple predictions are more likely to replicate than those without theoretical underpinnings, or those that test only isolated predictions.

Specifically: Many of the proposed solutions to the replicability issue (such as preregistration) seem to implicitly assume one-dimensional hypotheses such as “Does X increase Y?” In experimental economics, by contrast, we often test theories. The value of a theory is precisely that it makes multiple predictions. (In economics, theories that explain just one single phenomenon, or make one single prediction, are generally viewed as useless and are highly discouraged.) Theories typically also specify how their various predictions relate to each other, often even regarding magnitudes. They are formulated as mathematical models, and their predictions are correspondingly precise. Let’s call a within-subjects experiment that tests a set of predictions of a theory a “multi-dimensional experiment”.

My conjecture is that all the statistical skulduggery that leads to non-replicable results is much harder to do in a theory-based, multi-dimensional experiment. If so, multi-dimensional experiments should lead to better replicability even absent safeguards such as preregistration.

The intuition is the following. Suppose an unscrupulous researcher attempts to “prove” a single prediction that X increases Y. He can do that by selectively excluding subjects with low X and high Y (or high X and low Y) from the sample. Compare that to a researcher who attempts to “prove”, in a within-subject experiment, that X increases Y and A increases B. The latter researcher must exclude many more subjects until his “preferred” sample includes only subjects that conform to the joint hypothesis. The exclusions become harder to justify, and more subjects must be run.

A similar intuition applies to the case of an unscrupulous researcher who tries to “prove” a hypothesis by messing with the measurements of variables (e.g. by using log(X) instead of X). Here, an example is a theory that predicts that X increases both Y and Z. Suppose the researcher finds a null result when regressing Y on X, but finds a positive correlation between f(X) and Y for some selected transformation f. If the researcher only “tested” the relation between X and Y (a one-dimensional experiment), the researcher could now declare “success”. In a multi-dimensional experiment, however, the researcher will have to dig for an f that doesn’t only generate a positive correlation between f(X) and Y, but also between f(X) and Z, which is harder. A similar point applies if the researcher measures X in different ways (e.g. through a variety of related survey questions) and attempts to select the measurement that best helps “prove” the hypothesis. (Moreover, such a theory would typically also specify something like “If X increases Y by magnitude alpha, then it should increase Z by magnitude beta”. The relation between Y and Z would then present an additional prediction to be tested, yet again increasing the difficulty of “proving” the result through nefarious manipulations.)
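The transformation-hunting intuition above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not anything from the correspondence: X, Y, and Z are generated as independent noise (so every "finding" is spurious), the menu of candidate transformations in `TRANSFORMS` is hypothetical, and significance is judged with a normal approximation to the t-test. It compares how often a researcher can find some f with a significantly positive correlation between f(X) and Y alone, versus some single f that works for both Y and Z.

```python
import math
import random


def pearson_r(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def significantly_positive(a, b, z_crit=1.96):
    """True if the correlation is positive and 'significant'.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) compared against a normal
    critical value, which is an approximation for moderate n.
    """
    r = pearson_r(a, b)
    t = r * math.sqrt((len(a) - 2) / (1.0 - r * r))
    return t > z_crit


# Hypothetical menu of transformations the researcher is willing to try.
TRANSFORMS = [
    lambda x: x,
    math.log,
    math.sqrt,
    lambda x: x * x,
    lambda x: 1.0 / x,
]


def simulate(n_sims=2000, n=100, seed=0):
    """Return (one_dim_rate, multi_dim_rate) of spurious 'successes'."""
    rng = random.Random(seed)
    one_dim = 0
    multi_dim = 0
    for _ in range(n_sims):
        # Assumption: X is positive so log and sqrt are defined;
        # Y and Z are pure noise, unrelated to X and to each other.
        x = [rng.uniform(1.0, 10.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        transformed = [[f(v) for v in x] for f in TRANSFORMS]
        hits_y = [significantly_positive(fx, y) for fx in transformed]
        hits_z = [significantly_positive(fx, z) for fx in transformed]
        # One-dimensional: any f that "works" for Y is enough.
        one_dim += any(hits_y)
        # Multi-dimensional: the same f must "work" for both Y and Z.
        multi_dim += any(hy and hz for hy, hz in zip(hits_y, hits_z))
    return one_dim / n_sims, multi_dim / n_sims


if __name__ == "__main__":
    one, multi = simulate()
    print(f"one-dimensional false-positive rate:   {one:.3f}")
    print(f"multi-dimensional false-positive rate: {multi:.3f}")
```

In this setup the joint criterion can never succeed more often than the single one, since any f that works for both outcomes also works for Y alone. The sketch covers only the transformation route and treats the outcomes as independent; it does not model the freedom to redefine what counts as "success" after the fact.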

So if there is any formal treatment relating to the above intuitions, or any empirical evidence on what kind of research tends to be more or less likely to replicate (depending on factors other than preregistration), I would much appreciate if you could point me to it.

My reply:

I have two answers for you.

First, some colleagues and I recently published a preregistered replication of one of our own studies; see here. This might be interesting to you because our original study did not test a single thing, so our evaluation was necessarily holistic. In our case, the study was descriptive, not theoretically-motivated, so it’s not quite what you’re talking about—but it’s like your study in that the outcomes of interest were complex and multidimensional.

This was one of the problems I’ve had with recent mass replication studies, that they treat a scientific paper as if it has a single conclusion, even though real papers—theoretically-based or not—typically have many conclusions.

My second response is that I fear you are being too optimistic. Yes, when a theory makes multiple predictions, it may be difficult to select data to make all the predictions work out. But on the other hand you have many degrees of freedom with which to declare success.

This has been one of my problems with a lot of social science research. Just about any pattern in data can be given a theoretical explanation, and just about any pattern in data can be said to be the result of a theoretical prediction. Remember that claim that women were three times more likely to wear red or pink clothing during a certain time of the month? The authors of that study did a replication which failed–but they declared it a success after adding an interaction with outdoor air temperature. Or there was this political science study where the data went in the opposite direction of the preregistration but were retroactively declared to be consistent with the theory. It’s my impression that a lot of economics is like this too: If it goes the wrong way, the result can be explained. That’s fine—it’s one reason why economics is often a useful framework for modeling the world—but I think the idea that statistical studies and p-values and replication are some sort of testing ground for models, the idea that economists are a group of hard-headed Popperians, regularly subjecting their theories to the hard test of reality—I’m skeptical of that take. I think it’s much more that individual economists, and schools of economists, are devoted to their theories and only rarely abandon them on their own. That is, I have a much more Kuhnian take on the whole process. Or, to put it another way, I try to be Popperian in my own research, I think that’s the ideal, but I think the Kuhnian model better describes the general process of science. Or, to put it another way, I think that science is mostly “Brezhnevs.” It’s rare to see a “Gorbachev” who will abandon a paradigm just because it doesn’t do the job.

Ambuehl responded:

Anna did have a similar reaction to you—and I think that reaction depends much on what passes as a “theory”. For instance, you won’t find anything in a social psychology textbook that an economic theorist would call a “theory”. You’re certainly right about the issues pertaining to hand-wavy ex-post explanations as with the clothes and ovulation study, or “anything-goes theories” such as the Himicanes that might well have turned out the other way.

By contrast, the theories I had in mind when asking the question are mathematically formulated theories that precisely specify their domain of applicability. An example of the kind of theory I have in mind would be Expected Utility theory (tested in countless papers, e.g. here). Another example of such a theory is the Shannon model of choice under limited attention (tested, e.g., here). These theories are in an entirely different ballpark than vague ideas like, e.g., self-perception theory or social comparison theory that are so loosely specified that one cannot even begin to test them unless one is willing to make assumptions on each of the countless researcher degrees of freedom they leave open.

In fact, economic theorists tend to regard the following characteristics as virtues, or even necessities, of any model: precision (can be tested without requiring additional assumptions), parsimony (and hence, makes it hard to explain “uncomfortable” results by interactions etc.), generality (in the sense that they make multiple predictions, across several domains). And they very much frown upon ex post theorizing, ad-hoc assumptions, and imprecision. For theories that satisfy these properties, it would seem much harder to fudge empirical research in a way that doesn’t replicate, wouldn’t it? (Whether the community will accept the results or not seems orthogonal to the question of replicability, no?)

Finally, to the extent that theories in the form of precise, mathematical models are often based on wide bodies of empirical research (economic theorists often try to capture “stylized facts”), wouldn’t one also expect higher rates of replicability because such theories essentially correspond to well-informed priors?

So my overall point is, doesn’t (good) theory have a potentially important role to play regarding replicability? (Many current suggestions for solving the replication crisis, in particular formulaic ones such as pre-registration, or p<0.005, don't seem to recognize those potential benefits of sound theory.)

I replied:

Well, sure, but expected utility theory is flat-out false. Much has been written on the way that utilities only exist after the choices are given. This can even be seen in simple classroom demonstrations, as in section 5 of this paper from 1998. No statistics are needed at all to demonstrate the problems with that theory!

Ambuehl responded with some examples of more sophisticated, but still testable, theories such as reference-dependent preferences, various theories of decision making under ambiguity, and perception-based theories, and I responded with my view that all these theories are either vague enough to be adaptable to any data or precise enough to be evidently false with no data collection needed. This was what Lakatos noted: any theory is either so brittle that it can be destroyed by collecting enough data, or flexible enough to fit anything. This does not mean we can’t do science, it just means we have to move beyond naive falsificationism.

P.S. Tomorrow’s post: “Boston Globe Columnist Suspended During Investigation Of Marathon Bombing Stories That Don’t Add Up.”


  1. So are any decision theories deemed sound? I ask because earlier in Andrew’s post, Sandro suggested that Andrew had highlighted the importance of sound theories.

    If, as Andrew characterizes them, theories are either vague enough to adapt to any data or evidently false, then why even use citations, which serve as evidence for the inferences and claims made?

  2. Newtonian mechanics is false; both General Relativity and quantum effects falsify it. But it’s still the theory used to launch rockets, build dams and bridges, design cars, helicopters, drones, and so forth…

    There’s nothing wrong with being false provided the size of the error is sufficiently small. This is why “testing” is a dead fish and estimation is the soul of Statistics.

  3. gec says:

    I heartily agree with your correspondent on the value of good theories, but it seems to me that the essential connection between good theory AND good conclusions is good experimental design. In other words, a good theory is one for which it is possible to find an experiment in which that theory makes a set of predictions that uniquely identifies it. Of course, this has to be coupled to good measurement as well since a precise prediction is no good unless the data are equally precise with regard to the quantities of interest.

    Various realms of cognitive psychology operate this way—though I admit they are not the most popular realms—especially certain segments of psychophysics, category learning, and memory. The Systems Factorial Technology framework by James Townsend and colleagues exemplifies this approach. Phil Smith and Dan Little have a recent article that discusses the value of theory in the context of the replication crisis, arguing that “power” comes not from large numbers of subjects (which now seems to be the fad in behavioral economics) but from a tight link between theory, data, and measurement as described above.

    • Andrew says:


      Yes. Design, including measurement. One of the problems is that when statisticians and quantitative researchers talk about “design,” they typically focus entirely on random sampling, randomized experiments, and causal identification, with nothing at all on measurement. And then researchers get the idea that if they have a randomized experiment, any sorts of measurements are OK.

      • Actually, Fisher wrote late in his career that he had missed the real priority of design: lessening the dependence on assumptions rather than optimization. Requiring fewer assumptions about what the measurements reflect seems to be in that category.

        • gec says:

          It’s a funny thing about measurement—sometimes, I think of it as a property of experimental design but sometimes I think about it in the context of modeling/theory. Obviously the outcome variables we select and the way we collect them are aspects of experimental design, but the logical connection between those variables and the theoretical constructs of the model seem like they should be considered under the heading of “theory”. That is, a theory should specify exactly how you get from its constructs (like intelligence, memory strength, mass) to measured outcome variables (like IQ scores, response times, or weight) and this requires specifying how the measurement actually works as part of the theory.

  4. Peter Dorman says:

    OK, this is a moment for an intemperate comment. It’s appalling to me that anyone who works with any wrinkle of utility theory would opine about what a good theory is. The models I’m familiar with are all epiphenomenal on expected utility maximization. Some formulation of U-max is the starting point, and then various mechanisms are grafted on to better predict behavioral “anomalies”. And yes, if you are constantly adding new such mechanisms or adjusting existing ones you can “explain” observed behavior, at least until your next empirical context comes along.

    This does not make U-max a good starting point for theorizing.

    One of the essential characteristics of a good theory in any discipline is its consistency with what is known or well supported in other fields that take aspects of that theory as a subject matter. Good chemistry can’t contradict physics. Good biology can’t contradict chemistry. Good theories of economic behavior can’t contradict psychology. Correct me if I’m wrong, but while U-max may have a certain as-if currency in some branches of psychology, the utility frame as understood by economists (a universal preference mapping) has no status at all in psychology as a basis for understanding human or other animal behavior. I think Gigerenzer has a better take on economic psychology than any anomalies-in-utility-maximization theorist I’ve encountered. The time will come when welfare economics in its entirety will be viewed as a giant embarrassment; of course it may be a ways off.

    Now this does not address the main topic of the post, so I apologize. Back to regularly scheduled programming.

    • > Good theories of economic behavior can’t contradict psychology

      I disagree. A good theory of psychological origins of what humans actually do can’t contradict psychology, but a prescriptive theory of economic behavior (a theory of what people SHOULD do) can be perfectly good even if it doesn’t describe what people actually do.

      Lots of what people do is *terrible* decision making, and if it were explained to them carefully they’d admit that they made terrible decisions and would prefer to have made the better ones. So evidently what people actually do isn’t necessarily what they would want to do if they knew about better ways of making decisions.

      Utility theory is a perfectly good theory about how to make decisions in such a way that things come out reliably better than other ways of making decisions…. This doesn’t mean it has to try to describe how people actually make decisions. I’d agree it probably is a bad way to describe how people really do make decisions.

      • It’s a bit like saying Newtonian mechanics is a bad way to understand the world because it doesn’t correspond to what people’s intuition tells them about how objects behave…. If your goal is to describe people’s “folk physics” and what kinds of predictions they make about golf balls, then Newtonian mechanics isn’t necessarily great. But if your goal is to figure out where the golf ball actually goes… Newton’s laws do a good job.

        And yes, I realize that some people do see U-max as an economic theory that describes what people do… I’m just saying even if it does a bad job at that, it can still be a good theory of how one *should* make decisions.

        • gec says:

          Not to dig too much on Expected Utility, but I’d say it’s not a very useful way to even describe what people *should* do largely because it doesn’t say anything about where utilities come from.

          The existence of context effects makes clear that “utility” even within a single person is not a stable construct. That is, the relative order of expressed preferences within a set of options depends on what other options are available, so the only way to interpret this within EU is to say that the utilities are changing depending on the choices available.

          And, of course, utility depends on the goal, the classic problem that “optimality” is relative: the “utility” of eating sea urchin may be low relative to a peanut butter sandwich if your goal is to relieve hunger but might be high relative to the sandwich if your goal is to impress someone with your iron stomach. The point is that for EU to give an answer, you have to make recourse to a theory of preferences that EU doesn’t provide.

          So maybe an analogy is that EU is like Newtonian mechanics but where the masses of objects depend on the masses of other objects nearby in a non-straightforward way. Within well-specified problems, EU can be a reasonable way to look at things, but as a general theory of either how people *do* or even how people *should* behave, I don’t find it very helpful.

          • EU tells you a way you should behave if you can come up with a utility that you recognize represents stable preferences. There are lots of situations where this can be true if you work at it enough to elicit the appropriate preferences. For example, if you’re trying to build a series of 100 buildings in various locations in the presence of changing prices and availabilities of labor, concrete, steel, wood, and varying environmental conditions. Suppose you have 7 different blueprints available to you. How can you choose which one to construct at the particular site you need it today?

            That people’s preferences change with time, context, location, etc. should be unsurprising. But aspects of these preferences are stable. There are lots of complicated issues in practical Newtonian mechanics as well. You don’t know what kinds of storms your building will be subject to, for example… but you can still make estimates of, say, worst-case wind loads and snow loads and so forth.

            • gec says:

              But then it seems like all the work is being done by the theory of the utilities, so what is left to be explained by EU? That’s what I mean when I say I don’t find it helpful.

              • I don’t know that EU has to “explain” anything. I think it’s a potentially empirically decidable question as to whether people who routinely apply EU to making decisions wind up happier and “better off” after many decisions than people who use other techniques…

                EU is kind of like Newtonian mechanics in that it does a good job of solving everyday problems, even if there are conditions where it could break down.

              • gec says:

                Ah, I think we’re talking about different things, i.e., we have different utility functions.

                I assign utility to things that might help me explain how people in general do make decisions, and I don’t find EU to score that high on that function for the reasons I pointed out above (not zero, just that I think there are better options out there, like decision field theory, though that also requires being hooked up to a separate theory of utilities).

                But it is entirely reasonable to assign utility to ways of *framing* decisions in order to encourage people to make better decisions (or to increase their reported satisfaction), and there EU has the virtues of being simple and transparent while also being motivated by a set of “rational” arguments that most people can buy into without much trouble.

                And Newton gave us ways to compute expectations and do numerical optimization too, so maybe he can get partial credit for EU too?

  5. Bergenholtz says:

    Someone outside economics has written a paper on this very question:

    Furthermore, I can’t help but note that cultural evolution is a field that aims not just to build (mathematically grounded) theories but also explain why humans behave as they do – rather than just accepting it as something inherently (rational).

  6. Matt Skaggs says:

    Ambuehl asked:

    “[Is] there any formal treatment relating to the above intuitions, or any empirical evidence on what kind of research tends to be more or less likely to replicate[?]”

    …apparently the answer is “no,” or it would have appeared long ago in this blog. This is very similar to a question I posed on an earlier thread about different types of evidence, which was similarly ignored. Instead folks go off on a tangent about the example of a theory, Expected Utility Theory. Even the headline of the post is tangential to the question, which has nothing to do with paradigms.

    Lots of good topics come up in this blog, but they are never really discussed in any meaningful way, and this column is Exhibit A. The worst part is the way language is abused. “Theory” means nothing if every field defines it differently. “Model” means nothing, or everything, how do you vote?

    I would argue that the field of social psychology only exists because there is no “formal treatment relating to the above intuitions.” If there were, psychologists would be working much harder to answer much simpler questions.

  7. Michael G says:

    > ‘My conjecture is that all the statistical skulduggery that leads to non-replicable results is much harder to do in a theory-based’

    It seems many agree with this statement. In the first round of the SCORE prediction markets, we asked for expectations of field-specific replication rates.

    Economics had a much higher expected replication rate (58%), compared with psychology (45%), political science (46%) or marketing and management (36%). For context, the overall replication rate was expected to be 48%.

    Will be interesting to see how accurate our forecasters are.

    • Anoneuoid says:

      Anything less than 50% means actually collecting the data is a waste of resources. We would be better off if they flip a coin to decide.

      • Andrew says:


        The problem here is with the definition of “replication rate,” which assumes (a) that a study can be summarized by a single true-false claim, and (b) that a successful replication is one that gives a positive and statistically significant result. Thus the theoretical replication rate could be as low as 2.5%, which is pretty much what I’d expect to see for independent replications of various well-publicized studies by Bem, Kanazawa, etc.

        • Anoneuoid says:

          > the theoretical replication rate could be as low as 2.5%

          I agree that the definition of “replicated” used is bad. However, I don’t follow how you got this. Are you assuming very low power for the replication studies?

          • Andrew says:


            If a study is pure noise (as I think is the case for the work of Bem, Kanazawa, and various other researchers with noisy data and heads-I-win, tails-you-lose theories) then the probability of a replication being in the expected direction and statistically significant is 2.5%. To have an expected replication rate of lower than 2.5% would require some sort of active seeking of error. In practice I’d expect to see replication rates ranging from 2.5% to 100%.

            The replication projects found replication rates in the 50% range, which is actually a pretty high number, but I think that’s because the replication studies had high sample sizes. In the limit of infinite sample sizes for the replication studies, we’d expect replication rates between 50% and 100%.

            One of the difficulties in discussing replication rates is the dependence on sample size and measurement precision. Also it doesn’t help when respected journals publish erroneous statements about replication rates (see here for the story of one such example).

            • Anoneuoid says:

              > If a study is pure noise (as I think is the case for the work of Bem, Kanazawa, and various other researchers with noisy data and heads-I-win, tails-you-lose theories) then the probability of a replication being in the expected direction and statistically significant is 2.5%. To have an expected replication rate of lower than 2.5% would require some sort of active seeking of error. In practice I’d expect to see replication rates ranging from 2.5% to 100%.

              Bem is an exception because people really do believe “exactly zero ESP ability exists”, but other than that “pure noise” to me means there is still some small but meaningless (and probably unstable) correlation between the intervention and effect. Of course, you can still have systematic errors or just plain p-hacking (which is just another way to render the null model false).

              > The replication projects found replication rates in the 50% range, which is actually a pretty high number, but I think that’s because the replication studies had high sample sizes. In the limit of infinite sample sizes for the replication studies, we’d expect replication rates between 50% and 100%.

              This sounds more like what I would assume. If the first study is always going to be statistically significant, and the replication study is powered near 100%, then ~50% of these studies will result in statistical significance in the same direction.

              • Andrew says:


                But when effect sizes are small and measurements are noisy, even apparently large sample sizes are useless. For example, a serious analysis of sex ratios (Kanazawa’s target of study) would require data on hundreds of thousands of babies. A sample size of 3000 (that’s what Kanazawa had) might sound like a lot, but for this purpose it’s not. That’s why I’d expect approx 2.5% chance of finding a statistically significant effect in the predicted direction from an independent preregistered replication of any of his papers on this topic.

              • Anoneuoid says:

                > That’s why I’d expect approx 2.5% chance of finding a statistically significant effect in the predicted direction from an independent preregistered replication of any of his papers on this topic.

                So you expect a 2.5% replication rate for studies with near zero power to detect the supposed effect. At the other extreme, we expect a 50% replication rate for studies with near 100% power to detect the supposed effect. The actual result should fall somewhere in between depending on the expected standardized effect size and sample size (ie. power).

                Running an underpowered replication is a waste of time to begin with. You already have the expected effect size from the first study so I don’t see what excuse there could be?
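The 2.5% figure in the thread above follows from a standard normal calculation: if the replication estimate is roughly normal with standard error `se`, the chance that it is both in the predicted direction and significant at the two-sided 5% level is 1 − Φ(1.96 − effect/se). The sketch below works this out. The function name and the sex-ratio numbers are illustrative assumptions, not values from the post or from any of the studies mentioned: the 0.3 percentage point true difference is hypothetical, and the sample of 3000 is assumed to be split into two equal groups.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_significant_in_predicted_direction(true_effect, se, z_crit=1.96):
    """P(estimate / se > z_crit) when estimate ~ Normal(true_effect, se).

    true_effect is signed relative to the predicted direction:
    positive means the prediction is right, negative means it is wrong.
    """
    return 1.0 - normal_cdf(z_crit - true_effect / se)


if __name__ == "__main__":
    # Pure noise: the true effect is exactly zero.
    print(prob_significant_in_predicted_direction(0.0, 1.0))

    # Hypothetical sex-ratio comparison: 3000 births split into two
    # groups of 1500, proportion of girls near 0.5 in each group.
    n_per_group = 1500
    se_diff = math.sqrt(0.25 / n_per_group + 0.25 / n_per_group)
    assumed_true_diff = 0.003  # assumption: 0.3 percentage points
    print(prob_significant_in_predicted_direction(assumed_true_diff, se_diff))

    # Same assumed effect with 500,000 births per group.
    se_big = math.sqrt(0.25 / 500_000 + 0.25 / 500_000)
    print(prob_significant_in_predicted_direction(assumed_true_diff, se_big))
```

Under these assumptions the probability with 3000 births stays only slightly above the pure-noise floor, and it rises substantially only when the sample reaches the hundreds of thousands. As the ratio `true_effect / se` grows, the probability tends to 1 when the predicted sign is right and to 0 when it is wrong, which is one way to read the "between 50% and 100%" range for very large replications.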
