Megan Higgs (statistician) and Anna Dreber (economist) on how to judge the success of a replication

The discussion started with this comment from Megan Higgs regarding a recent science replication initiative:

I [Higgs] was immediately curious about their criteria for declaring a study replicated. From a quick skim of the info in the Google form, here it is:

In the survey of beliefs, you will be asked for (a) the probability that each of the 20 studies using each method will be successfully replicated (defined by the finding of a statistically significant result, defined by p < 0.05, in the same direction as in the original study in a meta-analysis of the three replications) and (b) the expected effect size for the replication.
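
For concreteness, here is a minimal sketch (in Python, not the project's actual code) of how that binary criterion might be checked, assuming the three replication estimates are pooled with a fixed-effect, inverse-variance meta-analysis and tested with a normal approximation; the function names and numbers are purely illustrative:

    # Minimal sketch, not the project's actual code: is the pooled replication
    # effect "significant" (two-sided p < 0.05, normal approximation) with the
    # same sign as the original estimate? Pooling is a fixed-effect
    # (inverse-variance) meta-analysis of the three replication estimates.
    import math

    def meta_analyze(estimates, std_errors):
        weights = [1 / se**2 for se in std_errors]
        pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
        return pooled, math.sqrt(1 / sum(weights))

    def replicates_binary(original_estimate, rep_estimates, rep_std_errors, alpha=0.05):
        pooled, pooled_se = meta_analyze(rep_estimates, rep_std_errors)
        z = pooled / pooled_se
        p_two_sided = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        same_sign = (pooled > 0) == (original_estimate > 0)
        return p_two_sided < alpha and same_sign

    # Hypothetical numbers, for illustration only.
    print(replicates_binary(0.40, [0.35, 0.10, 0.25], [0.12, 0.15, 0.10]))  # True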

Hmmm… Maybe a first step before going too far with this should be a deeper discussion of how to define “successfully replicated”?

Higgs had further discussion in comments with Olavo Amaral (one of the organizers of the project), and then I brought in Anna Dreber, another organizer, who wrote:

In the past when we have set up prediction markets on this binary outcome (effect in the same direction as original result and p<0.05 vs not) or effect sizes (for example in terms of Cohen's d), participants have shied away from the effect size ones and wanted to put their money in the binary ones. What do you think would be a better alternative to the binary one, if not effect sizes? The Brazilian team discussed the “95 percent prediction interval” approach (Patil, Peng and Leek, 2016) but I think that's even more problematic. Or what do you think?

I replied that I’m not sure what the best thing to do is. Maybe make a more complicated bet with a non-binary outcome? People could still bet on “replicate” vs. “non-replicate,” but the range of payoffs could be continuous? I think it’s worth thinking about this—maybe figuring something out and writing a theoretical paper on it—before doing the next replication study.

Dreber responded:

We just closed a new prediction market on effect sizes where we gave participants separate endowments from the binary markets – we’ll see if that encouraged trading. Another thing we saw in the Social Science Replication Project is that various proposed replication indicators led to almost the same outcomes when we had pretty high power. For example, for the studies that did not replicate according to the binary criterion “effect in the same direction as the original result and p<0.05”, average relative effect sizes were around 0, whereas for studies that did replicate according to this criterion, average relative effect sizes were around 75%. “We” (this was something Wagenmakers worked on) had similar results with Bayesian mixture models, for example.
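
For readers unfamiliar with the term, here is a minimal sketch of a relative effect size as the replication estimate divided by the original estimate; the exact definition used in these projects may differ (for example, it may be based on standardized effect sizes), and the numbers below are made up:

    # Minimal sketch: a "relative effect size" of 0 means no effect in the
    # replication, 1 means an effect of the same size as the original.
    def relative_effect_size(original_effect, replication_effect):
        return replication_effect / original_effect

    pairs = [(0.50, 0.02), (0.40, 0.31), (0.30, 0.22)]  # (original, replication), made up
    ratios = [relative_effect_size(o, r) for o, r in pairs]
    print(sum(ratios) / len(ratios))  # average relative effect size, about 0.52 here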

And then Higgs wrote:

I really haven’t thought about the market, betting, and crowd-sourcing aspects. I’m coming at this purely from a statistical inference perspective and based on my frustrations and worries around the misuse of statistical summaries (like p-values and point estimates) motivated by trying to make life simpler.

At the risk of preaching to the choir, I’ll just give a little more of my view on the concept of replication. The goal of replicating studies is not to result in a nice, clean binary outcome indicating whether the results match, or not. After a study is replicated (by this I just mean the design and analysis repeated), it is then a serious (and probably difficult in most cases) job to assess the consistency of the results between the first study and the second (taking into account things that changed). This checking of degree of consistency does not necessarily result in a clean binary outcome of either “successful” replication or “not successful” replication. We humans love to force dichotomization on things that are not inherently dichotomous — it brings comfort in terms of making things clearly tractable and simple, and we rarely acknowledge or discuss what we are losing in the process. I am very worried about the implications of tying great efforts, like yours, at promoting the practice of replicating studies to arbitrary p-value thresholds and signs of point estimates — even as statisticians continue to point out fundamental flaws in these overly simplified approaches. Do we really want to adopt a structure that assumes the results of a repeated study match the first study, or not, with nothing in between? In reality, most situations will take effort to critically assess and come to a reasonable conclusion. I also think we should encourage healthy argument about the degree to which the results are consistent depending on their analysis and interpretation — and such argument would be great for science. Okay — I’ll get off my soapbox now, but just wanted to give you a better sense of where I’m coming from.

For your project, I definitely see why the binary distinction seems necessary and why having a well-defined and understood criterion (or set of criteria) is attractive. Maybe the simplest solution for now is simply not to attach the misleading labels of “successful replication” and “unsuccessful replication” to the criterion. We know the criteria you are proposing have serious faults related to assessing whether the results of a repeated study adequately match the results from an original study — and I think wording is incredibly important in this situation. I can get myself to the point where I see the fun in having people predict whether the p-value will be <0.05 and/or whether the sign of some point estimate will be the same. But I see this more as a potentially interesting look into our misunderstandings and misuses of those criteria, assuming the overly simplistic binary predictions could be shown against the backdrop of a more holistic and continuous approach to assess the degree to which the results of the two studies are consistent. So, I guess my proposed easy-fix solution is to change the wording. You are not having people predict whether a study will be “successfully replicated” or not, you are having them predict whether the p-values will fall on the same side of an arbitrary threshold and/or whether the point estimate will have the same sign. [emphasis added] This can provide interesting information about the behavior of humans relative to these “standard” criteria that have been used in the past, but it will not provide information about how well researchers can predict “successful replication” in general. You might be able to come up with another phrase to capture these criteria? With different wording, you are not sending as strong a message that you believe these are reasonable criteria to judge “successful replication.” I suspect my suggestion is not overly attractive, but I’m trying to throw something out there that is definitely doable and gets around some of the potential negative unintended consequences that I’m worried about. I haven’t thought a lot about the issues with using a percentage summary for average relative effect sizes either, but it would be interesting to look into more. In my opinion, the assessment should really be tied to practical implications of the work and an understanding of the practical meaning represented by different values of the parameter of interest — and this takes knowledge of the research context and potential future implications of the work.

On a related note, I’m not sure why getting consistent results is a “success” in the process of replicating a study — the goal is to repeat a study to gain more information, so it’s not clear to me what degree of consistency, or lack thereof, would qualify as more success or more failure. It seems to me that any outcome is actually a success if the process of repeating the study is undertaken rigorously. I don’t mean to complicate things, but I think it is another thing to think about in terms of the subtle messages being sent to researchers all over the world. I think this relates to Andrew’s blog post on replication from yesterday as well. We should try to set up an attitude of gaining knowledge through replicating studies, rather than a structure of success or failure that can seem like an attack on the original study.

I fear that I’m coming across in an annoying lecturing kind of way — it really just stems from years of frustrations working with researchers to try to get them to let go of arbitrary thresholds, etc. and seeing how hard it is for them to push back against how it is now built into the scientific culture in which they are trying to survive. I can vividly picture the meeting where I might discuss how I could work with a researcher to assess consistency of results between two similar studies, and the researcher pushes back with wanting to do something simpler like use your criteria and cites your project as the reason for doing so. This is the reality of what I see happening and my motivation for bringing it up.

A few weeks later, Dreber updated:

The blog post had an effect – the Brazilian team suggested that we rephrase the questions on the markets as “Will the replication of Study X obtain a statistically significant effect in the same direction as the original study?” – so we are avoiding the term “successfully replicated” and making it clear what we are asking for.

And Higgs replied:

I’m really glad it motivated some discussion and a change in wording. Given how far they have already come with changing wording, I’m going to throw another suggestion out there. The term “statistically significant” suffers from many of the same issues as “successful replication.” Given our past discussions, I know “p-value < 0.05” is being used as the definition of “statistically significant”, but that is not (and should not be) a broadly accepted universal definition. It’s a slippery slope, categorizing results as either significant or not significant (another false dichotomy) — and using a summary statistic that most statisticians agree should not be used in that way. So, my suggestion would be to go further and just replace “statistically significant” with what is meant by it, “p-value < 0.05”, thus avoiding another big and related issue. So, here is my suggestion: “Will the replication of Study X obtain a [two-sided?] p-value of <0.05 and a point estimate with the same sign as in the original study?” In my mind, the goal is to make it very clear exactly what is being done — and then other people can judge it for what it’s worth, without unnecessary jargon that generally has the effect of making things sound more meaningful than they are. I realize it doesn’t sound as attractive, but that’s the reality. This whole exercise feels a little like giving the ingredients for the icing on a cake without divulging the ingredients in the cake itself, but I do think it’s much better than the original version.

One great thing about blogs is that we can have these thoughtful discussions. Let’s recall that the topic of replication in science is also full of bad-faith or ignorant Harvard-style arguments such as, “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” I’ll keep pounding on that one because (a) it reminds us what we’re having to deal with here, and (b) it’s hard to figure out what the replication rate is if we haven’t even defined what a successful replication is.

21 thoughts on “Megan Higgs (statistician) and Anna Dreber (economist) on how to judge the success of a replication”

  1. To my mind this discussion is confusing two different issues:

    1) testing a previous claim made in the scientific literature
    2) understanding the phenomena that lead to the claim

    There’s real value in refuting a claim by the same method by which it was made. OTOH, for many claims – himicanes, shark attacks and the correlation of traffic fatalities with 0.5% drops on the DOW, 5-minute happy interventions – there’s not much benefit to more sophisticated analyses.

    By all means, if the original work has a shred of believability to the fundamental premise, more sophisticated analysis is a great approach. But much of the junk being published isn’t worth the effort of subtle analysis. It needs to be refuted by the methods by which it was made.

  2. It is a nice story of the evolution of thinking and revision of this exercise, influenced by discussion and learning.
    But – and there is always a “but”

    It makes me wonder more now whether this is a useful exercise at all. Should we care whether the “replication study” achieves a p value < 0.05 and a point estimate of the same sign as the original study? I’m willing to go along with this, provided that the “replication study” in question is really identical to the original study. Since it never is, the question is how close is the replication study to the original? Given the myriad researcher degrees of freedom and file drawers, what we will have is an “attempted” replication study, and if it does not achieve the specified results, the question will then be whether it is a failed replication or whether there was some difference in the makeup of the participants, time of day/season of the study, etc., etc. This makes me think that it will be fairly useless to think of this as a replication study at all.

    It seems to me that what we want is the following: Research X has found the result Y (with an effect size with a point estimate and some measure of uncertainty). We want to know if this effect holds up in another study or was it just a random finding. So, another research study attempts to match the setting of the first study – there are likely to be some differences, some perhaps important and others not so much. The second study comes up with an effect size and measure of uncertainty. It makes sense to compare the results of the two studies. But does it make sense to ask whether the sign of the effect is the same and whether the p value < 0.05? Perhaps that is part of what we want to compare, but we also want to compare the ways the new study differs from the first study. I don't see a good reason to elevate the point estimate and p value as more important than all the other ways in which the two studies differ. I would think we want a more comprehensive comparison of the two studies and a sense of how our initial understanding is changed as a result of the second study. I believe that is close to the original reaction of Higgs. While the new language is certainly better than the original, it makes me feel like the replication initiative is misguided.

    • “Since it never is…”

      If a study – more properly an experiment – can’t be replicated, why is it being done at all? If an experiment can’t be replicated, then for all practical purposes its results have been refuted. They have no scientific value.

    • Well, since this is in a prediction-market setting, I think the goal/interpretation is a bit different than what you state. If prediction markets are able to predict signs and p-values < 0.05 really well, then that’s pretty interesting and gives us some information about how differences in design/analysis affect results. It does seem from the above conversations / changes that they are moving away from failed/successful replication distinctions and into something more descriptive. This isn’t a study that will solve any debate on replications, but it will chip away at some questions like the predictability of consistent estimates.

  3. It seems to me that if you “repeat” an experiment and then don’t get a “significant result in the same direction”, you have not disproved the original. Your replication might, after all, be the one that is wrong (assuming that at least one of them must be). You cannot work with the data any further, since you cannot combine data that have different p-values, at least not without making too many assumptions and relying on the unpleasant statistical properties of p-values.

    If the second one were actually a close replica of the first, and the actual data were reported, then you could combine them, no doubt weighting the data sets by their standard errors, and then you would have improved your knowledge of the experiment and its outcomes. That’s what we really want, or should want, from the standpoint of moving the science along.

    If the “repetition” differed in some aspects, then a Bayesian analysis of the two sets of data would make a lot of sense.

    Just to say “Yea, p < .05, replicated!”, or “Boo, p > .05, not replicated!”, gives us very little new information compared to what might really be available.

    • The problem with the suggestion of a pooled estimate is that the first analysis may have arisen from publication bias or forking paths. This would make it inappropriate to simply weight by standard errors to arrive at a pooled effect estimate, and I don’t believe that there is a clear appropriate approach to pooling in this setting without making strong assumptions. Still, I agree that consistency metrics should be based on something meta-analytic in flavor. Maybe apply an Edlin factor to the original study and then pool?

      • Well, yes, that’s why you would need the actual data. I imagine that you wouldn’t want to try to reproduce someone’s exact garden of forking paths.
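
    A minimal sketch of the pooling idea discussed in this sub-thread: shrink the original estimate by an Edlin-style factor, then combine it with the replication estimate by inverse-variance weighting. The shrinkage value and the numbers are arbitrary illustrations, not recommendations, and the sketch still assumes that both studies estimate the same quantity:

        # Minimal sketch: shrink the original estimate by an "Edlin factor" to
        # discount publication bias and forking paths, then pool with the
        # replication estimate using inverse-variance weights. The factor of 0.5
        # is arbitrary, and one might also want to inflate the original standard
        # error; this sketch keeps it simple.
        import math

        def pool_with_edlin(orig_est, orig_se, rep_est, rep_se, edlin=0.5):
            shrunk_orig = edlin * orig_est
            w_orig, w_rep = 1 / orig_se**2, 1 / rep_se**2
            pooled = (w_orig * shrunk_orig + w_rep * rep_est) / (w_orig + w_rep)
            pooled_se = math.sqrt(1 / (w_orig + w_rep))
            return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

        # Hypothetical numbers: returns the pooled estimate and a 95% interval.
        print(pool_with_edlin(orig_est=0.50, orig_se=0.20, rep_est=0.10, rep_se=0.15))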

  4. > Should we care whether the “replication study” achieves a p value < 0.05 and a point estimate of the same sign as the original study?
    If the "we" are scientists the answer, is a definite no.

    If the "we" are folks wanting to run/participate in a prediction market, the answer is a definite yes.

    • The fact that P-values are so excellent for what is functionally gambling on a stock market kind of explains their popularity, in a kind of perverse way. Just enough predictability for some people to gain a monetary edge over others, but not enough to really have a solid understanding of any of the underlying processes.

      Maybe capitalism was the problem all along? :p

  5. The conclusion of the original study is based upon a p value and a determination of statistical significance, which are of dubious validity in epistemological terms and cause statisticians to bicker endlessly among themselves.

    The replication study must determine whether the effect still appears when using a different sample, which introduces additional uncertainty in rather murky ways about which statisticians bicker endlessly among themselves.

    The results of the replication are then compared to the results of the original study, but given the issues above, there is no epistemologically valid way to do this and statisticians bicker endlessly among themselves about whether it can be done at all.

    Have I accurately depicted the current state of affairs with replication? ;)

    • Interesting discussion. It seems hard to argue that replication isn’t useful to validate phenomena assessed by studies, but conceptually it’s just as fraught with ambiguity and risks as the underlying scientific claims, so you sort of have to pick a camp based on what you want to accomplish. When it comes to the congruence between the original work’s reliance on dichotomizing outcomes and the dichotomous framing of replication, one view seems more prescriptive, where we assume the replicate-or-not line of thinking derives from the same underlying cognitive tendency to dichotomize, like Higgs seems to suggest, but we try to discourage that tendency by using the replication discussion as a “teachable moment” (reminds me a little of Deborah Mayo’s thoughts on how some aspects of the statistical reform argument seem to further perpetuate fallacies). But another view is that it’s more conceptually coherent, given the original framework used, to treat it as replicated or not, as well as more efficient.

      Or, like Dale seems to be getting at above, we decide we want to prioritize reflection on the epistemic uncertainty driving why replication is an ambiguous concept in these discussions. But then we lose the crispness of being able to easily point to how many things don’t replicate, which presumably has some value in discussions of science reform.

  6. Higgs’s objection seems not to be that defining successful replication in terms of p-values is a poor methodological choice, but that, ethically, it sends the wrong message to researchers about what really matters. And she’s right, particularly given that the project’s design involves communicating with so many researchers.

    Where she’s wrong is in saying, “the goal is to repeat a study to gain more information.” Is it? Apparently not, because this project was clearly designed to test some hypothesis about prediction markets. For that purpose, the fairness or accuracy of defining a successful replication as p < .05 may be irrelevant, and not doing so could undermine the study. What if social scientists, after years of being trained to evaluate studies in terms of p-values, make their best predictions when the question is framed in terms of p-values? What if telling them that p < .05 is a successful replication motivates better predictions?

    So there's an ethical conflict that's not obvious on first or even second thought, but gives some of us an uneasy feeling anyway. What are the implications for trying to convince established researchers that the replication movement isn't just a bunch of petty second-guessing of others' research, when this project is literally studying how good we are at second-guessing others' research?

  7. Andrew’s previous post (https://statmodeling.stat.columbia.edu/2020/07/13/to-change-the-world-behavioral-intervention-research-will-need-to-get-serious-about-heterogeneity/) about a paper by Beth Tipton, Chris Bryan, and David Yeager discussing taking heterogeneous treatment effects more seriously may be of interest to people here. From the abstract:

    Rather, the variation in effect estimates across studies that defines the current replication crisis is to be expected, even in the absence of false positives, as long as heterogeneous effects are studied without a systematic approach to sampling.

  8. Why is everybody talking about p<0.05 in the replication? The p-value is about the specific dataset, not about the underlying truth, and as such it is not in itself a feature that makes sense to predict (and obviously it will depend on the sample size of the replication study, or is this assumed to be equal?).

    If we look at a confidence interval and say, these are all the parameter values consistent with the data, the collection of these parameter values from the original study will often allow for almost the full range of p-values in the replicated study to be consistent with the earlier p<0.05, at least if p was not much smaller, i.e. the confidence interval came close to zero (assuming of course that zero was the H0). It makes much more sense in my view to ask whether the estimated parameter value in the replication is in the prediction interval computed from the first study. One could be even more generous and say that replication is OK(ish), if the intersection of confidence intervals is non-empty (Is this really more generous? No time to check it mathematically right now… ultimately of course it may also depend on the precise model and test).

    A non-binary approach could be to put together a checklist of several such criteria (maybe also involving levels other than 0.05/0.95).

    Then of course there are also reasons to think that even in a well-done replication the underlying distribution may not be precisely identical to the original, which opens another Pandora’s box.
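
    A minimal sketch of the two checks mentioned above: whether the replication estimate falls inside a 95% prediction interval built from the original study (in the spirit of Patil, Peng, and Leek, 2016), and whether the two 95% confidence intervals overlap. Normal approximations are assumed throughout, and the exact formula in that paper may differ in its details:

        # Minimal sketch, normal approximations throughout; numbers are made up.
        import math

        Z95 = 1.96  # two-sided 95% normal quantile

        def in_prediction_interval(orig_est, orig_se, rep_est, rep_se):
            # Is the replication estimate inside a 95% prediction interval centered
            # at the original estimate, allowing for the sampling error of both studies?
            return abs(rep_est - orig_est) <= Z95 * math.sqrt(orig_se**2 + rep_se**2)

        def confidence_intervals_overlap(est1, se1, est2, se2):
            # Do the two 95% confidence intervals have a non-empty intersection?
            lo1, hi1 = est1 - Z95 * se1, est1 + Z95 * se1
            lo2, hi2 = est2 - Z95 * se2, est2 + Z95 * se2
            return max(lo1, lo2) <= min(hi1, hi2)

        print(in_prediction_interval(0.50, 0.20, 0.10, 0.15))        # True for these numbers
        print(confidence_intervals_overlap(0.50, 0.20, 0.10, 0.15))  # True for these numbers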
