This Friday at noon, join this online colloquium on replication and reproducibility, featuring experts in economics, statistics, and psychology!

Justin Esarey writes:

This Friday, October 27th at noon Eastern time, the International Methods Colloquium will host a roundtable discussion on the reproducibility crisis in social sciences and a recent proposal to impose a stricter threshold for statistical significance. The discussion is motivated by a paper, “Redefine statistical significance,” recently published in Nature Human Behavior (and available at
Our panelists are:
  1. Daniel Benjamin, Associate Research Professor of Economics at the University of Southern California and a primary co-author of the paper in Nature Human Behavior as well as many other articles on inference and hypothesis testing in the social sciences.
  2. Daniel Lakens, Assistant Professor in Applied Cognitive Psychology at Eindhoven University of Technology and an author or co-author on many articles on statistical inference in the social sciences, including the Open Science Collaboration’s recent Science publication “Estimating the reproducibility of psychological science” (available at
  3. Blake McShane, Associate Professor of Marketing at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance” as well as many other articles on statistical inference and replicability.
  4. Jennifer Tackett, Associate Professor of Psychology at Northwestern University and a co-author of the recent paper “Abandon Statistical Significance” who specializes in childhood and adolescent psychopathology.
  5. E.J. Wagenmakers, Professor at the Methodology Unit of the Department of Psychology at the University of Amsterdam, a co-author of the paper in Nature Human Behavior and author or co-author of many other articles concerning statistical inference in the social sciences, including a meta-analysis of the “power pose” effect (available at
To tune in to the presentation and participate in the discussion after the talk, visit the IMC website and click “Watch Now!” on the day of the talk. To register for the talk in advance, click here:
The IMC uses Zoom, which is free to use for listeners and works on PCs, Macs, and iOS and Android tablets and phones. You can be a part of the talk from anywhere around the world with access to the Internet. The presentation and Q&A will last for a total of one hour.
This sounds great.


  1. Martha (Smith) says:

    “For fields where the threshold for defining statistical significance for new discoveries is P < 0.05, we propose a change to P < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields. Results that would currently be called significant but do not meet the new threshold should instead be called suggestive. …

    We restrict our recommendation to claims of discovery of new effects. We do not address the appropriate threshold for confirmatory or contradictory replications of existing claims.”

    This criterion would probably "improve reproducibility" (at least, as defined by the new criterion), but leaves out a lot of important considerations (as often discussed on this blog). More is needed.

  2. Nick says:

    Just chiming in to say that Daniel Lakens is also first author on “Justify Your Alpha” (available here: – a response to “Redefine Statistical Significance”). Every other panel member had their p-value paper cited, so I figured this one should be mentioned as well.

    • Thank you. I see what you mean. Lakens et al. is the one I favor because it frames wider queries than several others do.

    • Anoneuoid says:

      It’s interesting to compare these two sites:

      The math/physics-oriented one is no nonsense and information dense. It works fine on slower computers and even without javascript. It is self contained and doesn’t need to pull any data from other domains. The homepage is 90 KB.

      The psych/bio-oriented one seems to be very concerned about appearances. There is tons of empty space, some of the elements that are present are unnecessary (eg a header that follows your scroll), and data is loaded from 8 different domains. This page is laggy on slow computers and is unprepared for issues like lack of javascript (or if there is a problem with any of those other 8 domains). The homepage is 4848 KB, that is >50x greater than the arxiv page while containing much less information and less robustness.

      I don’t really know what to conclude from that, obviously there are some parallels that could be drawn to the practices of each field…

    • Anoneuoid says:

      Also, figure 1 seems like an interesting idea (empirical plot of p-value vs percent of studies replicated). The optimal p-value for replication seems to be 0.01-.015. Meanwhile the next lower range: .005-.01 is looking very untrustworthy.

      If that 95% CI is one that contains the true value that would be very strange (it goes from 0-15%). Are they 95% confident (I still don’t know what that means) in this result?

      Anyway, I’d like to see more of these types of plots. I have a theory that “alpha is the expected value of p”, ie the cutoff is chosen so that the average p-value will be near significant, given the typical resources of the field.

      The idea is that the nil null hypothesis is always false and alpha is determined by the sample size, measurement noise, typical effect size, etc that the field experiences.

      At the two extremes:
      High alpha -> cheap, noisy studies with typically small deviations from the null
      Low alpha -> expensive, precise studies with typically large deviations from the null

      In the simplest version, the “% of studies that replicate” (both p-values below alpha), will be the square of the “% that fall below the average p-value” (which equals alpha). To show you what I mean, here is a small simulation (I have no hope for these “pre” tags, but will try anyway):

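      p = sapply(1:10000, function(i) {
        a = rnorm(30, 0.0, 1)   # group a: n = 30, mean 0, sd 1
        b = rnorm(30, .71, 1)   # group b: n = 30, mean .71, sd 1
        t.test(a, b)$p.value    # two-sample t-test p-value for this run
      })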

      avgp = round(mean(p), 4)
      perBelow = round(mean(p < mean(p)), 4)
      perRep = round(perBelow^2, 4)

      hist(p, breaks = 40,
      main = paste0("avg p-val = ", avgp,
      "\n% below avg = ", 100*perBelow,
      "\n% replicate = ", 100*perRep))
      abline(v = avgp, lwd = 3, lty = 2)

      The results:

      I set it up so that the mean p-value is ~ 0.05, sample size is 30, and typical effect size is .71*sd. In this case ~75% of p-values are below the average (“significant”), and so ~60% of these should replicate.

      Also, we can see you need to run 4 studies like this to have a ~99.6% chance of at least one significant p-value: 1-(1-.75)^4 ≈ 0.996.

      • Anoneuoid says:

        ~60% of these should replicate

        Not 60% of the already “significant” studies, but 60% of total studies. The former would just be the baserate ~75%. Replicate meaning two p-values under the significance level for these purposes.
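The arithmetic in this thread can be checked with a minimal sketch. It assumes, as the comment does, that each study independently falls below the cutoff with probability 0.75; the function names are illustrative and not part of the original simulation.

```python
def both_below(q: float) -> float:
    """Chance that an original study and one independent replication
    both land below the cutoff, when each does so with probability q."""
    return q ** 2


def at_least_one_below(q: float, k: int) -> float:
    """Chance that at least one of k independent studies lands below the cutoff."""
    return 1.0 - (1.0 - q) ** k


q = 0.75  # assumed share of p-values below the average p-value
print(f"both below cutoff:        {both_below(q):.4f}")
print(f"at least one of 4 below:  {at_least_one_below(q, 4):.4f}")
```

With q = 0.75 the first quantity is 0.5625 of all study pairs, which is the "percent of total studies" reading described above rather than a share of the already significant ones.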

  3. I would have included Sander Greenland’s view in his letter to Nature.

  4. psyoskeptic says:

    I used to respect EJ. This just builds on the exaggerated “CIs are terrible” paper. Stop listening to Lakens and Morey!! Neither can handle the subtlety required to do the right thing.

    Perhaps to put it in Gelman’s terms, this proposal fails to embrace variability. And that’s probably the most important change that needs to be implemented going forward.

    Also, when discussing this please discuss the N needed now for a 3% difference in a binomial judgment where you collect one judgment across people (typical response probability around 0.5).
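One way to put rough numbers on the sample-size question is the standard normal-approximation formula for comparing two proportions. The sketch below is only illustrative and rests on assumptions the comment does not state: two independent groups with response probabilities 0.50 and 0.53, a two-sided test, and 80% power. The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float, power: float = 0.80) -> int:
    """Approximate per-group sample size to detect p1 vs p2 with a
    two-sided test at level alpha (normal approximation)."""
    z = NormalDist()                       # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)     # critical value for the test
    z_power = z.inv_cdf(power)             # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


# Assumed scenario: a 3-point difference around a 0.5 response probability.
n_05 = n_per_group(0.50, 0.53, alpha=0.05)
n_005 = n_per_group(0.50, 0.53, alpha=0.005)
print(n_05, n_005, round(n_005 / n_05, 2))
```

Under these assumptions the requirement is on the order of 4,000 people per group at 0.05 and over 7,000 per group at 0.005; a different design or power target would change the figures.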

    • Andrew says:


      Your first paragraph is a bit harsh! All these writers seem open to discussion and considering alternative viewpoints. I disagree with some of their specifics but I do respect that people find these methods useful for some problems. Rather than “stop listening,” I’d recommend open and specific discussion, as for example here.

    • I’ve had many arguments with EJW [and Lakens, and Morey], but to say to stop listening to them seems wrong. I’ve had many disagreements with all of them, but I disagree that they lack the subtlety required to ‘do the right thing’. We all just disagree on what the ‘right’ thing is, or even where to apply the ‘right thing’. All sides in this debate need to be heard, and much of each person’s reasoning is sound in its own way. Some values differ, goals differ, considerations differ, etc.

      I disagree generally with heavy use of BFs (well, especially with ‘default’ BFs), so Morey, EJW, and I have argued quite a bit.
      I disagree generally with thresholds, so that has pitted me against Lakens and others pretty frequently.

      But they were still arguments worth having. Personally, I think the various papers all have their merits, and they all include subtleties.
      There were a lot of people on the Lakens paper who pretty well hate thresholds altogether as well [myself included]; others didn’t.

      E.g., the EJW/Benjamin paper in a nutshell: “If people gonna explore around and engage in bad practices, at least make it harder to do so, so that there’s /fewer/ false positives.” Critiques: “Meh, no. Kill those bad practices and deincentivize them. Not much evidence that p-values are the problem, per se.” “Meh, no; reinforces the use of dichotomous reasoning and will buttress the publication threshold problem.”
      The Lakens paper in a nutshell: “If you’re going to use thresholds, they should be justified in some way. A threshold for one field is not tenable for others. Use whatever method you want, but if you’re gonna set a threshold, justify it.” Critiques: “Kill all thresholds; use continuous reasoning and formal cost/benefit analyses.” “How do you actually justify a threshold, when the field has very few examples of anyone doing so?”
      Tackett/Gelman in a nutshell: “Stop using ‘significance’ altogether. There are so many other ways of supporting a claim and building knowledge. We could be rid of thresholds and be more mindful and get much further.” Critiques: “But that’s impractically hard for people who aren’t well-versed in stats.” I’m sure, others.

      I’m more in the Lakens/Tackett camp. Ideally, significance would be gone, and continuous logic should be used rather than some dichotomization that reinforces bad practices and goals. But practically – If people are gonna use thresholds, I want them to be used responsibly with an understanding of how QRPs and multiplicities will make threshold-crossing inevitable.

      There ARE subtleties in these things. There are also ideological vs practical battles in play, which drives some of these recommendations.

      Gotta listen to these people [and others]! Don’t shut them out because you disagree with them. I literally disagree with all the recommendations TO SOME DEGREE. You sorta have to realize these comments are short, short things – They don’t really have the word count to get particularly subtle.

    • Richard D. Morey says:

      I don’t know what you’re on about; I declined joining the .005 paper because I disagreed with the arguments re: p values.

    • EJ Wagenmakers says:

      Oh I will have to tease Richard with this…thanks for this
      Statistics boring? I don’t think so

  5. Well one consolation is that psychology seems to have inducted the best looking men. LOL JUST KIDDING I couldn’t resist.

  6. OOh! Just an FYI (may be worth an edit in the OP): Zoom has a linux client. When I see “PCs and macs” supported, that generally translates to “windows and macs”, but they do actually have a linux client.

    • Not sure which comment you’re referring to.

      If you’re replying to the one below – I’m just saying that if you’re on linux you can watch as well, it’s not limited to PCs (colloquially meaning windows), macs, or mobile OS’s. The video is shown through “zoom”, a conferencing program. I’m just saying that linux users (ubuntu/fedora/arch/suse/whatever) can watch too.

  7. Thank you. I figured as much. But I wanted to double check.

  8. Well, my paper on this topic has just been accepted by Royal Society Open Science. The final version is on bioRxiv

    It seems to me that (pace Gelman) the emphasis should be on the false positive risk, if only because a large proportion of users think, wrongly, that that’s what a p-value gives you.

    It is recommended that the terms “significant” and “non-significant” should never be used. Rather, P values and confidence intervals should be supplemented by specifying the prior probability that would be needed to produce a specified (e.g. 5%) false positive risk.

    We have provided a web calculator which makes the calculations simple. It’s at
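To illustrate the idea of a "prior needed for a given false positive risk", here is a minimal sketch using the simple screening-test form of Bayes' rule, in which a result counts as positive whenever p falls below alpha. This is an assumption made for illustration and is not necessarily the calculation used in the paper or the web calculator mentioned above; the function names and the example values (alpha 0.05, power 0.8) are hypothetical.

```python
def false_positive_risk(prior: float, alpha: float, power: float) -> float:
    """Share of 'significant' results that come from true nulls, given
    the prior probability that a real effect exists."""
    false_pos = alpha * (1 - prior)   # true nulls that cross the threshold
    true_pos = power * prior          # real effects that cross the threshold
    return false_pos / (false_pos + true_pos)


def prior_needed(target_fpr: float, alpha: float, power: float) -> float:
    """Prior probability of a real effect required to reach target_fpr,
    obtained by solving the expression above for the prior."""
    num = alpha * (1 - target_fpr)
    return num / (num + power * target_fpr)


# Assumed example values: alpha = 0.05, power = 0.8.
print(round(false_positive_risk(prior=0.1, alpha=0.05, power=0.8), 3))
print(round(prior_needed(target_fpr=0.05, alpha=0.05, power=0.8), 3))
```

In this simplified form, a prior of 0.1 gives a false positive risk of 0.36, and holding the risk to 5% requires a prior a little above one half.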

  9. Ayse Tezcan says:

    Unfortunately, I missed this talk, but thankfully Jonathan Eisen storified Richard Harris’ talk on this topic at UC Davis. I am posting the link FYI

    It is quite encouraging to see this growing interest in and attention to improving the quality of research in social and medical studies.

  10. Ayse Tezcan says:

    I am not sure if this was posted before, but there will be the 2017 Berkeley Initiative for Transparency in the Social Sciences Annual Meeting Dec 5-6 at Berkeley as well

  11. Elin says:

    Excellent session! They said the recording will be posted tonight sometime.
