
Participate in Brazilian Reproducibility Initiative—even if you don’t live in South America!

Anna Dreber writes:

There’s a big reproducibility initiative in Brazil on biomedical research led by Olavo Amaral and others, which is an awesome project where they are replicating 60 studies in Brazilian biomedical research. We (as usual lots of collaborators) are having a prediction survey and prediction markets for these replications – would it be possible for you to post something on your blog about this to attract participants? I am guessing that some of your readers might be interested.

Here’s more about the project and here is how to sign up:

Sounds like fun.


  1. Megan Higgs says:

    I agree it sounds like fun! But, … I was immediately curious about their criteria for declaring a study replicated. In a quick skim of the info in the google form, here it is:

    “In the survey of beliefs, you will be asked for (a) the probability that each of the 20 studies using each method will be successfully replicated (defined by the finding of a statistically significant result, defined by p < 0.05, in the same direction as in the original study in a meta-analysis of the three replications) and (b) the expected effect size for the replication.”

    Hmmm….Maybe a first step before going too far with this should be a deeper discussion of how to define "successfully replicated?"

    • Andrew says:


      I hadn’t read thru to the details . . . I agree that using p-values to define a successful replication is a bad idea. Maybe the people running this project can fix this before going any further.

      • Olavo says:

        Hi Megan & Andrew,
        Thanks for the thoughtful comment (and I’m one of the people behind the project)! We actually agree with you that the definition of replication as mere same-side significance is problematic, and for that reason our main replication outcome in the Brazilian Reproducibility Initiative is not that one, but rather a measure based on effect size – i.e. the original result being within the prediction interval defined by a meta-analysis of our 3 replications (see for further elaboration).
        That said, researchers in the life sciences are definitely not used to thinking much about effect sizes (see previous work by our own group on that at, for example), and most wouldn’t know what a meta-analytic prediction interval is. Thus, we decided to play with the crowd on the prediction project (it is about the wisdom of crowds, after all…) and go with a more standard definition for the markets.
        We are asking for effect size predictions in the survey as well, so this should give us insight into whether people can use that kind of reasoning. That said, we are unsure whether including an effect size-based market is worth it: based on the Many Labs 2, we think it might be confusing for the participants (see), although we are still discussing that point (and we’re glad to hear your suggestions!).

        • Anoneuoid says:

          There is something funny about calling it “wisdom of crowds”, when the crowd can’t be expected to understand the concept of an effect size being within a stated interval.

          Also, these intervals should be defined based on practical and systematic error concerns, not statistical.

        • Megan says:

          Thanks for the additional explanation. I think the idea for the project is really interesting and important and has the potential for great publicity related to the importance of replicating studies and the fun we can have doing it! It looks like when your team is assessing whether results from a study were “replicated,” you intend to use a more holistic approach — and you have time to work on the approach. The crowd sourcing component of the project seems a different —

          I generally agree (unfortunately) that “researchers in the life sciences are definitely not used to thinking much about effect sizes and most wouldn’t know what a meta-analytic prediction interval is.” However, I think there is danger in going with the more “standard definition for the markets” when we know there are serious flaws with that definition. It is sending a message to all those who don’t understand the details that it is fine to stick with the current standard and ignore the buzz they hear about its flaws; after all, the Brazilian Reproducibility Initiative is still basing their predictions on it! I’m not sure how the effect size prediction questions will be phrased in the survey and don’t doubt they can cause confusion, but the fact that there is so much confusion about this topic is important in its own right. Maybe it means we’re not quite ready to do such a crowd sourcing initiative without more education of participants first? I’m having a hard time wrapping my head around how useful the results will actually be given the currently proposed criteria. The usefulness of learning about the ability of researchers to predict successful replication definitely depends on the appropriateness of the criteria for success and a participant’s understanding of those criteria. It seems that you are focused on the point that the results won’t be useful if the criteria aren’t understood (so understood, but flawed, is better than more complicated and less flawed?). Either way, it seems like the risk of not getting something meaningful out of this part of the project is real. Could there be an education component for participants first, to train them about a set of criteria that better-informed researchers feel comfortable getting behind? This is a great opportunity to raise awareness of this issue and educate on the topic! Thanks so much for sharing broadly and opening up conversation.

          • Olavo says:

            Hi Megan,
            Thanks a lot for engaging in the discussion. I agree with the argument that the message sent by the project is important – although you may be overstating our visibility a bit, especially as we are aiming for somewhere between 100 and 200 participants in the markets. :)
            And yes, the reasoning behind our choice has mostly been that being clearly understood by the participants is more important than the exact definition of replication used – especially considering that no definition will capture all the relevant aspects of a replication study.
            That said, although we agree that there are serious flaws in the way people treat statistical significance as a binary outcome, we may disagree in how flawed p values actually are. I wouldn’t go so far as saying that demonstrating that a result would be unlikely if the true effect size were zero (at p < 0.05 or whatever threshold you choose) is useless.
            Although I get the philosophical point that no effect is exactly zero, and that, at the limit, the null hypothesis is a bit silly, by looking back at the failed replications in the psychology initiative, for example, there are many effects that are indeed very close to zero (see Camerer et al., 2018, for an example). There is also effect size inflation for the ones that replicate partially, but you could argue that those are two different problems, and that the first priority is to weed out the null (or very close to null) effects.
            Moreover, in looking at the biomedical literature, people do use statistical significance as a binary outcome, with little consideration for effect sizes. Although I agree that this approach is very far from ideal, this means that most conclusions in biomedical papers state that effects are different from zero at an alpha of 0.05, without claims on effect size. In this sense, achieving inferential reproducibility (i.e. reproducibility of a paper’s conclusions, as per Goodman et al., 2016) ultimately means verifying that these claims are accurate, rather than looking at whether effect sizes are similar.
            I still think that replicating the effect size is important – perhaps more so than finding an effect that is significantly different from 0 at a given threshold – but I do think that the two measures add complementary information to each other, and that neither one is fatally flawed or devoid of meaning. Personally, I am more bothered by arbitrary thresholds than by p values themselves – that said, for the markets to work, we can’t avoid choosing a threshold (and it made sense to go with the one that is used in all of the original publications).
            As I mentioned above, we are (a) using effect size measures as our primary outcome in the replications, (b) asking for effect size predictions in the survey, and (c) still discussing whether to include an effect size market. That said, previous experiences of our collaborators with continuous effect size markets suggest that people have a hard time understanding them – with the main evidence for that being that market prices are very poorly correlated with survey opinions in them. Thus, although these comments have heated up the ongoing discussion among us, effect size markets are still controversial at the moment.
            Thanks for the comments again, and it’s great to receive such thoughtful feedback. Ultimately, having to articulate our decisions is quite helpful for us (and it’s great to be questioning ourselves about them).

    • They could find an initial discussion of why not to use p-values to assess replication at the end here, along with some references: “Statistical significance gives bias a free pass”

  2. Just wanted to say that I have worked with Clarissa Carneiro and Olavo Amaral on evaluating differences between preprints and published papers in the life sciences and they were professional and awesome and they totally deserve your support.
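
For readers who want to see how the two replication criteria discussed in the thread can disagree, here is a minimal Python sketch. It is not the Initiative's actual analysis code. The assumptions are mine: a DerSimonian-Laird random-effects meta-analysis, a normal approximation for the pooled p-value, and a prediction interval of the form pooled estimate ± t × sqrt(tau² + SE²) with k − 2 degrees of freedom. All function names and all numbers are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def random_effects_meta(effects, std_errors):
    """DerSimonian-Laird pooling (an assumed method, not stated in the post).

    Returns (pooled estimate, its standard error, between-study variance tau^2).
    """
    k = len(effects)
    w = [1.0 / se**2 for se in std_errors]           # inverse-variance weights
    sw = sum(w)
    mu_fixed = sum(wi * y for wi, y in zip(w, effects)) / sw
    q = sum(wi * (y - mu_fixed) ** 2 for wi, y in zip(w, effects))
    c = sw - sum(wi**2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (se**2 + tau2) for se in std_errors]
    sws = sum(w_star)
    mu = sum(wi * y for wi, y in zip(w_star, effects)) / sws
    return mu, math.sqrt(1.0 / sws), tau2


def same_direction_significant(original_effect, effects, std_errors, alpha=0.05):
    """Market criterion: pooled effect has p < alpha and the original's sign."""
    mu, se_mu, _ = random_effects_meta(effects, std_errors)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(mu / se_mu)))
    return p_value < alpha and (mu > 0) == (original_effect > 0)


def original_within_prediction_interval(original_effect, effects, std_errors,
                                        level=0.95):
    """Effect-size criterion: original effect lies inside the prediction interval.

    Restricted to three replications, so the t distribution has 1 degree of
    freedom and its quantile has the closed (Cauchy) form tan(pi * level / 2).
    """
    if len(effects) != 3:
        raise ValueError("this sketch only handles exactly three replications")
    mu, se_mu, tau2 = random_effects_meta(effects, std_errors)
    t_crit = math.tan(math.pi * level / 2.0)          # about 12.71 for 95%
    half_width = t_crit * math.sqrt(tau2 + se_mu**2)
    return mu - half_width <= original_effect <= mu + half_width


# Hypothetical numbers: an inflated original effect and three smaller replications.
original = 1.2
rep_effects = [0.30, 0.45, 0.25]
rep_std_errors = [0.10, 0.12, 0.11]

print("same-direction significance:",
      same_direction_significant(original, rep_effects, rep_std_errors))
print("original within prediction interval:",
      original_within_prediction_interval(original, rep_effects, rep_std_errors))
```

With these made-up inputs the pooled replication effect is clearly nonzero and has the same sign as the original, so the market criterion counts it as replicated. The effect-size criterion does not, because the original estimate sits above the prediction interval. Note also how wide that interval is with only three replications: the 1-degree-of-freedom multiplier is roughly 12.7 rather than 1.96.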
