“Using prediction markets to estimate the reproducibility of scientific research”

A reporter sent me this new paper by Anna Dreber, Thomas Pfeiffer, Johan Almenberg, Siri Isaksson, Brad Wilson, Yiling Chen, Brian Nosek, and Magnus Johannesson, which begins:

Concerns about a lack of reproducibility of statistically significant results have recently been raised in many fields, and it has been argued that this lack comes at substantial economic costs. We here report the results from prediction markets set up to quantify the reproducibility of 44 studies published in prominent psychology journals and replicated in the Reproducibility Project: Psychology. The prediction markets predict the outcomes of the replications well and outperform a survey of market participants’ individual forecasts. This shows that prediction markets are a promising tool for assessing the reproducibility of published scientific results. The prediction markets also allow us to estimate probabilities for the hypotheses being true at different testing stages, which provides valuable information regarding the temporal dynamics of scientific discovery. We find that the hypotheses being tested in psychology typically have low prior probabilities of being true (median, 9%) and that a “statistically significant” finding needs to be confirmed in a well-powered replication to have a high probability of being true.

I replied: I think the idea is interesting and I have a lot of respect for the research team. But I am not so happy with the framing of these hypotheses as being “true” or “false,” and I think that statements such as “the probability of being true” generally have no real meaning. Consider, for example, one of those notorious social priming studies, such as the claim that exposing people to elderly-related words causes them to walk more slowly. Or one of those silly so-called evolutionary psychology studies, such as the claim that single women were more likely to support Obama for president during certain times of the month. Yes, these claims are silly and were overhyped, but are they “false”? I think it’s pretty meaningless to even ask the question. Certainly the effects in question won’t be exactly zero; more to the point, the effects will vary by person and by scenario. It makes sense to talk about average effects, variation in effects, and the probability of a successful replication (if the criteria for “success” are defined clearly and ahead of time), but “the probability the hypothesis is true”? I don’t think so.

In summary I am supportive of this project. I think it’s a good idea and I’m interested in seeing it go further. I think they could do better by moving away from a true/false or even a replicate/not-replicate attitude, and instead think more continuously about uncertainty and variation. I don’t think it would be hard for them to move away from formulations such as “the probability that the research hypothesis is true” into a more sensible framing.

P.S. Robin Hanson offers thoughtful comments. I’m impressed by what Hanson has to say, partly because they are interesting remarks (no surprise given that he’s been thinking hard about this topic for many years), but more because it would be so easy for him to just view this latest project as a vindication of his ideas. But instead of just celebrating his success (as I think I’d do in this situation), he looks at all this with a critical eye. I might disagree with Robin about John Poindexter, but he (Robin) does good here.

15 thoughts on ““Using prediction markets to estimate the reproducibility of scientific research””

  1. This makes a lot of sense. But when one writes a paper, the point is usually that research hypothesis X is true. How should one frame the paper, given your comment above?

    • Shravan:

      Hmmm, let’s consider an example such as the claimed correlation between voting, marital status, and a woman’s time of the month. Instead of looking for a nugget of truth, I think researchers would be better off trying to study the pattern of variation.

      In practice, a key step is to understand what can actually be learned from data, and to do that we need to step away from the attitude by which statistical significance = discovery. That’s one reason I think that discussions of “power” miss the point, and also why I’m not so happy with the idea that statistical significance is the measure of a successful replication.

      • I agree with everything here: I am perfectly happy to step away from looking at statistical significance (that is, if reviewers and editors will let me do that), and issues of power then fall by the wayside too. However, what exactly does it mean to say researchers would be better off trying to study the pattern of variation? In the example you give, I could display the range of variation in different ways, through the estimates of a statistical model or by visualizing the raw data in different ways (or both). But what would I have to say about the research question? I think you are saying that the research question itself is irrelevant; I can understand that in this example.

        But there are other situations where the average effect could be important and have theoretical or policy implications. For example, if the government wants to decide whether to fund depression treatments with medication or counselling, we could end up wanting to find out whether treatment X is better than Y on average (I admit that understanding the variability here will be very important, so maybe this is not the best example; maybe it actually illustrates your point).

        In psych studies (controlled experiments), we are often comparing condition a with condition b, and the goal is to find out if, on average, a differs from b in the response. In psycholinguistics, for example, people will look at subject vs. object relatives to find out if one is harder to process than the other on average (different theories make different predictions about the exact locus of difficulty in the sentence). There’s a lot of variability, but that’s just noise from the perspective of the theoretical claims being investigated. What would be gained by focusing on the variation there, if the question is evaluating the theoretical claims?

        I wonder what you think of an intermediate position: the average behavior is worth investigating, but not without also taking a close look at the pattern of variability. In other words, one can accept that in some situations the mean behavior is of interest, but the variability is always of interest too, it’s not just noise to be ignored. So, in your voting example, both things (what the authors wanted to do, and what you want) have equal status.
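        One way to make the “average effect plus variation” tension concrete is a toy simulation (nothing here is from the paper; all numbers are invented) in which each subject has their own true effect drawn around a small average. When the variation is large relative to the mean, the average hides the fact that many subjects go the other way:

        ```python
        import numpy as np

        # Toy model: per-subject effects scattered around a small average.
        # The average (0.1) and spread (0.5) are invented for illustration.
        rng = np.random.default_rng(0)
        n_subjects = 50
        avg_effect = 0.1
        subject_effects = rng.normal(avg_effect, 0.5, n_subjects)

        print(f"estimated average effect: {subject_effects.mean():.2f}")
        print(f"subjects with an effect in the opposite direction: "
              f"{(subject_effects < 0).mean():.0%}")
        ```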

    • Shravan:

      What about moving to quantitative claims? E.g., “Exposure to elderly-related words causes people to walk x% slower than controls.”

      What the prediction market then bets on is not Yes vs. No but what the percent “x” is.

      • Rahul:

        Yes, I think that would be better. The next step then is for people to argue over the conditions of the experiment, but that’s already a step forward.

        I don’t really think that betting has to be involved at all—to me, betting is just one more layer of overhead—but I know some people like bets, and if this is a way of involving such people, fine.

        • One of the arguments in the paper is that the predictions made in the betting market were better/more accurate than predictions made (by the same subjects before the market opened) in a survey. There’s more overhead involved, but if betting makes the information better, then it seems worthwhile.

      • I’m under the impression that most prediction markets aren’t all that liquid, and binary options work OK in such an environment… If I understand this right, to figure out market views on ‘what percent is x’ you’d need options with a series of strikes (and probably separate puts and calls), which, I suspect, would need a lot more liquidity to function as a market.
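        For what it’s worth, here is a minimal sketch of how a ladder of binary contracts could in principle recover a distribution over “x”. The strikes and prices below are entirely hypothetical; the point is just that the price of a contract paying out if x > K can be read as the market’s probability that x > K, so differences of adjacent prices trace out an implied histogram:

        ```python
        # Hypothetical binary contracts "pays $1 if x > K" at a ladder of strikes.
        # Prices are invented; in a sufficiently liquid market each price is
        # roughly the market's probability that x > K.
        strikes = [0, 5, 10, 15, 20]              # "walks x% slower" thresholds
        prices = [0.90, 0.60, 0.30, 0.12, 0.04]   # hypothetical contract prices

        # Differences of adjacent survival probabilities give an implied histogram.
        for k_lo, k_hi, p_lo, p_hi in zip(strikes, strikes[1:], prices, prices[1:]):
            print(f"P({k_lo} < x <= {k_hi}) ~ {p_lo - p_hi:.2f}")
        print(f"P(x > {strikes[-1]}) ~ {prices[-1]:.2f}")
        ```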

  2. The bigger problem seems to be that people are focused on modeling/explaining average behavior, whereas they should be focused on modeling/explaining the (sources of) variability. As Spiegelhalter et al. say in The Norm Chronicles, the average is an abstraction; the reality is variation. That’s about the modeling; my first question above, about what the research paper is going to be about, still remains…

  3. >”Our criterion for a successful replication was a replication result, with a P value of less than 0.05, in the same direction as the original result.”

    How do they get from Pr(Observation or More Extreme | Null Hypothesis) to whether the research hypothesis is true? This seems to assume there are no other explanations for a deviation from the null hypothesis; they should have just stayed away from all this “true hypothesis” terminology and talked about replicating the effect, for whatever reason (a quick simulation after the quote below shows how often that replication criterion is met by chance alone). Just looking at some of these in table S4, I know there will be questions about how exactly they measured these things:

    >”Men who feel threatened in their faith of the political and economic climate of their country will show a greater romantic interest in women who are portrayed as embodying benevolent sexist ideals than in women who are portrayed as career oriented, party seeking, active in social causes, or athletic.”
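    As a sanity check on the base rate of that criterion: under a true null effect, a replication comes out “significant in the same direction” purely by chance about 2.5% of the time (alpha/2). A quick simulation (sample sizes invented, not from the paper):

    ```python
    import numpy as np
    from scipy import stats

    # How often does a pure-noise "replication" hit P < 0.05 in the
    # same direction as the original? (Toy setup; n is invented.)
    rng = np.random.default_rng(1)
    n_sims, n = 10_000, 50
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0, 1, n)   # condition A, true effect = 0
        b = rng.normal(0, 1, n)   # condition B, true effect = 0
        t, p = stats.ttest_ind(a, b)
        if p < 0.05 and t > 0:    # "significant, in the original's direction"
            hits += 1
    print(f"chance 'successful replications': {hits / n_sims:.1%}")  # ~2.5%
    ```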

    • Shravan, Anon:

      The good news is that I think the researchers in this replication project are moving away from null hypothesis significance testing, and I am hoping that in future iterations of this work they will grapple more seriously with questions such as: What is a replication?

      • The goal here is made quite clear:

        >”Apart from rigorous replication of published studies, which is often perceived as unattractive and therefore rarely done, there are no formal mechanisms to identify irreproducible findings. Thus, it is typically left to the judgment of individual researchers to assess the credibility of published results. Prediction markets are a promising tool to fill this gap, because they can aggregate private information on reproducibility, and can generate and disseminate a consensus among market participants.”

        They want to use collective opinion as a proxy for independent replication and testing of precise predictions. This is a continuation of the status quo:

        >”The probability of a tested hypothesis being true, also referred to as the positive predictive value or PPV (4), can however be estimated from the market price (Fig. 2). Using information about the power and significance levels…”

        I do not understand from the description in the supplement how they distinguish between multiple explanations for deviations from the null hypothesis.
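        For readers who want the arithmetic behind that quoted passage, here is a minimal sketch of the standard PPV calculation, in the style of the positive-predictive-value formula the paper cites as its reference (4). The power and false-positive numbers below are assumptions for illustration, not the paper’s values:

        ```python
        # Assumed replication characteristics (illustration only):
        power = 0.80   # chance of a significant same-direction result if H is true
        fp = 0.025     # chance of that under the null (P < .05 in one direction)

        def p_success(prior):
            """Marginal probability the replication 'succeeds'."""
            return power * prior + fp * (1 - prior)

        def ppv(prior):
            """P(hypothesis true | successful replication), by Bayes' rule."""
            return power * prior / p_success(prior)

        def prior_from_price(price):
            """Invert p_success: read a prior off a market price for 'success'."""
            return (price - fp) / (power - fp)

        for prior in (0.09, 0.50):   # 9% is the paper's reported median prior
            print(f"prior={prior:.2f}  P(success)={p_success(prior):.2f}  "
                  f"PPV={ppv(prior):.2f}")
        print(f"implied prior at market price 0.40: {prior_from_price(0.40):.2f}")
        ```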

      • Agree, “What is a replication?” is being treated very simplistically.

        In a meta-analysis of randomized clinical trials, I used to put it this way: one of the first tasks is discerning what _should_ be common, what should be common in distribution (e.g., a random effect), and what may be arbitrarily different (e.g., variances) between studies. One of the next tasks is to assess the support for these choices (or their failure) in a given set of data. Here one could focus on the likelihoods, and specifically on whether they are _revising_ priors similarly over the various studies (a toy sketch below makes the random-effects setup concrete).

        Clear definitive answers should not be expected…
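        To make the “common in distribution” idea concrete, here is a toy random-effects meta-analysis in the DerSimonian-Laird style; the effect estimates and variances are invented, and tau^2 is the between-study variance, i.e., how much the studies are allowed to differ while still counting as replications of a common effect:

        ```python
        import numpy as np

        # Toy study-level effect estimates and sampling variances (invented).
        y = np.array([0.30, 0.10, 0.45, 0.05])
        v = np.array([0.02, 0.03, 0.05, 0.01])

        w = 1 / v                                      # fixed-effect weights
        ybar_fixed = np.sum(w * y) / np.sum(w)
        Q = np.sum(w * (y - ybar_fixed) ** 2)          # heterogeneity statistic
        k = len(y)
        # DerSimonian-Laird estimate of the between-study variance tau^2.
        tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))

        w_re = 1 / (v + tau2)                          # random-effects weights
        ybar_re = np.sum(w_re * y) / np.sum(w_re)
        print(f"tau^2 = {tau2:.3f}, fixed = {ybar_fixed:.3f}, random = {ybar_re:.3f}")
        ```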

  4. I’m not sure the paper does frame the hypotheses in terms of “true” and “false”; rather, it’s in terms of true positives and false positives. Some statements certainly are true, and some apparently true statements that are ultimately discovered to be meaningless or untestable are false positives.
