You are invited to join Replication Markets

Anna Dreber writes:

Replication Markets (RM) invites you to help us predict outcomes of 3,000 social and behavioral science experiments over the next year. We actively seek scholars with different voices and perspectives to create a wise and diverse crowd, and hope you will join us.

We invite you – and your students and any other interested parties – to join our crowdsourced prediction platform. By mid-2020 we will rate the replicability of claims from more than 60 academic journals. The claims were selected by an independent team that will also randomly choose about 200 for testing (replication).

• RM’s forecasters bet on the chance that a claim will replicate and may adjust their assessment after reading the original paper and discussing results with other players. Previous replication studies have demonstrated prediction accuracy of about 75% with these methods.

• RM’s findings will contribute to the wider body of scientific knowledge with a high-quality dataset of claim reliabilities, comparisons of several crowd aggregation methods, and insights about predicting replication. Anonymized data from RM will be open-sourced to train artificial intelligence models and speed future ratings of research claims.

• RM’s citizen scientists predict experimental results in a play-money market with real payouts totaling over $100K*. Payouts will be distributed among the most accurate of its anticipated 500 forecasters. There is no cost to play the Replication Markets.

Our project needs forecasters like you with knowledge, insight, and expertise in fields across the social and behavioral sciences. Please share this invitation with colleagues, students, and others who might be interested in participating.

Here’s the link to their homepage. And here’s how to sign up.

I know about Anna from this study from 2015 where she and her colleagues tried and failed to replicate a much publicized experiment from psychology (“The samples were collected in privacy, using passive drool procedures, and frozen immediately”), and then from a later study that she and some other colleagues did, using prediction markets to estimate the reproducibility of scientific research.

P.S. I do have some concerns regarding statements such as, “we will rate the replicability of claims from more than 60 academic journals.” I have no problem with the 60 journals; my concern is with the practice of declaring a replication a “success” or “failure.” And, yes, I know I just did this in the paragraph above! It’s a problem. We want to get definitive results, but definitive results are not always possible. A key issue here is the distinction between truth and evidence. We can say confidently that a particular study gives no good evidence for its claims, but that doesn’t mean those claims are false. Etc.

41 Comments

  1. Anonymous says:

    I never understood this prediction market stuff, and it only seems to be getting bigger and more influential.

    I reason it’s possibly a way towards influencing what to replicate and spend resources on. It’s essentially a way to “direct” science.

    If i understood things correctly, “The claims were selected by an independent team that will also randomly choose about 200 for testing (replication)”. It seems to me that the “predictability” (and thus the possible subsequent “conclusions” and “utility”) of all this “prediction market” stuff can be highly dependent on the specific sample of studies chosen!?

    All in all, i reason this prediction market stuff is not in line with science, as i reason 1) it in itself is unscientific (i.e. science is about actually replicating things, not guessing which findings replicate), and 2) it can subsequently be used to “steer” towards certain conclusions and proposals concerning the decision of what to replicate and where to spend resources.

    In my reasoning, it’s just another thing that can be manipulated, and where “meta scientific” research is given way too much attention and influence. Just because you “showed” something via “meta scientific” research is NOT a reason to subsequently do something according to its conclusions. It looks all scientific, but it’s NOT in my view and reasoning. There are many things to consider concerning replication, and one of them could be that you let scientists themselves decide what they view as worthy of attention and resources.

    I just watched a lot of Chris D’elia (a comedian) material last night. In the words of Chris D’elia: “No “prediction markets”, just no. Why don’t you put on a vest with lots of pockets in it, and go take a hike.”

    • Anoneuoid says:

      In one of the papers about replication markets they come out and say it: Most researchers do not like replication experiments. So they are trying to substitute replications with something else.

      To me that is an admission these fields are now populated by people who do not like science…

      • Anonymous says:

        Quote from above: “So they are trying to substitute replications with something else.”

        Keep an eye out for other ways to substitute replications, like another upcoming “collaborative” or “crowdsourced” project concerning finding an “algorithm” that predicts “replicability” (or something like that). A project that (of course!) costs lots of money, and uses lots of resources, and gives attention and benefits to a small group of people, again.

        It’s basically the same stuff as this prediction market stuff, with a lot of the same possible problematic issues i mentioned above.

        I find it not surprising (therefore) that (at least some of) the same people seem to be involved with it…

        (Side note: i heard “transparency” is the new “hip” thing. In light of this, i am wondering more and more whether these large scale “collaborative” and “crowdsourced” efforts have ever been “open” and “transparent” concerning what exactly all this money is being spent on.

        Are these large scale “collaborative” projects, that we all should contribute to for some reason, also paying for the salaries of the people that come up with, and “manage”, all these “collaborative” efforts for instance? Are there any documents that list the exact money that was granted, and how this is being used? Is there a detailed overview somewhere concerning the exact allocation of parts of this money, including things like possible salaries, additional benefits like free lunches and standing desks, etc. in their “collaborative” institutions?)

        • Michael Bishop (@thatMikeBishop) says:

          We are striving to be transparent; for example, we are pre-registering research hypotheses and will be creating a dataset which other researchers will be able to analyze.

          • Anonymous says:

            Quote from above: “We are striving to be transparent”

            I specifically wondered about what exactly all this money is being spent on. Also with regard to the possible salaries of all these “managers” of “collaborative” efforts, and “directors” of institutions. And also with regard to the possibility of universities, and other institutions, directly receiving parts of grants (for “reasons”).

            I think transparency concerning that stuff could be just as important currently (and in the near future) as pre-registration.

            Also see “Does Psychology have a conflict of interest problem?” https://www.nature.com/articles/d41586-019-02041-5

            • Most of the money for *most* grants goes to pay the salaries of the Principal Investigators and their staff… All government grants have an additional 40-60% more than the amount given to the PIs which is paid to the institution for “overhead” (basically supposed to be things like building maintenance, network maintenance, equipment maintenance, purchases of new equipment, paying salaries of administrators etc). The rate is negotiated on a per-institution basis (or for things like Univ. of CA for the whole UC system together).

              So, yes, grants are all about paying people’s salaries, buying them equipment to do research with, and paying their institution’s janitors, administrators, accountants, purchasing managers, and whatnot.

              • Anonymous says:

                Thank you so much for this information! I only recently came to know of the possibility of this all, and it is quite shocking to me.

                I don’t think it’s “fair” for tax-payers’ money, for instance, to be spent this way.

                I also think this can lead to researchers asking for way too much money because they (indirectly) benefit from it.

                The fact that this is all NOT discussed in all the recent “let’s improve science” and “let’s change the incentives” discussions seems highly strange, and highly problematic, to me.

                Thank you again for your comment.

              • Well, if you want people to do research you will have to pay them. The question becomes how good is the quality of the portfolio of research being conducted? If you pay taxes to get high quality roads you are less likely to be upset than if for example the road maintenance companies buy themselves lots of fancy equipment and then complain “well, we didn’t have enough left over to actually fix any roads this year…” for example.

                It’s the same with science, if scientists are asking important questions and using reliable actually scientific methods then we might often say our money was well used. If instead we have lots of “fancy method X is better than our competitors stupid method Y of solving irrelevant problem Z that we solved just because our method X could solve it rather than because anyone cared about it” or “cancer X is caused by oncogene Y in useless poorly conceived study Z of badly misunderstood observational data p less than 0.05” then yeah, it’s not a good use of money.

                The question is how much of the various kinds of good vs crap are being done? My unfortunate conclusion is far too much crap.

              • Put another way, the problem with focusing on “replication” is that there’s plenty of “not even wrong” science. Science where whether the conclusion is correct or not is basically irrelevant to anything of any good purpose. Suppose that fatter armed male college students do have a certain voting preference or that women’s political opinions are influenced by monthly hormonal cycles, or that knocking out a certain fly gene increases fly life duration by 2.3% or whatever…. So what?

                So much of science is stamp collecting a bunch of transiently true facts with basically zero purpose… in 20 years fat armed college males will have different opinions, women who will have different life experiences and careers and different parental life experiences etc will have different political responses to hormonal cycles, and the genetic drift of laboratory fruit flies will have selected for a slightly different longevity… etc

                To the extent that science matters it should accumulate information that is stable and capable of consistently improving our understanding and prediction of outcomes in the world, and those predictions should be of interest, useful, improving the lives of humans in some way. Too much of science starts out failing to even ask an important question in the first place I guess.

                Here is the current issue of Nature: https://www.nature.com/nature/volumes/572/issues/7768

                Decide for yourself whether any of that research is dramatically improving our understanding of anything important.

                Consider that typically these articles cost somewhere between $100k and $1M to produce. A single issue of Nature is something like a $10M research enterprise… If you were in charge of $10M would you buy those findings?

      • Anonymous says:

        Quote from above: “Most researchers do not like replication experiments”

        What’s also possibly interesting, and/or useful, to think about is that projects like the “prediction market” may actually be “bad” for things like “replicability”, and perhaps more importantly for science and the scientific process. This is because in my view and reasoning, they are doing things “backwards”, and are part of putting the emphasis on “replicability” and not so much on “validity” and/or “doing good science”.

        Emphasizing the “replicability” part, but not the “how and why to actually do good science” part, seems problematic to me. Furthermore, “replicability” does not necessarily mean it’s “valid”. If you only talk about “replicability” you risk losing track of (other) crucial things in science: like “validity”, or theoretical stuff, or trying to find constructs and variables that better explain things, etc.

        This all is doing things “backwards” in my reasoning, and not tackling the source of things. You should in my view and reasoning primarily be talking about how to do “good science” first. Then this “good science” should subsequently determine what to replicate, and build on.

        I can totally see how this all could contribute to having a small group of “hip” researchers doing “bad” science, and a “crowdsourced” group of “replicators” subsequently performing replications of all this “bad” work because the “prediction markets” or an “algorithm” somehow marked it as being worthy of attention and resources (science!). It could basically contribute to steering science, and scientists, in a certain direction. I think that’s where all this stuff could lead.

        • Anonymous says:

          Quote from above: “This all is doing things “backwards” in my reasoning, and not tackling the source of things. You should in my view and reasoning primarily be talking about how to do “good science” first. Then this “good science” should subsequently determine what to replicate, and build on”

          I just read this in the “prediction market paper” from 2015: “Moreover, prediction markets could potentially be used as so-called “decision markets”(30, 31) to prioritize replication of some studies, such as those with the lowest likelihood of replication.”

          No, just no! This is what i mean about doing things “backwards”. This all doesn’t make much sense to me. Leaving aside the validity of the prediction market estimation of “replicability”, i reason that you shouldn’t be replicating stuff that has a low likelihood of replication?!

          You shouldn’t be verifying everything that gets published in my view of science, especially when you are not primarily trying to prevent “bad” science from being published in the 1st place. Instead, i reason you should focus on, and perform, “good” science, and use it to build theories, find more appropriate constructs and variables that are better at explaining stuff, etc.

          • Anonymous says:

            Quote from above: “I just read this in the “prediction market paper” from 2015: “Moreover, prediction markets could potentially be used as so-called “decision markets”(30, 31) to prioritize replication of some studies, such as those with the lowest likelihood of replication.”

            I can clearly remember a few years back, when “we” all were being told that Psychological Science may have “jumped to conclusions” in the past, and “should have been more careful with their conclusions and recommendations”.

            I am getting more and more annoyed with so-called “meta scientific” research that posits conclusions, or recommendations (that are sometimes even implemented) based on a few, and sometimes even just a single, paper or study.

            I reason “meta-scientific” research can be flawed, manipulated, fake, or whatever, just like “normal” research. I think it’s very important to be careful concerning conclusions that follow from this type of research. Perhaps even more so because these conclusions can influence certain things that could be implemented and could influence how science is being done!

            You’re not “special” in that regard dear meta-scientific research, you’re just not.

    • Anonymous, thanks for the feedback and questions. I hope this helps – it looks like we should have provided some more context in the invitation.

      The quick take: SCORE seeks to get reliability scores for published studies. Because studies lead to policy. A good candidate for reliability is chance of successful replication. In four previous studies, scientists using markets and surveys were about 70% accurate in predicting replication (markets a bit more, surveys a bit less). That’s valuable information. SCORE will test those methods again, at scale (our project and one other), and then see how much can be automated (Task Area 3, awards not yet announced that I know of). Could this be used to prioritize future replications: I sure hope so!

      Is forecasting scientific? Can be: see Tetlock’s Superforecasting work for some of the best.

      Replacing replications with forecasts? No, we like replications. My colleagues have been working with Center for Open Science to get science to do MORE replications. Forecasting chances can help 3 ways: (a) encourage replication by replacing yes/no thinking, (b) give info on current reliability, for decisions before replications appear, (c) give info to prioritize replications. But replications are expensive, so makes sense to prioritize. Each lab may have a different priority function, but most should consider probability in their value of information.

      [It’s not probability, it’s crowd woo-woo]: There’s an awfully large literature showing reliability of the methods in general, and four studies showing good results on this particular task. It’s not magic, and it’s not infallible, but there are excellent reasons for thinking that if there’s a signal out there, using the best ensemble techniques gives you the best forecast.

      Trying to substitute: no. See above.

      Algorithm to predict replicability: yes. As some of team will present at Metascience 2019, humble P-value already covers a lot of the information, and participants in previous markets didn’t fully use it. Our markets will use that as starting price.

      Costs a lot of money: yes, our project is about $660 per forecast right now in the high-risk research stage – mostly salaries. I estimate a full replication costs about $10K. So for the cost of 1 replication, we can do 15 forecasts (and far faster). That might let you choose a better replication. But my goal is to get costs below $100/forecast, so it’s attractive to journalists/publishers/policymakers, and about 6x cheaper per bit of information.

      Don’t understand/like markets: You can use other aggregation techniques to get good forecasts. See for example the recent paper by Dana, Atanasov, Tetlock and Mellers. (http://journal.sjdm.org/18/18919/jdm18919.html) In the team’s four previous studies on forecasting replications, the markets edged out the unweighted surveys, but not by much. We’ll be testing that again, and trying some better weighting schemes.

      Bad science: Um… the whole point of replication is to reduce that.
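      For readers unfamiliar with forecast aggregation, here is a minimal sketch of two pooling rules of the kind mentioned above: an unweighted average of forecaster probabilities, and an average taken in log-odds space and then “extremized”. The function names and the extremizing parameter `a` are illustrative assumptions, not anything from the RM platform.

```python
import math

def mean_prob(probs):
    """Unweighted linear average of forecaster probabilities."""
    return sum(probs) / len(probs)

def extremized_logit_mean(probs, a=2.0):
    """Average in log-odds space, then 'extremize' by factor a.

    Extremizing pushes the pooled forecast away from 0.5 to correct
    for forecasters who individually hedge toward 50%."""
    logits = [math.log(p / (1 - p)) for p in probs]
    pooled = a * sum(logits) / len(logits)
    return 1 / (1 + math.exp(-pooled))

# Three forecasters lean toward "will replicate":
probs = [0.6, 0.7, 0.65]
print(round(mean_prob(probs), 3))              # 0.65
print(round(extremized_logit_mean(probs), 3))  # pushed above 0.65
```

      How much to extremize (the parameter `a`) is an empirical weighting question of exactly the kind the comment says will be tested against unweighted surveys.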

      • Anonymous says:

        Thanks for your extensive reply. I have a little trouble digesting it all, as i feel it does not clearly tackle the issues i addressed. This is not helped by the introduction of new names and terms i am not familiar with like “COPE”, and “Metascience 2019”, and “SCORE”, and “superforecasting”. It all sounds very impressive though!

        I then started to read about “superforecasting” and read that a review said it was the most important book on decision making since “Thinking fast and slow” by Kahneman. I then stopped trying to find out what it is, as i think “Thinking fast and slow” is riddled with examples of, and conclusions based on, “bad science”.

        Perhaps i can start with where you ended: “Bad science: Um… the whole point of replication is to reduce that.”

        1) How does replication in general help to reduce “bad” science?
        2) How does the “prediction market” project help to reduce “bad” science?

        • OK, but you really should look harder at some of the references. It’s a bit premature to stop reading Superforecasting because a reviewer compared it to a book that summarized many studies some of which we now know were wrong. Plenty of high-quality papers if that’s your preference.

          So, answers to your two questions, using “replication” as the attempt to replicate:

          1. Science is about testing one’s ideas. Not just once: some results are spurious. Replication reduces bad science by finding spurious results. No single replication is definitive, but if it’s done with a more powerful “telescope”, and fails to find anything, then the original result can’t be supported by the original study. (Uri Simonsohn’s analogy, hope I got it right.)

          2. So high-quality replications give a strong signal about truth. But there are other signals: original sample size, presence or absence of pre-registration, wildness of claim, study design, analysis quality, conflict or consilience with other studies, results of previous replications, or even insider knowledge about the experiment. Each reader will use some of these. If we could combine them all, we’d have a better signal. We’re building that.
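          As a purely hypothetical illustration of “combining signals” (the feature names, weights, and bias below are invented for this sketch, not RM’s actual model), one simple way to pool such signals into a single probability is a logistic model:

```python
import math

def p_replicate(features, weights, bias):
    """Toy logistic model: combine replication 'signals' into one probability.

    Feature names and weights are invented for illustration only."""
    z = bias + sum(weights[k] * v for k, v in features.items())
    return 1 / (1 + math.exp(-z))

# Hypothetical weights: larger samples and pre-registration raise the
# probability, a "wild" claim lowers it.
weights = {"log_n": 0.5, "preregistered": 1.0, "wild_claim": -1.5}
study = {"log_n": math.log(200), "preregistered": 1, "wild_claim": 0}
print(round(p_replicate(study, weights, bias=-2.5), 2))
```

          In practice the weights would have to be fit to the outcomes of actual replications, which is presumably where a dataset of resolved forecasts becomes useful.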

          • Anonymous says:

            Quotes from above:

            1) “Replication reduces bad science by finding spurious results.”

            No, it does not. At least not in my reasoning. You can possibly (begin to) spot “bad science” (and that’s even debatable) by doing replications, but i don’t see how replication (optimally) reduces “bad science”. This is part of my issue with this focus on “replicability” and projects like the “prediction market”: it reinforces/emphasizes “putting the cart before the horse” and/or is “a band-aid solution”.

            2) “But there are other signals: original sample size, presence or absence of pre-registration, wildness of claim, study design, analysis quality, conflict or consilience with other studies, results of previous replications, or even insider knowledge about the experiment. Each reader will use some of these. If we could combine them all, we’d have a better signal. We’re building that.”

            I am finding it very hard to know exactly what your project does (e.g. the site is terrible at giving information about what i would be doing if i joined). Please correct me if i misunderstood things in the following:

            First of all, a sentence like “If we could combine them all, we’d have a better signal” seems possibly inaccurate to me.

            Secondly, if i understood things correctly the “prediction market” lets participants “predict”/”invest in” findings on the basis of giving them JUST the hypotheses of studies (and perhaps sometimes additionally an effect size or p-value). If this is correct, how can you be finding out which variables might be important concerning “replicability” or concerning the decisions of the participants? Are you asking participants what specific information they used to make their decisions? And if so, are participants even capable of giving an accurate answer to such a question?

    • Michael Bishop (@thatMikeBishop) says:

      Dear Anonymous, Past surveys and prediction markets have been successful (~70% accuracy) at identifying which studies will “successfully replicate.” Accurate forecasts alone are a valuable source of scientific knowledge, and accurate forecasts are only one of the things we hope to achieve. The Replication Markets team sees forecasting the result of replication studies as a valuable complement to completing scientific replications. Our work makes their work more valuable, and vice versa.

      • Anonymous says:

        Quote from above: “Past surveys and prediction markets have been successful (~70% accuracy) at identifying which studies will “successfully replicate.” Accurate forecasts alone are a valuable source of scientific knowledge, and accurate forecasts are only one of the things we hope to achieve. The Replication Markets team sees forecasting the result of replication studies as a valuable complement to completing scientific replications. Our work makes their work more valuable, and vice versa.”

        I don’t agree with (the gist, and/or implications, and/or conclusions) of this for a few reasons:

        1) I reason the 70% successful identification of “successful replication” is a number that does NOT necessarily tell you anything about successfully using a prediction market on a different set of studies. These findings may be heavily dependent on the specific sample of studies chosen. Subsequently using a prediction market as a forecast market concerning a whole different set of papers seems possibly problematic to me.

        2) I believe your focus in “replication” is on the “statistical significance” (and/or “direction of the effect”), which i think might be way less important than, for instance, the effect size. The entire project to me, including the hypotheses that seem to be the main input of the “forecasting”/”guessing”/whatever, emphasizes something that may not be very important at all!

        3) On a more philosophical (and perhaps ethical) note, i sincerely question whether it is “scientifically sound” to (try and) direct science in the way you are attempting to do. I do NOT think an algorithm, or prediction market, should guide scientists on which replication studies to perform. I think this should come from a combination of several things: like the specific interest of the researcher, the strength of previous studies, etc.

        • Anonymous:

          1) At this point you’re making general objections to a possible study you have imagined. Please read the actual previous market studies at http://www.citationfuture.com. The 2018 paper is probably the best place to start.

          2) Agreed, effect size is more important, but there are all sorts of tradeoffs in running a large replication. I refer you to the Center for Open Science for a fascinating series of discussions on the methodology of replications, and how they have evolved theirs over time. I think you’ll be impressed.

          3) Happily I’m in no position to impose central planning. But each lab, researcher, or organization doing replications can benefit from including P(replicate) in their own decisions of what is most promising to pursue. (And “strength of previous studies” is exactly what we’re estimating.)

          • Anonymous says:

            Quotes from above:

            1) “1) At this point you’re making general objections to a possible study you have imagined. Please read the actual previous market studies at (http://www.citationfuture.com). The 2018 is probably the best place to start.”

            Ehm, if i am not mistaken i was quoting what someone else had written (?). I am not sure what i am possibly imagining. Is it the 70% number? If so, i quoted that number from someone else’s comment. More importantly though, i am making general remarks that have little to do with the specific percentage of successfully predicted replications. Nor do i think i need to look at all the papers in detail merely to reason, and try to make a point. A point you do not seem to address in your reply.

            2) “Agreed, effect size is more important, but there are all sorts of tradeoffs in running a large replication. I refer you to the Center for Open Science for a fascinating series of discussions on the methodology of replications, and how they have evolved theirs over time. I think you’ll be impressed.”

            So, if effect sizes could be more important from a scientific perspective, why are you focusing on the statistical significance level in replications? And how do you think your project influences (debates about) a focus on things like effect size? I think your project gives too much attention to “statistically significant” findings, and not effect sizes for instance. And i also think your project gives way too much (uncritical?) attention to “large replications”.

            I am not sure where to look for this “fascinating series of discussions on the methodology of replications” at the Center for Open Science. I can tell you that i think i will probably NOT be impressed by it. This is based on most things that specific institute has done thus far.

            3) “3) Happily I’m in no position to impose central planning. But each lab, researcher, or organization doing replications can benefit from including P(replicate) in their own decisions of what is most promising to pursue. (And “strength of previous studies” is exactly what we’re estimating.)”

            It’s perhaps not just about whether you are in the position to impose central planning, but also about whether your project could lead to other people proposing or imposing such central planning. If i was a betting man, and while we’re all talking about “prediction”, i would bet that i have very, very little doubt that this “central planning” is exactly what will be proposed! All in the name of “collaboration”, and “crowdsourcing”, and “improving science”, and doing something about “the incentives”, of course!

  2. Anonymous says:

    Quote from the blogpost: “We actively seek scholars with different voices and perspectives to create a wise and diverse crowd, and hope you will join us.”

    This also annoys me more and more. No, “different voices and perspectives” do not necessarily lead to “a wise” crowd.

    And of course, this “meta scientific” project again concerns “crowdsourcing”, and other ways to direct science and scientists. I am baffled more and more by how these “meta scientific” researchers seem to find no problem in repeatedly asking lots of people to help them out, and using lots of resources, for projects that only they themselves thought of, and probably reap lots of benefits from. All in the name of “collaboration”, and “improving science”, of course!

    I get the feeling “crowdsourcing” is being put forward as the new way to do science (as an aside: usually without much solid evidence or reasoning for proposing that), but i fear that it will lead to emphasizing, and replicating, many of the problematic issues that have plagued science in the last decades.

    “Crowdsourcing” is more like “crowding out” and “outsourcing” to me (see Binswanger, 2014, “Excellence by nonsense: The competition for publications in modern science”). I fear this will all lead to a small group of people (attempting to be) directing things, while the large group of people will become more and more incapable, subservient, and unknowing of what actually goes on.

    • Anonymous says:

      Quote from above: “I fear this will all lead to a small group of people (attempting to be) directing things, while the large group of people will become more and more incapable, subservient, and unknowing of what actually goes on.”

      The latter part (“(…) while the large group of people will become more and more incapable, subservient, and unknowing of what actually goes on”) is something i have wondered about lately. Without wanting to be overly harsh, i reason (parts of) social science have been, and are currently being, occupied by (relatively) incapable, subservient people that may not know what actually goes on.

      I have thought about this, and likened it to sports. In sports you can see who is the best. I beat you, i win the tournament, i become no. 1 in the world rankings, etc. That’s clearer than in science. In science, this evaluation is not similar, i reason. In science it’s hard to “objectively” determine who the “best” people are that should be in science/academia.

      Now if incapable people become involved in academia/science more and more, I think a process can occur that makes the overall quality of science and scientists worse over time. This is due to the incapable people picking out their PhD students, a less and less critical environment, a higher chance of fundamental scientific values and principles not being acknowledged, etc.

      I think it’s basically similar to the gist of the paper by Smaldino & McElreath (2016), “The natural selection of bad science”, only with a shift in focus to the characteristics of the people who are actually responsible for the “bad science”. Perhaps that could be a nice idea for a new paper: “The natural selection of bad scientists”.

      I think this is exactly what happened in (parts of) social science in the last decades. It may sound harsh, but I have said it here before: if some social scientists want to improve their science, perhaps the best thing for them to do is quit and find another job.

    • Alex says:

      They could be referring to ‘wise’ in the ‘wisdom of the crowd’ sense https://en.wikipedia.org/wiki/Wisdom_of_the_crowd . In psychology research that I’m aware of, a group’s average estimate of something tends to be more accurate than any particular person’s estimate of that thing. But that is only/mostly true if the estimates from each person are independent; that is, the crowd should have ‘different voices and perspectives’. If the whole crowd has similar perspectives and thus similar estimates, the crowd is more likely to be biased away from accuracy.
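      A minimal numeric illustration of that independence caveat (all numbers invented): averaging many independent guesses cancels their private noise, but a bias shared by the whole crowd survives the averaging untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

truth = 100.0                 # the quantity being guessed (made up)
n = 50                        # crowd size

# Independent crowd: each error is private noise, so it averages out.
independent = truth + rng.normal(0, 20, size=n)
crowd_error = abs(independent.mean() - truth)
typical_individual_error = np.abs(independent - truth).mean()

# Correlated crowd: everyone shares the same bias (say, from reading the
# same misleading article), which no amount of averaging removes.
shared_bias = 15.0
correlated = truth + shared_bias + rng.normal(0, 20, size=n)
biased_crowd_error = abs(correlated.mean() - truth)
```

      With independent errors the crowd mean’s error shrinks roughly like σ/√n, while the correlated crowd’s error stays near the shared bias no matter how many forecasters join.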

      • gec says:

        Yes, this is exactly the principle at work here (going back to Galton’s famous example of guessing the weight of an ox), and as you say it does work provided the crowd consists of sufficiently independent members. Indeed, it is sometimes possible for the group average to be better than even the best single individual!

        The problem is detecting correlations between individuals, though. Most groups that have done this successfully (there was a big project from either DARPA or IARPA a few years ago on this that I believe is still going on) use a variety of “check” questions from outside the target domain to build a model of how correlated their individual crowd members are. But I don’t see anything like that in this Replication Markets project—all the questions are about binary claims from scientific publications, which will make it hard to know whether correlations between people reflect genuine informed agreement or bias.

        Incidentally, another result from that big [D/I]ARPA (can’t remember which) project was that the best predictive performance came from just hiring a small number of experts, rather than relying on a crowd. While that might seem to be identical to the current grant/publication review system, it differs because the “prediction panel” are not actually working in the domains they are making predictions about and so have less incentive for bias (though I’m sure they are not immune).

      • Anonymous says:

        Thanks for the link!

        My statement still stands, though, at least at this point in time, I reason: no, “different voices and perspectives” do not necessarily lead to “a wise” crowd.

        I tried to find information on the Wikipedia page concerning the exact types of situations where the “crowd” is supposedly “wise”. I found the following at the beginning of the page, which led me to stop my search altogether:

        “A large group’s aggregated answers to questions involving quantity estimation, general world knowledge, and spatial reasoning has generally been found[dubious – discuss] to be as good as, but often superior to, the answer given by any of the individuals within the group.[citation needed]”

        Please note the “dubious – discuss” and “citation needed” parts…

        I highly doubt there is “wisdom of the crowd”, except perhaps in very specific circumstances.

        I do think it’s a great idea, and/or term, to use when proposing lots of things that seem to align with (parts of) social science and recent “collaborative” efforts, though!

        • Alex says:

          The danger in using Wikipedia to provide a quick summary is that your reader might focus on the least important part. Just because Wiki doesn’t have a citation or asks for discussion in a particular place, doesn’t mean that citations and discussion don’t exist. If you read the rest of the article you would see some references to research; here’s a mathematical model on when wisdom of the crowd would apply https://psycnet.apa.org/fulltext/2014-03872-001.pdf . You could also just google it (or google scholar should you prefer).

          • Anonymous says:

            Quote from above: “Just because Wiki doesn’t have a citation or asks for discussion in a particular place, doesn’t mean that citations and discussion don’t exist.”

            Good point! I was just using the Wikipedia link that was provided, and explaining my experience in doing so.

            Aside from that, I tried to google it but couldn’t find anything substantial. I am sure there are papers, studies, or mathematical models to be found. However, given all that might have gone wrong in academia/science in the past decades, I will probably take most (if not all) of them with a grain of salt.

            Regardless of the validity of this “wisdom of the crowd” thing, I would like to note something. If I am not mistaken, we are (again) not talking about how to do “good” science. We are not talking about why certain findings are replicable. We are not talking about the “validity” of findings. Instead we are talking about “replicability”, “crowds”, “guesses”, and, I have now learned (indirectly), “diversity” as well.

            That’s what’s most annoying about all this stuff to me.

            • Michael Bishop (@thatMikeBishop) says:

              You are correct that “replicability” is a major focus, but the Replication Markets team believes our work will also shed light on research validity. Even if we didn’t explicitly address validity at all (and we will), wouldn’t learning about replicability necessarily affect your beliefs about research validity?

              • Anonymous says:

                Quote from above: “(…) wouldn’t learning about replicability necessarily affect your beliefs about research validity?”

                Ehm, no.

                At least not in my view, and usage, of the words “replicability” and “validity”. I view and use them in the same way as the distinction between “reliability” and “validity” when it comes to tests.

                Here is a quote I took from somewhere:

                “If the scale is reliable it tells you the same weight every time you step on it as long as your weight has not actually changed. However, if the scale is not working properly, this number may not be your actual weight. If that is the case, this is an example of a scale that is reliable, or consistent, but not valid. For the scale to be valid and reliable, not only does it need to tell you the same weight every time you step on the scale, but it also has to measure your actual weight.”

        • Anoneuoid says:

          This is well known wrt machine learning:

          In contrast to standard network design in which many networks are generated but only one is kept, ensemble averaging keeps the less satisfactory networks around, but with less weight.[2] The theory of ensemble averaging relies on two properties of artificial neural networks:[3]

          • In any network, the bias can be reduced at the cost of increased variance
          • In a group of networks, the variance can be reduced at no cost to bias

          Ensemble averaging creates a group of networks, each with low bias and high variance, then combines them to a new network with (hopefully) low bias and low variance. It is thus a resolution of the bias-variance dilemma.[4] The idea of combining experts has been traced back to Pierre-Simon Laplace.[5]

          https://en.wikipedia.org/wiki/Ensemble_averaging_(machine_learning)
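          The quoted bias–variance point can be reproduced in a few lines. This is a generic illustration with over-flexible polynomial fits standing in for the “networks” (the target curve, noise level, and model class are all invented); each lone model has low bias but high variance, and their average keeps the low bias while the variance partly cancels.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dataset(n=30):
    """One noisy draw of training data from an underlying curve (sin)."""
    x = np.sort(rng.uniform(0, 2 * np.pi, n))
    y = np.sin(x) + rng.normal(0, 0.4, n)
    return x, y

x_test = np.linspace(0.5, 2 * np.pi - 0.5, 200)
y_true = np.sin(x_test)

# Train many deliberately flexible (low-bias, high-variance) models,
# each on its own noisy sample, and keep all of them.
preds = np.array([
    np.polyval(np.polyfit(*sample_dataset(), deg=7), x_test)
    for _ in range(25)
])

# Average error of a lone model vs. error of the averaged model.
single_mse = np.mean((preds - y_true) ** 2)
ensemble_mse = np.mean((preds.mean(axis=0) - y_true) ** 2)
```

          The inequality ensemble_mse < single_mse is guaranteed by the variance decomposition: the average squared error of the individual models equals the ensemble’s squared error plus the variance across models.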

        • [Anonymous] is right that a crowd is not necessarily wise: sometimes it’s a mob. But a great deal of the work on forecasting has been on establishing the conditions where an ensemble (of models) or a crowd (of people) is more likely to be wise. There’s good theory going back at least to Solomonoff’s universal predictor (1968), and good empirical results at least back to Galton. Thanks to [Alex] for good replies and links.

          [gec] is right about diversity being a proxy to reduce correlations. When aggregating surveys, you have to be clever to avoid double-counting the same evidence. If you and I make estimates based on the same NYT article, we’re highly correlated, and count closer to 1 estimate than 2. So you add check questions, or use other algorithms. Slightly different with markets, but same idea. Analogous to model ensembles, or sensor fusion: you choose uncorrelated signals, or model the correlations.
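          The “count closer to 1 estimate than 2” point has a textbook formula behind it: for n forecasters whose errors have variance σ² and pairwise correlation ρ, the crowd mean has variance σ²(ρ + (1 − ρ)/n), so correlation puts a floor of ρσ² that no crowd size removes. A quick check by simulation (an illustrative equicorrelated Gaussian model, not anything RM actually uses):

```python
import numpy as np

rng = np.random.default_rng(2)

sigma, n, trials = 1.0, 20, 20000

def crowd_mean_var(rho):
    """Empirical variance of the mean of n forecaster errors whose
    pairwise correlation is rho (equicorrelated Gaussian model)."""
    cov = sigma**2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))
    errors = rng.multivariate_normal(np.zeros(n), cov, size=trials)
    return errors.mean(axis=1).var()

v_indep = crowd_mean_var(0.0)  # theory: 1/20 = 0.05
v_corr = crowd_mean_var(0.8)   # theory: 0.8 + 0.2/20 = 0.81
```

          At ρ = 0.8, twenty forecasters are barely better than one — which is why check questions or correlation modeling are needed before aggregating.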

          The heuristic is “diversity trumps ability”, for the same reason ensembles dominate machine learning and Galton’s crowd beat his experts (to his chagrin): in the situations where we use these, no one has a complete view or correct model. But as long as everyone is attending to the signal, the more different views/models you get, the more the signal adds and the noise cancels. Obviously there are caveats. As you learn more, you upweight the successful people/models. Markets do this by giving winners more resources. Statistical methods use cleverness.
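          One simple statistical way to “upweight the successful people” is to weight each forecaster by the inverse of their past Brier score. This is a toy sketch only: the forecaster skills, the weighting rule, and every number below are invented for illustration, not taken from RM’s actual aggregation methods.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical track record: 200 resolved yes/no questions and five
# forecasters of varying (invented) skill.
n_q = 200
skill = np.array([0.90, 0.75, 0.65, 0.55, 0.50])  # prob. mass put on the truth

outcomes = rng.integers(0, 2, n_q).astype(float)
# Forecast ~= skill when the answer is "yes", 1 - skill when "no", plus noise.
forecasts = outcomes[:, None] * skill + (1 - outcomes[:, None]) * (1 - skill)
forecasts = np.clip(forecasts + rng.normal(0, 0.05, forecasts.shape), 0.01, 0.99)

# Brier score per forecaster (lower is better), then inverse-Brier weights.
brier = np.mean((forecasts - outcomes[:, None]) ** 2, axis=0)
weights = (1 / brier) / (1 / brier).sum()

# Aggregate a new question's forecasts with those weights.
new_forecasts = np.array([0.8, 0.7, 0.6, 0.4, 0.3])
crowd_prob = np.dot(weights, new_forecasts)
```

          Because the best-calibrated forecaster earns most of the weight, the aggregate leans toward their forecast rather than the unweighted crowd average.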

          So we want scientists from around the world. We also want statisticians: they will see things the psychologists won’t. We want some non-scientists for reality check. We want people across the political spectrum: it’s easier to see flaws in studies whose conclusions you dislike.

          [gec] is right that the best performance comes from a small number of experts. But it’s hard to pick these ahead of time. Consistent with decades of previous work, the IARPA ACE teams found that usual markers of expertise were poor predictors. But Tetlock and Mellers’ team made progress: see Mellers et al 2014 for a good analysis and structural equation model. As I recall, they found roles for IQ, domain knowledge, and certain cognitive traits, but strongly mediated by raw effort, which is very hard to predict. (https://journals.sagepub.com/doi/abs/10.1177/0956797614524255).

          So yes, we’re going for diversity. Politically I lean Vulcan (Infinite Diversity in Infinite Combinations) so I’m glad this seems to be the case, but I’m on contract to deliver good forecasts. Not knowing who will have the signal, we recruit widely, and use our aggregation methods to try to upweight those who perform best. (Alas, we won’t get much hard feedback until 2020 – but we have some very clever computer scientists working on proxy methods that we hope can leverage the results of previous replication markets.)

          • Anonymous says:

            Quote from above: “We want some non-scientists for reality check. We want people across the political spectrum: it’s easier to see flaws in studies whose conclusions you dislike.”

            If I am not mistaken, the “prediction” (at least in the 2015 “prediction market” study) is based on a few sentences in the form of a hypothesis. How is it possible to “see flaws in studies” if the study itself is not even being evaluated?

            (Side note: I just checked your website. Am I understanding things correctly that you are asking people to first “find a study that interests you” before forecasting? Is this not introducing a major possible and unnecessary confound/variable?)

            • Anonymous says:

              Quote from above: “If I am not mistaken, the “prediction” (at least in the 2015 “prediction market” study) is based on a few sentences in the form of a hypothesis”

              It just occurred to me that, if this is correct, even more money can be saved by this “prediction market” stuff. You could save even more resources by not only not actually replicating studies, but by not even performing the initial studies anymore.

              If it is correct that the prediction market successfully predicted “replicability” based ONLY on the hypotheses, why even perform an actual “original” study anymore! You can just formulate hypotheses all day long and have a large “crowd” guess/determine whether they are correct!

              Science!!

              • It’s quite incorrect to assume forecasts are based only on a few sentences.

                Previous work by our academic team (Dreber et al.; see citationfutures.com for papers and pre-registrations) provided not only the full paper but also a detailed synopsis. Scaling up to 3,000 claims and a whole year means our provided synopses will be scantier, but participants will have access to the full paper if they wish.

              • Anonymous says:

                Quote from above: “It’s quite incorrect to assume forecasts are based only on a few sentences.”

                Thank you for the comment!

                If I look at the 2015 Dreber et al. prediction market paper, “Using prediction markets to estimate the reproducibility of scientific research”, I can read the following three sentences in its “materials and methods” section:

                1) “For each replication, the hypothesis of the original study was summarized by one of the authors of this paper and submitted to the replication team for comments and final approval.”

                2) “Before the prediction market, the participants filled out a survey. For each study, participants were asked two questions. One was meant to capture their beliefs of reproducibility: “How likely do you think it is that this hypothesis will be replicated (on a scale from 0% to 100%)?” Participants were also asked about their expertise in the area: “How well do you know this topic? (not at all, slightly, moderately, very well, extremely well).””

                3) “Predictions were made by buying and selling stocks on the hypotheses on an interface that highlighted the forecasting functionality of the market.”

                So, for me at least, it seems that ONLY the hypotheses were used to base decisions on.

                But if people could have used the full paper to base their decisions on, is that somehow incorporated in the study design and/or conclusions? It seems a super important possible confound/variable to take into account concerning the accuracy of “forecasting”/“prediction”!

                Were participants asked whether they read the entire paper when making decisions? Were participants asked which information they used to make their decisions? Are participants even capable of accurately knowing which exact information they used to make their decisions?

              • Anonymous says:

                Quote from above: “But if people could have used the full paper to base their decisions on, is that somehow incorporated in the study design and/or conclusions? It seems a super important possible confound/variable to take into account concerning the accuracy of “forecasting”/“prediction”!”

                I guess that depends on the ultimate goal of things.

                If you want to try and find out which variables are related to “replicability”, those things would be crucial, I would reason!

                However, if you (for instance) just want to gather and/or direct and/or control lots of people to “collectively” decide things in some way this all doesn’t matter.

                Perhaps it’s all about perspective…

              • Anonymous says:

                Quote from above: “However, if you (for instance) just want to gather and/or direct and/or control lots of people to “collectively” decide things in some way this all doesn’t matter.”

                If I understood things correctly concerning all this “prediction market” and “algorithm” stuff, it may not even be the case that things are being decided “collectively”.

                Perhaps only a few people, who predicted things the best, are given very much weight. In effect, only a few people would then be deciding things.

                What if I were to come up with my own algorithm and started to participate in the “prediction market” by making up a few names or letting my friends sign up? I could use the data from my own algorithm to try and “out-predict” everyone, including the algorithm that is currently being used! It would basically be a computer playing as a person in the prediction thing, but perhaps that’s allowed (?)

                If only a few people could be given lots of weight in all the algorithms, I would then effectively control the whole thing. If I were so inclined (I am not), I could then start my attempt at world domination, or make lots of money, or do something else!

                Thank goodness I don’t like, and/or participate in, all this “prediction market” stuff!

              • Anonymous says:

                Quote from above: “I guess that depends on the ultimate goal of things. If you want to try and find out which variables are related to “replicability”, those things would be crucial, I would reason! However, if you (for instance) just want to gather and/or direct and/or control lots of people to “collectively” decide things in some way this all doesn’t matter. Perhaps it’s all about perspective…”

                The goal of science, as far as I understand it, is explaining and understanding things. That is in line with wanting to find out which variables are related to “replicability”. Merely gathering, and/or directing, and/or controlling lots of people to “collectively” predict “replicability” is not in line with understanding and explaining things, and therefore, I reason, is not scientific.

                I don’t know exactly what it is. I do think something like that fits very nicely with some “New World Order of Social Science”, should someone want to try and set that up! It’s the almost perfect way to “crowdsource” lots of people to get them to do what you want, to direct and control things, and to not really make clear why and how something is happening.

                You would just be asking lots and lots of people for their prediction concerning “replicability”. You wouldn’t really know what information the people used, though, and you wouldn’t have learned anything whatsoever about the variables that might be related to explaining and understanding “replicability”. However, none of that would matter, because you would mostly not really be talking about science, and how to do it, anyway! Instead you could mainly be talking about, and repeatedly hearing, and yelling, things like “incentives” and “open science” and “crowdsourcing” and “inclusion” and stuff like that.

                (Side note: It has already been shown that repeatedly hearing, and yelling, these types of words may cause (some) scientists to not really think about science, and scientific things, that much anymore. You could probably say you are about “open science” and “transparency” and do the exact opposite, and not many folks will even spot it, let alone bring it to your attention; e.g. see Hardwicke & Ioannidis, “Mapping the universe of Registered Reports”, 2018.)

              • Anonymous says:

                Quote from above: “(Side note: It has already been shown that repeatedly hearing, and yelling, these types of words may cause (some) scientists to not really think about science, and scientific things, that much anymore. You could probably say you are about “open science” and “transparency” and do the exact opposite, and not many folks will even spot it, let alone bring it to your attention; e.g. see Hardwicke & Ioannidis, “Mapping the universe of Registered Reports”, 2018.)”

                I bet you could even talk about the importance of “replications” in science, and then slowly make it about “replicability”, until finally you would be proposing and/or doing the exact opposite of DOING ACTUAL REPLICATIONS!

                I think you could definitely come up with such a project, if you were so inclined of course. If there were such a thing as “The New World Order of Social Science”, for instance, I think they could definitely try to pull something like that off.

                They should just remember to yell and shout things like “diversity”, “incentives”, “collaboration”, etc. so people don’t really think about science, and scientific things.

  3. As a kind of reality check, it would be interesting to set up a similar replication market for findings in physics. There’s some crazy stuff in quantum physics and general relativity. I am skeptical that getting a diverse set of views is going to help identify validity/replicability among those findings.

  4. Thanatos Savehn says:

    Apparently I need to watch more tennis because I signed up for Replication Markets (RM) and managed to make a bad prediction about the predictability of the felt emotions of faceless tennis players:

    “Original Study: p-value = 0.0000000015, standardised effect size (r) = 0.961, sample size (n) = 15. “…we first used peak expressive reactions to winning and losing points in professional high-stakes tennis matches that typically evoke strong affective (emotional) reactions…”

    Did the replication find a statistically significant effect in the same direction as the original result? (i.e. images of bodies without faces reveal how tennis players feel)”

    I won’t post how the replication came out but anyway, it (the RM) is pretty fun so far (just preliminaries and practice runs) and I suspect many of you would find it enjoyable. But if you do, pray that I don’t wind up on your team.
