What should JPSP have done with Bem’s ESP paper, back in 2010? Click to find the surprisingly simple answer!

OK, you all remember the story, arguably the single event that sent the replication crisis into high gear: the decision of the Journal of Personality and Social Psychology to publish a paper on extra-sensory perception (ESP) by Cornell professor Daryl Bem and the subsequent discussion of this research in the New York Times and elsewhere.

The journal’s decision to publish that ESP paper was, arguably, a mistake. Or maybe not, as the publication led indirectly to valuable discussions that continue today.

But what should the editors have done? It’s a tough choice:

Option A. Publish an article when you’re pretty sure its theories and conclusions are completely wrong; or

Option B. Reject an article that has no obvious flaws (or, at least none that were noticed at the time; in retrospect the data analysis has big problems, see section 2.2 of this article for just one example).

Before going on, let me emphasize that top journals reject articles without obvious flaws all the time. A common reason for rejecting an article is that it’s not important enough. What about that article on ESP? Well, if its claims were correct, then it would be super-important. On the other hand, if there’s nothing there, it’s not important at all! So it’s hard to untangle the criteria of correctness and importance. Here I’m just pointing out that Option B is not so unreasonable: JPSP is allowed to reject a paper that makes big claims about ESP, just as it’s allowed to reject a paper that appears to be correct but is on a topic that they judge to be too specialized to be of general interest.

Anyway, to continue . . . the choice between options A and B is awkward: publish something you don’t really want to publish, or reject a paper largely on theoretical grounds.

But there’s a third choice. Option C. It’s a solution that just came to me (and I’m sure others have proposed it elsewhere), and it’s beautifully simple. I’ll get to it in a moment.

But first, why did JPSP publish such a ridiculous paper? Here are some good, or at least reasonable, motivations:

Fairness. Psychology journals routinely were publishing articles that were just as bad on other topics, so it doesn’t seem fair to reject Bem’s article just because its theory is implausible.

Open-mindedness; avoidance of censorship. The very implausibility of Bem’s theories could be taken as a reason for publishing his article: maybe it’s appropriate to bend over backward to give exposure to theories that we don’t really believe. The only trouble with this motivation is that there are so many implausible theories out there: if JPSP gives space to all of them, there will be no space left for mainstream psychology, what with all the articles about auras, ghosts, homeopathy, divine intervention, reincarnation, alien abductions, and so forth. Avoidance-of-censorship is an admirable principle, but in practice, some well-connected fringe theories seem to get special treatment. (Medical journals do, from time to time, publish articles on the effectiveness of intercessory prayer, which typically seem to get more publicity than their inevitable follow-up failed replications.)

What if it’s real? Stranger phenomena than ESP have been found in science. So another reason for publishing a paper such as Bem’s is that it’s possibly the scoop of the century. High-risk, high-reward.

Ok, now, here it is . . . what JPSP could have done:

Option C. Don’t publish Bem’s article. Publish his data. His raw data. Raw raw raw. All of it, along with a complete description of his data collection and experimental protocols, and enough computer code to allow outsiders to do whatever reanalyses they want. And then, if you must insist, you can also include Bem’s article as a speculative document to be included in the supplementary material.

My proposal—which JPSP could’ve done in 2010, had “just publish the raw data” been considered a live option at the time—flips the standard scheme of scientific publication. The usual way things go is to publish a highly polished article making strong conclusions, along with statistical arguments all pointing in the direction of said conclusions—basically, an expanded version of that five-paragraph essay you learned about in high school—and then, as an aside, some additional data summaries might appear in an online supplement. And, if you’re really lucky, the raw data are in some repository somewhere, but that almost never happens.

I’m saying the opposite: to the extent there’s news in a psychology experiment, the news comes from the design and data collection (which should be described in complete detail) and in the data. That’s what’s important. The analysis and the write-up, those are the afterthoughts. Given the data, anyone should be able to do the analysis.
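
To make that concrete, here is a minimal sketch (Python; the file name, column names, and the 50% chance baseline are illustrative assumptions, not Bem’s actual materials) of the kind of reanalysis script that could sit alongside a published raw dataset:

    # Hypothetical reanalysis of a published raw-trial file: one row per trial,
    # with a subject id and a binary "hit" column (1 if the target was chosen).
    import pandas as pd
    from scipy import stats

    trials = pd.read_csv("bem_raw_trials.csv")  # hypothetical file name

    # Per-subject hit rates, then a one-sample t-test against a 50% chance rate
    # (the baseline here is an assumption, purely for illustration).
    hit_rates = trials.groupby("subject")["hit"].mean()
    result = stats.ttest_1samp(hit_rates, 0.5)
    print(f"mean hit rate = {hit_rates.mean():.3f}, "
          f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")

Anyone with the data file could run this, or swap in whatever analysis they prefer; that’s the whole point.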

Now apply this to Bem’s ESP research. The value, if any, in his experiments comes from the data. But that was the one thing that the journal didn’t publish! Instead they published pages and pages of speculations, funky theory, and selective data analysis.

Let’s go back and see how Option C fits in with JPSP’s motivations:

Fairness. Publishing Bem’s data is fair, and the journal could do the same for any other research projects that it deems to be of sufficient quality and importance.

Open-mindedness; avoidance of censorship. Again, what better argument can Bem offer the skeptics than his raw data? That’s the least censored thing possible.

What if it’s real? If it is, or could be, real, we want as many eyes on the data as possible. Who knows what could be learned. The very importance of the topic, which motivates publication, should also motivate full data sharing.

I like this solution. I guess the journal wouldn’t only publish the raw data and code; they’d also want to publish some basic analyses showing the key patterns in the data. But the focus is on the data, not the statistical analysis.

Option C is not a panacea and it is not intended to resolve all the problems of a scientific journal. In particular, they’d still have to decide what to publish, what to reject, and when to request revise-and-resubmit. The difference is in what gets published. Or, to be more precise, in what aspects of the publication are considered necessary and which are optional. For the Bem paper as published in JPSP, the writeup, the bold claims, and the statistically significant p-values were necessary; the data were optional. I’d switch that around. But it wouldn’t go that way for every project. Some projects have their primary value in the data; for others, it’s the analysis or the theory that matters most.

In the example of the ESP study, if anything’s valuable it’s the data. Publishing the data would get the journal off the hook regarding fairness, open-mindedness, and not missing a scoop, while enabling others to move on to reanalyses right away, and without saddling the journal with an embarrassing endorsement of a weak theory that, it turns out, was not really supported by data at all.

50 thoughts on “What should JPSP have done with Bem’s ESP paper, back in 2010? Click to find the surprisingly simple answer!”

  1. You know Andrew, I have been wondering how in the heck such a ridiculous paper made such waves. It may be b/c there are folks who can point to individuals who they think have ESP. I’ve met many educated individuals here in DC who hold to this view. I may countenance the view that some are better able to predict some events than others. But there are plausibly good reasons for their ability to do so.

  2. Andrew uses the word “theory” in the above article many times and I wish he had used instead something like:

    hypothesis, thesis, conjecture, supposition, speculation, postulation, postulate, proposition, premise, surmise, assumption, presumption, presupposition, notion, guess, hunch, feeling, suspicion

    The word “theory” in ordinary language has the unfortunate aspect of being inherently inferior to the real stuff known as data. On the other hand, in science, THEORY is the highest accolade that can be bestowed, as in electromagnetic theory, theory of evolution, quantum theory, etc. THEORY roughly means explaining outcomes that have happened and outcomes that would happen if certain experiments would be performed.

    Conflating theory with THEORY results in pointless arguments with those who believe in ESP, literal interpretation of the Bible, and that Mexico will pay for the wall.

  3. Quote from the blog post above: “For the Bem paper as published in JPSP, the writeup, the bold claims, and the statistically significant p-values were necessary; the data were optional. I’d switch that around.”

    Let’s assume that Bem may have “p-hacked”, or manipulated some other things that got him the findings presented in his paper. Now, to get published in a fancy journal, i would reason that “switching things around” and requiring the raw data could simply lead to similar problematic issues.

    Let’s say JPSP “switched things around” and would now require publishing raw data, and i wanted to get published in JPSP. What’s stopping me from doing equivalent stuff (compared to p-hacking, selective reporting, etc.) to raw data?

    I’m not sure what is meant by “raw data”, but if it’s just a certain file/source with a lot of words and/or numbers in it that hasn’t been processed much, i would reason that could easily be manipulated and/or fabricated in some way or form.

    I only see 3 possible, partial (and probably representing an increasingly gloomy outlook) solutions to prevent Bem 2.0 in the future:

    1) get rid of all publishing barriers in psychology (like editors, peer-reviewers, journals) that possibly facilitate, or even encourage, “bad” scientific practices because authors want to be published in journal X, and/or

    2) try and minimize the possible chances that researchers could possibly want to manipulate or fabricate data (e.g. money?, power?, politics?, etc.?), and/or

    3) simply don’t trust anything based on (raw) data that can (easily) be manipulated and/or fabricated.

    • Anon:

      If the data could be fabricated, that’s another story: I guess you’d want some combination of time stamps and sworn testimonials. Some journals do, for example, require authors to sign a form declaring no conflict of interest. There could be a similar form declaring that the data have not been altered. People could still cheat, but it would be harder to cheat if you’re required to sign something, as it would shut off one escape, which is the claim made by some researchers that they didn’t really know what was going on in their labs.

      • Quote from above: “I guess you’d want some combination of time stamps and sworn testimonials”

        Yes, that makes sense i think.

        This comment allows me to point to something that i never understood. It seems to me that a lot of “open science” and “let’s improve matters” people seem to highly value open data, but not so much a publically available pre-registration (with the combination of things like time stamps, testimonial, planned analyses, etc.).

        In my reasoning, open data is (or will be) pretty useless without a publically available pre-registration with things like a time stamp, testimonial, planned analyses, etc..

        To illustrate my point: here is the data from Simmons, Nelson, and Simonsohn’s “False positive psychology” paper for instance: https://openpsychologydata.metajnl.com/articles/10.5334/jopd.aa/

        Now, if i understood things correctly, they could have published (parts of) their data with their “False positive psychology” paper at the time (and maybe even got a nice “open science badge” for doing that!), and written in their paper that they all predicted their findings from the start. If they would have done that, would we now all (have to) “believe” their findings that listening to The Beatles’ “When i’m 64” makes you younger (or whatever their “finding” was) because it has “open data”?

        Even leaving aside the issues of how easy it could be to fabricate their “open data”, in my reasoning their open data would be pretty useless if they would not include a pre-registration that would show whether or not there has been flexible data-analysis and -collection.

      • I think that Andrew’s Option C assumes the present situation that people are more likely to trust data than analyses. This to some extent is because data are less likely to be fudged than are analyses. But if Option C were adopted, I would guess that people would become more likely to fudge data, so that time stamping and testimonials would be needed to help prevent that.
        But this suggests that perhaps some sort of testimonial regarding analysis should also be required.

    • I think some of us have enough of a mind to ask good questions about any data, raw data included, whether manufactured or manipulated or not. But how many have access to the ‘raw data’ to begin with?

      I think that Rex Kline put it beautifully in one of his lectures: using NHST lends itself to ‘trained incapacity’. It’s in his video Hello Statistics, a lecture on statistical reform proposals. However, what does that mean for the uses of confidence intervals, effect sizes, estimation, etc. that Kline endorses and that are elaborated in Introduction to the New Statistics: Estimation, Open Science, & Beyond?

      • Quote from above: “How many have access to the ‘raw data’ to begin with”

        If i were “Bem 2.0” i would try and make sure only a very small group of people even had access to the raw data. If i would really aim high and want everyone to hear about (and believe) my ESP research, i could even try and set up some sort of “collaboration” with labs all over the world. They would all perform my ESP research, but i would make sure the data-collection would all go via a single computer: the one in my own lab! That way i could control a lot of things.

        Then i would make sure i myself, and/or a couple of my friends, would be the only ones who had access to the raw data. I could manipulate, and perhaps partially fabricate, a lot of things that way. Nobody would even find out, because all the labs across the world wouldn’t even have access to anything!

        They probably couldn’t even (easily) check whether their labs even collected data, so if i would say that their data was not included due to “technical difficulties”, they couldn’t even (easily) verify whether this is correct or not.

        It would also make it nearly impossible for skeptical researchers to replicate my ESP work, because of the thousands of participants i would use which they could probably never match. The ESP “findings”, and possible subsequent dozens of post-hoc analyses, based on the data-set coming from this “collaborative” project would basically be uncorrectable…

        The only difficulty would be to get all of these labs to join me, and do what i want. Perhaps it would help, to make this all happen, to throw in all kinds of “buzzwords”. Buzzwords like “crowdsourcing science”, and all that kind of thing. I could try and make sure everyone would know that they would be involved with something “collaborative”, and that they would be “changing the incentives”, and that they would “save resources” this way. I could even try and come up with a fancy name for my project…

        That’s what Bem 2.0 could say and write…

    • My hypothesis is that intentional scientific fraud is rare, but I don’t know the exact story of the Bem ESP paper.

      I think the more common situation is that people either don’t put in the effort up front in designing the experiment and/or iterate on the analysis until they produce “significant” results. They may not realize that the results of their analysis are flawed. Publishing raw data would help by placing more emphasis on the experimental design and allowing others to execute alternative analyses.

  4. I’ve been thinking about a similar idea for a journal that would be called “Hackable Science,” with many sub-journals (Hackable Science: Social Psychology, Hackable Science: Archaeology, Hackable Science: Ecology, etc.). It would be based on GitHub, Gitlab, Bitbucket, or similar. These platforms have already developed almost all the features that a journal would need, so that huge startup cost is avoided.

    The only requirement for submission would be that all data and code are included. There would not be any additional filtering at the publication stage. Once you submit your pull request, your article is published, as, let’s say, version 0.1.

    There would then be a review stage. To minimize the burden of running the journal, I currently envision that the authors would select an Editor for their article, who would then select 2 or 3 reviewers. Reviews would be handled by the “Issues” feature of GitHub, etc. The Editor would have to disclose their relationship to the authors. So the authors could pick, e.g., one of their advisors to be Editor, but she would have to disclose that. Authors who wanted their study to be taken seriously would try to get a prominent person in their field to serve as Editor, but one with whom they have little or no personal relationship. I also envision that the reviewers would not be anonymous. Since the article is already published, there is no reason to blame the reviewers for preventing your article from getting published. Non-anonymous editors and reviewers would hopefully not want their names attached to sloppy and inadequate reviews, so there would be an incentive to do a good job. Responses to reviews would be handled via issues and revisions to the ms and/or code, all via version control.

    Once (and if?) the Editor is satisfied that the authors have responded adequately to all reviews (all issues have been closed), she would certify a “release” at v. 1.0. However, further issues could be submitted by other readers at any time, and the authors could choose to respond to them. If they did, there could be future versions of the article (1.1, 1.2, …). GitHub is integrated with Zenodo, a repository archiving service that can provide DOI’s, including versioned DOI’s, making it easy to cite the article and its various versions. The “importance” of an article could be evaluated by the number of stars, and also by the reputation of the Editor, which would help other researchers to find the “good stuff.”

    Importantly, other researchers could fork the article if they wanted, creating their own new study using the same data. They could also go through a review process (or not). The mantra would be: Hack the science!

    • Ed:

      I like the idea of the author suggesting an editor who then chooses the reviewers. This would level the playing field, as everyone would get the same treatment as the friends of Robert Sternberg etc. in psychology. And, now that these luminaries would no longer have these favors to give out, it would reduce their power.

    • Ed, this is such a simple idea for a journal workflow that would get you about 75-80% of the way to a proper journal. All you might need on top is a membership with CLOCKSS for digital preservation and to fill in a form with your national library to get an ISSN and you’re in business.

      It’s also hilarious because of the millions poured into journal management systems by all the major publishing companies from Highwire and Clarivate to PLoS (which wasted tons on Aperta before scrapping it) to the open access managers like PKP’s OJS and CoKo’s forthcoming platform. Could have just used Github. Hahaha.

  5. The same problems arose in the “Memory of water” affair
    https://en.wikipedia.org/wiki/Water_memory

    Nature editors faced the same “dilemma” – claims that they felt were potentially important and that could not be invalidated on the basis of the paper’s contents (NB. that probably would also have been the case with the raw data). They published, under the condition that they could observe a replication, which didn’t work once things were blinded.

    I would argue that both papers should have been rejected on the grounds that they did not provide the slightest plausible mechanism for results that contradicted huge amounts of settled science. In other words: “extraordinary claims require extraordinary evidence, not extraordinary publicity”. Of course, the data should be published anyway.

    • Boris:

      I agree that it would’ve been a reasonable decision for JPSP to have simply rejected the Bem article, and I thought it was a big mistake, even prospectively, for Statistical Science to have published that Bible Code paper back in 1994. As always, one big problem with these gestures of open-mindedness is that journals are open-minded toward certain sorts of groundless speculation but not others.

      In any case, my post above was all conditional on JPSP’s decision to not simply reject Bem’s paper. The editors of the journal had the option to reject the paper when it was submitted, and they chose not to do so. The point of my above post is that if, for whatever reason, you don’t want to reject such a paper, there are other alternatives than simply publishing it as is. I prefer the alternative of publishing the data without the conclusions—rather than what they actually did, which was to publish the conclusions without the data.

  6. Another potential “problem” with this approach is that it wouldn’t be considered fair. There would be two different tiers of published science: mainstream science that is allowed to speculate and hide data versus “fringe” science that is required to show data and is not allowed to tell a story or speculate. And the line between mainstream with hidden data and fringe with public data would depend on biased editors and reviewers.

    I think that we’ve all seen enough questionable published science that most of us would want to resolve any such dichotomy by requiring that all data be published (not just “accessible from the authors upon request”). But there are too many vested interests in hidden data and perceived costs to make that a reasonable goal in the near future.

    • Re: Mainstream vs fringe science

      This strikes me as a false dichotomy. There are other approaches that may be enlisted. I am a proponent of filing an exploratory pre-registration protocol, which would give audiences a franchise in seeing the data.

      Re: Anonymous > ‘This comment allows me to point to something that i never understood. It seems to me that a lot of “open science” and “let’s improve matters” people seem to highly value open data, but not so much a publically available pre-registration (with the combination of things like timestamps, testimonial, planned analyses, etc.).’

      Why do you suppose that it might be so?

      • Quote from above: “Why do you suppose that it might be so?”

        I have been thinking about this a lot. All i can think of is the following 3 explanations:

        1) i am not understanding things correctly,
        2) they are not understanding things correctly,
        3) they are knowingly emphasizing “open data” over “open pre-registration” in order to make it look like they are “improving things” even though it is not really a solution/improvement but only looks like one

  7. But what about all the data plundering through various different genres of pictures, & other stuff they didn’t use? I doubt they kept a record of how many chances they gave themselves to find something, especially as Bem told them that’s just what they needed to do.

  8. I don’t know what the JPSP editors SHOULD have done, but in retrospect I am delighted that they published the article as they did. The response to it highlighted so many important flaws in the system that needed changing — and so promoted change. Issues include (but are not limited to):

    1. Bem’s small sample size / questionable analyses — but that he should be published because THAT’S WHAT EVERYONE ELSE WAS DOING.
    2. Publication bias. If they hadn’t published it, it would have been a form of publication bias where editors don’t publish things that disagree with prevailing thought even though they would have published a similar paper that agreed with it. *
    3. Some of the loudest empiricists stated (mostly implicitly) that NO AMOUNT of evidence would ever convince them that ESP (or pre-cognition) exists. Question: If your priors for something are ZERO are you a scientist?

    * “Extraordinary claims require extraordinary evidence” debate to take place elsewhere.

    • Bobbie:

      I disagree with your point #2. As I wrote in my post above, maybe it’s appropriate to bend over backward to give exposure to theories that we don’t really believe. The only trouble with this motivation is that there are so many implausible theories out there: if JPSP gives space to all of them, there will be no space left for mainstream psychology, what with all the articles about auras, ghosts, homeopathy, divine intervention, reincarnation, alien abductions, and so forth. Avoidance-of-censorship is an admirable principle, but in practice, some well-connected fringe theories seem to get special treatment. Why is JPSP not publishing papers on ghosts? Only because JPSP publishes papers that are part of the psychology-professor club, and the psychology-professor club currently has room for papers on ESP, air rage, himmicanes, ages ending in 9, etc., but not ghosts.

      • Andrew,

        Although constituting a seemingly small percent, very educated social scientists &, yes, lawyers, doctors, and engineers are fascinated with these happenings you listed above. I myself was astounded. Why do you suppose X Files was so popular? It is an audience market, which draws from this small percent which believes, in different degrees/contexts, in the existence of ghosts, auras, aliens, ESP, etc. These hypotheses may be implausible to you, but they constitute niches for different venues. As I recall there was a big to-do over Area 51 back in early 2000.

        Ghosts though may be lower on the ‘believable’ index among the well-connected fringe. Therefore they don’t get special attention. LOL

        • Sameera:

          I did a very quick search and came across this news article which states that 45% of Americans say they believe in ghosts, 65% believe in the supernatural, and 65% believe in God. And I’m sure there’s an academic literature on ghosts. It just doesn’t get the respect of psychology professors. Himmicanes etc. are a sort of modern supernatural belief which have just enough support from popular psychology theories that they can get academic respectability.

        • Thanks for the article link. I was thinking of the subset with graduate-level education and in professions where one might expect such subsets to doubt the existence of Ghosts, Aliens, etc. I haven’t come across any psychologists who do. Then again I was surprised that doctors and lawyers may believe in the supernatural.

        • I fully support Bobbie’s point #2. If we don’t want science to end up as an echo chamber for the ideas and theories of “prominent academics,” it is vital that we remain open to publishing ideas which we ourselves view as ridiculous.

          It is precisely the role of statistical science to be the arbiter of what is acceptable to publish (important is another matter). If you do your methodology “right” and do your analysis “right” you shouldn’t get some naysayer with institutional power saying “well I just don’t believe it.”

          As a side note: We as statisticians don’t have a unified set of recommendations and solutions to the replication crisis, so the fault lies with the arbiters (i.e. us) not with the researchers.

        • As a long-time member of the “arbiter” community (i.e. doing statistics every day to earn a living) one thing I’ve always preached is “You don’t know anything unless you know the denominator”. In a way, that’s the bedrock problem with all this “replicability” nastiness. Unless there’s some way to know everything that a researcher could have done with all the data he has, rather than just what he says he did with the data he chose to use, then there is no way to vet his work except by pointing out obvious (at least in retrospect) howlers in his methods.

        • I’d still argue that that is a failure on our part. We require researchers to jump through hoops (e.g. present p-values, describe missingness, specify the statistical analysis), and the result is that we are unable to evaluate whether the work is good or not with the information we asked for. How ridiculous is that?

          I think we could fix this fairly easily while still allowing researchers full freedom to explore and craft their analysis in response to the data they get. We could simply require that the data be split into an exploration set and a validation set and blind the researcher to the validation set until submission for publication. Upon submission the blind is broken and the same analysis that was decided upon in the exploration set is run on the validation set (ideally by an independent party).

          Since multiple papers are often written on the same data, each paper should record the number of times the blind was broken previously. This would allow editors and readers to evaluate the potential for cross-contamination.

          The additional burden on researchers would be that they would have to accumulate <=2 times the sample size they were planning, which is a significant burden, but one I view as reasonable to solve such a central issue to the field.
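
          A minimal sketch (Python, purely illustrative; the dataset, the 50/50 split, and the t-test are assumptions, not a prescription) of the exploration/validation scheme described above:

            import numpy as np
            import pandas as pd
            from scipy import stats

            rng = np.random.default_rng(seed=1)

            # Hypothetical dataset: a treatment indicator and an outcome, no true effect.
            full = pd.DataFrame({
                "treated": rng.integers(0, 2, size=400),
                "outcome": rng.normal(size=400),
            })

            # 50/50 random split; the validation half stays "blinded" until submission.
            mask = rng.random(len(full)) < 0.5
            exploration, validation = full[mask], full[~mask]

            def group_test(df):
                a = df.loc[df["treated"] == 1, "outcome"]
                b = df.loc[df["treated"] == 0, "outcome"]
                return stats.ttest_ind(a, b)

            # Explore freely on `exploration` (subgroups, alternative outcomes, ...);
            # whatever analyses are settled on must be rerun, unchanged, on
            # `validation` at submission time and reported alongside them.
            print("exploration:", group_test(exploration))
            print("validation :", group_test(validation))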

        • Quotes from above: “We could simply require that the data be split into an exploration set and a validation set and blind the researcher to the validation set until submission for publication”

          &

          “Since multiple papers are often written on the same data, each paper should record the number of times the blind was broken previously. This would allow editors and readers to evaluate the potential for cross-contamination.”

          (I think) this comment allows me to ask, for the 2nd time, a question i received no answer to: https://statmodeling.stat.columbia.edu/2019/01/21/of-butterflies-and-piranhas/#comment-953583.

          If i am not mistaken, even randomly produced large data sets may contain many spurious “statistically significant” correlations. I can imagine many spurious significant correlations, and other type of statistical findings, are present in actually collected large data sets (also see https://projects.fivethirtyeight.com/p-hacking/).

          When you would have, let’s say, 10 hypotheses with associated statistical tests and then collect the large data set, you would probably make an adjustment for the p-values. But i’m not sure something like this happens, or is even possible, for later analysis of a pre-existing large data set.

          Also, if i understood things correctly, a p-value of a certain statistical analysis is only “valid” if you decided upfront what the statistical analysis will be before you analyze the data. This also makes it easier to then make an adjustment for the p-values in the case of multiple analyses.

          If this is correct, i wonder how this all relates to analyzing pre-existing large data sets. I wonder if the assumptions of p-values and/or the statistical analysis could actually make p-values coming from analyses of large pre-existing data sets technically “invalid” (e.g. because you are not able to correct for multiple comparisons).

          For instance, i would reason that the larger, and more elaborate the data set is, the easier it is to find “statistically significant” findings that are simply spurious. Even when splitting the data in an “exploratory” and “validation” set.

          I am not sure if the following makes sense, but it is based on a hunch. Please correct me if this makes little sense:

          Isn’t examining large pre-existing data sets this way like p-hacking on steroids? I reason there will probably be no correction for multiple analyses, i think there is (perhaps even a higher) chance that people will (perhaps unconsciously) engage in HARK-ing, and due to the size of the data set i think the “exploration” set would probably be very similar to the “validation” set which means any spurious “finding” coming from the exploration set will have a higher chance of being “confirmed” in the validation set (compared to smaller data sets).

        • @Anonymous, yes you are correct that if your paper contains a bunch of tests, then some are likely to be significant even if none of the effects are present. Traditionally, you’d expect 5% of tests to be significant if there were no effects. And a reviewer or reader can read the paper and do their own Bonferroni corrections (or whatever) and judge how much to trust any one result in the presence of many comparisons.

          What happens in practice is that not all tests are reported (e.g. a subgroup comparison would have been reported only if significant) and outcomes are defined to get maximal significance (e.g. defining the time scale or degree of improvement in a way to maximally differentiate between control and treatment). This, which Andrew calls the garden of forking paths, is the heart of the replication crisis.

          Having a validation set that only gets unblinded at publication submission would remove the garden of forking paths. All tests run on the exploration set are run on the validation set and are reported in the paper. It ensures the internal validity of the statistical results that you report (i.e. the ones run on the validation set).

          Regarding some of your other points:

          1. From a technical perspective spurious significant tests will occur at the same rate in large and small datasets (5% of the time). That said, with large datasets you have the ability to detect small changes, so unimportant differences are often statistically significant.

          2. If you come to an existing dataset with a well-thought-out hypothesis that you wish to test and a reasonable statistical plan for doing so, then your “significance” results will be valid in the traditional sense. If you data mine an existing dataset looking for interesting relationships… not so much.
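
          A small simulation sketch (Python, illustrative numbers only) of the 5%-of-tests point and the Bonferroni correction mentioned above: with 20 tests of true nulls there is roughly a 64% chance of at least one raw p < .05, while dividing alpha by the number of tests brings the family-wise rate back to about 5%:

            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(seed=2)
            n_sims, n_tests, n, alpha = 2000, 20, 50, 0.05

            any_raw = any_bonf = 0
            for _ in range(n_sims):
                # One simulated "paper": 20 two-sample t-tests, all nulls true.
                pvals = np.array([
                    stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
                    for _ in range(n_tests)
                ])
                any_raw += (pvals < alpha).any()
                any_bonf += (pvals < alpha / n_tests).any()

            print("at least one raw p < .05:            ", any_raw / n_sims)   # ~0.64
            print("at least one Bonferroni-adjusted hit:", any_bonf / n_sims)  # ~0.05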

        • Thank you for the reply!

          I am not (yet) convinced though, which i hope to make clear with the following.

          1) You wrote: “Having a validation set that only gets unblinded at publication submission would remove the garden of forking paths”

          I don’t understand how this could ever work in practice. If you can explore all you want in the exploration set (e.g. 20 analyses), and then verify your findings with the validation set (e.g. 1 “significant” finding you found in your exploration set), i am reasoning you are simply engaging in p-hacking but withholding the 19 analyses you performed in your exploration set.

          Now, if “All tests run on the exploration set are run on the validation set and are reported in the paper.” then i can see how this removes flexibility in which analyses get reported. But i don’t understand how this could possibly work in practice. How can anyone keep track of *all* the analyses performed in an exploration set so that the exact same analyses can be performed in the validation set?

          Also, if things were done like this (which again, i see as very hard if not impossible to do in practice), why not simply pre-register your planned analyses and perform them on the *total* data set when data collection is finished? That would boil down to the same thing, i would reason, from the perspective of trying to stop flexibility in what gets analyzed relative to what’s actually reported(?)

          2) You wrote: “From a technical perspective spurious significant tests will occur at the same rate in large and small datasets (5% of the time). That said, with large datasets you have the ability to detect small changes, so unimportant differences are often statistically significant.”

          Yes, i understood that with large data sets (large N) unimportant differences are often statistically significant. This is one of my worries of (a possible focus on) large data sets, and examining pre-existing large data sets.

          I however also reason that the chances are that large data sets will contain not only more participants (larger N), but also more variables measured. This is the other meaning of what i meant by “large” data sets. If this makes sense, i reason large data sets (as in more variables measured) could have more spurious findings.

          Findings that can in turn be “mined”, and findings that may be hard to ever correct due to the large N that can almost surely never be matched by any “skeptical” researchers (e.g. compare that to Bem’s research, and the many replications that followed it).

          (Also see Meehl’s “everything correlates with everything” law in this regard: https://journals.sagepub.com/doi/10.2466/pr0.1991.69.1.123)

        • Perhaps working with your example will make it clearer. Suppose that you are a researcher and do 20 tests, that the null hypothesis is in fact true in all these tests and that one of the tests comes out as “significant.” The probability of observing at least one significant finding is

          1 – .95^20 = 64%

          and the probability that you observe exactly one is 20 * .05 * .95^19 = 38%, which is not very compelling. However, if the reader doesn’t see that you did 20 tests, only the one that you reported, they will think it is 5%.

          Now consider the case where you have a split sample. The key here is that the researcher doesn’t keep track of all the analyses that they have done, they just report the findings they think are real. In this case, the researcher is chasing noise in the exploration set, and they may think that they’ve discovered a real finding when they write up the one “significant” test into a paper. However, when they unblind it will replicate in the validation set only 5% of the time, just as we would expect.

          The reader doesn’t have to worry about how many analyses they did in the exploration set because these are irrelevant when considering the sampling distributions of the analyses done in the validation set.
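
          For anyone who wants to check the arithmetic above, the two closed-form numbers take a couple of lines of Python:

            # 20 independent tests of true nulls at alpha = 0.05
            alpha, k = 0.05, 20
            p_at_least_one = 1 - (1 - alpha) ** k               # 0.6415... ~ 64%
            p_exactly_one = k * alpha * (1 - alpha) ** (k - 1)   # 0.3774... ~ 38%
            print(round(p_at_least_one, 2), round(p_exactly_one, 2))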

        • Quote from above: “However, when they unblind it will replicate in the validation set only 5% of the time, just as we would expect. ”

          Thank you for all the replies! I am really bad with maths and statistics, and i can’t follow your calculations. I can only try and think about matters on a more global, abstract level. So, please correct me if i am wrong about the following.

          It is exactly this way of global, abstract thinking though (that might be severely flawed!) that leads me to question your comment quoted above.

          I would reason that when you randomly split up a very large data set, that the exploration set and validation set are more likely to be similar concerning the values of the various variables measured compared to smaller data sets.

          Now, if this is correct, i would reason that with very large data sets the exploratory findings will almost always be also found in the validation data set.

          Now, if this is correct, and i would perform 20 analyses on the exploration set and would find 1 statistically significant analysis and would subsequently perform only that single one on the validation set, i would think a) that it would probably be statistically significant, and b) that its significance level should actually have been corrected for multiple testing given the other 19 analyses i did on the exploration set, but that is now not done due to splitting the data set and treating it like it’s a new set (replication?) or something.

          If this all makes sense, i stick by my (very possibly flawed!) reasoning that splitting up very large data sets in exploratory and validation sets, and then picking a sub-set of your analyses you performed on the exploratory data set and performing them to the validation set, is similar to p-hacking or HARK-ing or walking through the garden of forking paths.

          I wonder if my hunch can be checked by making some random data sets with spurious (and perhaps true) effects, and with an increasingly large number of participants. Then you could split up all the data sets in 2 (exploration set and validation set: 50%-50% or 80%-20% or whatever) and then see if, and how, the validation analyses differ from the exploratory analyses given the number of participants of the different data sets.

          Anyway, this will all probably make no sense. As i said, i am really bad with statistics and math. Thank you for the reply though, it will possibly be helpful to others who can understand it better than i can.
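
          The check proposed just above is easy to run. Here is a minimal sketch (Python; pure-noise data, 20 candidate correlations, a 50/50 split, all of it illustrative). Because a random split of independent noise gives two independent halves, the cherry-picked exploratory “finding” should come out significant in the validation half only about 5% of the time regardless of sample size, which is Ian’s point; running the loop is one way to see whether the hunch about very large data sets holds up:

            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(seed=3)
            alpha, n_predictors, n_sims = 0.05, 20, 500

            for n in (100, 1000, 10000):
                replicated = 0
                for _ in range(n_sims):
                    x = rng.normal(size=(n, n_predictors))  # pure-noise "predictors"
                    y = rng.normal(size=n)                  # pure-noise outcome
                    half = n // 2
                    # Exploration half: run all 20 correlations, keep the smallest p-value.
                    p_explore = []
                    for j in range(n_predictors):
                        _, p = stats.pearsonr(x[:half, j], y[:half])
                        p_explore.append(p)
                    best = int(np.argmin(p_explore))
                    # Validation half: rerun only that one selected test.
                    _, p_valid = stats.pearsonr(x[half:, best], y[half:])
                    replicated += p_valid < alpha
                print(f"n = {n:6d}: selected 'finding' also significant "
                      f"in validation {replicated / n_sims:.0%} of the time")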

        • Ian wrote: “What happens in practice is that not all tests are reported (e.g. a subgroup comparison would have been reported only if significant) and outcomes are defined to get maximal significance (e.g. defining the time scale or degree of improvement in a way to maximally differentiate between control and treatment). This, which Andrew calls the garden of forking paths, is the heart of the replication crisis.”

          It’s not just the number of formal tests that are reported — the garden also includes the informal comparisons that led to doing more tests. For example:

          Suppose a group of researchers plans to compare three dosages of a drug in a clinical trial. There is no pre-planned intent to compare effects broken down by sex, but the sex of the subjects is routinely recorded. The pre-planned comparison shows no statistically significant difference between the three dosages when the data are not broken down by sex. But, since the sex of the patients is known, the researchers decide to look at the outcomes broken down by combination of sex and dosage, notice that the results for women in the high-dosage group look much better than the results for the men in the low-dosage group, and perform a hypothesis test to check that out. So the researchers have informally done 15 unplanned comparisons, not just one (even though there is only one unplanned formal hypothesis test performed): there are 3×2 = 6 dosage-by-sex combinations, and hence (6×5)/2 = 15 pairs of dosage-by-sex combinations.
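
          For what it’s worth, that count is just the number of pairs of dosage-by-sex cells, which a two-line check confirms (Python, purely illustrative labels):

            from itertools import combinations, product

            cells = list(product(["low", "medium", "high"], ["F", "M"]))  # 3 dosages x 2 sexes
            print(len(cells), "cells,", len(list(combinations(cells, 2))), "pairwise comparisons")
            # -> 6 cells, 15 pairwise comparisons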

        • Ian,

          RE: I think we could fix this fairly easily while still allowing researchers full freedom to explore and craft their analysis in response to the data they get. We could simply require that the data be split into an exploration set and a validation set and blind the researcher to the validation set until submission for publication. Upon submission, the blind is broken and the same analysis that was decided upon in the exploration set is run on the validation set (ideally by an independent party).
          ——-
          I concur with this fix. What surprises me is that such strategies get any pushback.

        • Thanks. To be fair, I’m not sure that this strategy has been proposed anywhere. It does seem pretty intuitive to me.

          “This Dad Discovered One Trick To Defeat The Replication Crisis. Click Here to Find Out What He Did!”

        • Woops: Formatting ate the punch line.

          “This Dad Discovered One Trick To Defeat The Replication Crisis. Click Here to Find Out What He Did!”

          [narrator: It was Replication]

  9. RE: Publish his data. His raw data. Raw raw raw. All of it, along with a complete description of his data collection and experimental protocols, and enough computer code to allow outsiders to do whatever reanalyses they want.

    This type of proposal surfaces from time to time in the meta-analysis literature without ever getting much traction. It subverts the whole model in which papers are academic contributions, judged as such by peers and editors, which universities can then use to evaluate faculty and make prestige claims to further the university’s interests.

    The github suggestions made by Ed Hagen might address part of this.

    • Keith:

      Sure, I can believe this will never happen. But, again, I was writing the above post, not with the intention of reforming science, but just giving a suggestion to journal editors who didn’t seem to see any options other than either publishing the paper as is, or rejecting it entirely. Perhaps it could be useful for them to have this third alternative. The worst that happens to the journal is that the author refuses the offer, and then they’re off the hook.

  10. But then… why journals?

    If we wish to publish raw data separately from the analysis, why should we publish it in journals? Could two peers and an editor tell you anything new about your own dataset? Does publication status add more validity to raw numbers?

    In this particular case, would JPSP rejecting the paper and telling the author “go publish your data on your blog or somewhere” be identical to Option C?

    • Quote from above: “But then… why journals?”

      Journals (and editors and peer-reviewers) 1) make it possible to make lots of money, and 2) make it possible to manipulate science.

      Both things may be appreciated by some folks.

        • Quote from above: “There at least some reasons to believe papers published in journals are more reliable then random blog post.”

          It could be, or not, probably depending on the specifics of the situation. In general, i see many problematic issues with the journal-editor-peer-review model of current science. I can only see them possibly being solved when the journal-editor-peer-review system is removed from science.

          Possibly see the following 2 papers about the possibly problematic nature of the journal-editor-peer-review system:

          1) Smith, R. (2006). “Peer review: a flawed process at the heart of science and journals” https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1420798/

          2) Crane, H. & Martin, R. (2018) “In peer-review we (don’t) trust: How peer review’s filtering poses a systematic risk for science” https://www.researchers.one/article/2018-09-17

        • Mikhail,
          Raw data can be pre-registered. I’m not sure how one can devise optimal access criteria. Raw data can tell you a lot, I would speculate. It’s important particularly in light of the ‘noise’ factor.

  11. In my experience, journals rarely reject on theoretical grounds (i.e., a mismatch between data and theoretical prediction). This is because people often re-write the theory section based on the results (HARKing) or add bits of theory based on post-hoc reviewer suggestions. It often feels like you could rely on dozens of competing explanations, each about as acceptable as the last.

    But … I think that these logical concerns should really be taken more seriously. For example, let’s say I have a crackpot theory that lightbulbs are powered by fairies that are excited by light switches being flicked to a particular position. I run a bunch of trials on various lightbulbs and find that the majority of the time, when I flick a light switch the lightbulb turns on. It’s a very strong relationship. I publish those data and argue that it supports my theory.

    Now someone should rightly argue that lightbulbs work by an entirely different mechanism. But within the current rules of the publishing game, I could respond by saying: “It’s a very replicable finding. Anyone can do this simple experiment. Try it yourself, if you flick a lightswitch on you’ll find this effect.” Hell, I could do a pre-registered replication of that study and guarantee the findings again.

    But, the conclusions are wrong because the data using this experimental paradigm (i.e., recording light switching events) cannot possibly provide evidence for this theory (i.e., that they are powered by fairies). I could run those experiments all day long, but my conclusions would still be wrong. Bem’s study was wrong for a similar reason: there is no logical mechanism to explain WHY ESP is happening. I think that a logical explanation does not necessarily mean a pattern of results is correct, but an illogical (or absent) explanation should be something that precludes publication.

    I’m glad in retrospect that Bem’s study was published because of the energy it brought the open science movement. But today, if a similar study were to come through an editor’s desk, it probably should be rejected for lack of a plausible/logical mechanism for the effect.

  12. Under the Option C hypothetical scenario, would the published “article” be simply a data dictionary of some kind along with a methods precis?

    Because the raw data per se would (for the majority of useful studies) not be the kind of thing you’d want to print out value for value. In the general case it would be somewhere in the range from too large to print out up to truly huge.

    If we imagine a world in which Option C becomes the norm, I’d think the raw data would always be “published” by making it available for download on a web site somewhere. In that case, wouldn’t the vestigial written “article” eventually evolve to have the author’s gloss on some sort of initial analysis? Otherwise the journal becomes just a list of README file printouts and web links.

    I’d think what would end up in practice looks a lot like a conventional research article with a simple requirement to make the raw data freely available. It wouldn’t make sense to publish a link to the data with no commentary at all from the data’s owner.

    • Brent:

      Yes, the publication would be the data and codebook in a repository, along with a short article describing exactly how the data were collected (all the survey forms, experimental protocol, etc) and explaining why the data are important. So the article would need to discuss the importance of the question being asked, address the existing literature, talk about future work, etc. Data summaries and data analyses would be fine, but the point of the data-focused article would be the data. In particular, if the data are deemed potentially important, it would not be required that some exciting conclusion be drawn.

      This is similar to preregistration, where you get the article accepted for publication based on the research design, with the analysis coming later. But I think the idea is more general.

      So, again, the article in the data-focused journal is not just a list of links. It would have to demonstrate the importance of the research and situate it in the literature, just like any other scientific article. The point is that if you’re an experimenter with cool ideas and cool data, you can share all that with the world, without having to follow current standard practice, which is to (a) make strong conclusions not really supported by your data, and (b) hide your raw data so nobody else can reanalyze it.

      This is an important point, maybe deserving its own blog post or manifesto.
