Daryl Bem and Arthur Conan Doyle

Daniel Engber wrote an excellent news article on the replication crisis, offering a historically-informed perspective similar to my take in last year’s post, “What has happened down here is the winds have changed.”

The only thing I don’t like about Engber’s article is its title, “Daryl Bem Proved ESP Is Real. Which means science is broken.” I understand that “Daryl Bem Proved ESP Is Real” is kind of a joke, but to me this is a bit too close to the original reporting on Bem, back in 2011, where people kept saying that Bem’s study was high quality, state-of-the-art psychology, etc. Actually, Bem’s study was crap. It’s every bit as bad as the famously bad papers on beauty and sex ratio, ovulation and voting, elderly-related words and slow walking, etc.

And “science” is not broken. Crappy science is broken. Good science is fine. If “science” is defined as bad articles published in PPNAS—himmicanes, air rage, ages ending in 9, etc.—then, sure, science is broken. But if science is defined as the real stuff, then, no, it’s not broken at all. Science could be improved, sure. And, to the extent that some top scientists operate on the goal of tabloid publication and Ted-talk fame, then, sure, the system of publication and promotion could be said to be broken. But to say “science is broken” . . . . I think that’s going too far.

Anyway, I agree with Engber on the substance and I admire his ability to present the perspectives of many players in this story. A grabby if potentially misleading title is not such a big deal.

But what about that Bem paper?

One of the people who pointed me to Engber’s article knows some of the people involved and assured me that the Journal of Personality and Social Psychology editor who handled Bem’s paper is, and was, no fool.

So how obvious were the problems in that original article?

Here, I’m speaking not of problems with Bem’s theoretical foundation or with his physics—I won’t go there—but rather with his experimental design and empirical analysis.

I do think that paper is terrible. Just to speak of the analysis, the evidence is entirely from p-values but these p-values are entirely meaningless because of forking paths. The number of potential interactions to be studied is nearly limitless, as we can see from the many many different main effects and interactions mentioned in the paper itself.
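As a toy illustration of why forking paths make these p-values meaningless (my own simulated numbers, nothing from Bem’s data): under the null hypothesis a p-value is uniform on [0, 1], so a researcher who is free to choose among, say, ten analyses and report whichever comes out significant will find “significance” far more often than the nominal 5%, even when nothing is going on.

```python
# Under the null hypothesis a p-value is uniform on [0, 1].
# If a researcher can choose among k analyses (forking paths) and
# reports whichever one comes out significant, the chance of a
# "significant" finding is far above the nominal 5%.
import random

random.seed(1)

def chance_of_significance(k, alpha=0.05, n_sim=100_000):
    """Fraction of null studies with at least one p < alpha among k looks."""
    hits = 0
    for _ in range(n_sim):
        if min(random.random() for _ in range(k)) < alpha:
            hits += 1
    return hits / n_sim

print(chance_of_significance(1))   # ~0.05: one pre-specified analysis
print(chance_of_significance(10))  # ~0.40: ten forking paths
```

And the researcher never has to see it as cheating: each of the ten analyses, considered on its own, looks like a perfectly ordinary test.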

But then the question is, how could smart people miss these problems?

Here’s my answer: It’s all obvious in retrospect but wasn’t obvious at the time. Remember, Arthur Conan Doyle was fooled by amateurish photos of fairies. The JPSP editor was no fool either. Much depends on expectations.

Here are the fairy photos that fooled Doyle, along with others. The photos are obviously faked, and it was obvious at the time too. Doyle just really really wanted to believe in fairies. From everything I’ve heard about the publication of Bem’s article, I doubt that the journal editor really really wanted to believe in ESP. But I wouldn’t be surprised if this editor really really wanted to believe that an eminent psychology professor would not do really bad research.

P.S. I wrote the post a few months ago and it just happened to appear the day after a post of mine on why “Clinical trials are broken.” So we’ll need to discuss further.

P.P.S. Just to clarify the Bem issue, here are a few more quotes from Engber’s article:

Even with all that extra care, Bem would not have dared to send in such a controversial finding had he not been able to replicate the results in his lab, and replicate them again, and then replicate them five more times. His finished paper lists nine separate ministudies of ESP. Eight of those returned the same effect.

Bem’s paper has zero preregistered replications. What he has are “conceptual replications,” which are open-ended studies that can be freely interpreted as successes through the garden of forking paths.

Here’s Engber again:

But for most observers, at least the mainstream ones, the paper posed a very difficult dilemma. It was both methodologically sound and logically insane.

No, the paper is not methodologically sound. Its conclusions are based on p-values, which are statements regarding what the data summaries would look like, had the data come out differently, but Bem offers no evidence that, had the data come out differently, his analyses would’ve been the same. Indeed, the nine studies of his paper feature all sorts of different data analyses.

Engber gets to these criticisms later in his article. I just worry that people who just read the beginning will take the above quotes at face value.

48 thoughts on “Daryl Bem and Arthur Conan Doyle”

  1. For the rest of that semester and into the one that followed, Wu and the other women tested hundreds of their fellow undergrads. Most of the subjects did as they were told, got their money, and departed happily. A few students—all of them white guys, Wu remembers—would hang around to ask about the research and to probe for flaws in its design. Wu still didn’t believe in ESP, but she found herself defending the experiments to these mansplaining guinea pigs. The methodology was sound, she told them—as sound as that of any other psychology experiment.

So, not sound at all. Why is Bem hiring people who have never even heard of ESP? It seems strange to me for someone to be college age and never have heard of it.

    Also it is great that criticizing NHSTers will be called mansplaining now.

    • Nitpick warning: I think your last sentence is a pretty strained reading of the quoted paragraph. Most of the Slate article’s readers are likely on board with questioning research design, and the article’s tone, while admirably restrained, doesn’t exactly paint Bem and his team in a great light. Consider the same research assistant’s role in the section on porn later in the article:

      “Not long after she was hired, Jade Wu found herself staring at a bunch of retro pornography: naked men with poofy mullets and naked girls with feathered hair. […] Wu didn’t want to say out loud that the professor’s porno pictures weren’t hot, so she lied: Yeah, sure, they’re erotic.”

      I think Engber’s point in spending time on Wu was to tell a (broadly correct) side story about the uncomfortable positions that junior women in academia find themselves in. The male subjects who hung around appear to have been questioning the experimental design anyway, not the data analysis, which they couldn’t possibly have known about just from participating in the study. If criticizing NHST is going to be called mansplaining, it will only be because the same label is applied to any criticism of a woman’s work by a man. That’s definitely the case in some parts of some campuses, and the discourse around Amy Cuddy has taken a similar tone sometimes, but I’m unconvinced that it’s where the whole NHST/replication conversation is going.

    • “As sound as any other psychology experiment” is basically the response I get when I raise questions in comments on the posts of a psychology Ph.D. student at another, well-trafficked academic blog.

  2. I think this was my favorite quote. It echoes what has been said on this blog before about the use of data as rhetoric:

    “If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made. I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?’ ”

  3. I’ve been thinking a lot about crap science the last couple days with your Fritz post and a bioRxiv controversy I threw myself into (I can’t help causing trouble)–basically some low quality preprints that were meant to serve as a post-publication peer review of a Cancer Cell paper were posted on bioRxiv, and the Cancer Cell authors requested the preprints to be retracted. This story raises issues of screening standards at bioRxiv and what to do with crap that gets posted. If you have some time to kill my post is here: https://medium.com/@OmnesRes/crap-spotted-at-biorxiv-15eecd58be6f

    I really think there is a simple explanation for all the crap science we are seeing: there are a lot of crappy scientists. The problem is that it is so hard to determine who is a crappy scientist.

    Think about it. In what other field do you not have to demonstrate any baseline level of proficiency? To get a PhD all you have to do is drink the Kool-Aid of your advisor and p-hack until you get the results they want.

    When you publish a paper you don’t have to show any data. You don’t have to show your analysis code. You don’t have to provide detailed methods. If someone tries to raise alarms about the work the journal and university do everything possible to downplay the problems.

    It’s just so easy to be an imposter in science, you could call it an imposter’s paradise:

    Keep spending most our lives
    Livin’ in an imposter’s paradise
    Been spending most their lives
    Livin’ in an imposter’s paradise
    We keep spending most our lives
    Livin’ in an imposter’s paradise
    We keep spending most our lives
    Livin’ in an imposter’s paradise

    P.S. I hope you saw Retraction Watch’s coverage of my recent Wansink blog posts:

    As an aside, I feel kind of bad for the journals here. They got tricked by Wansink into publishing his work, then when they make the effort to issue corrections they get criticized by me (although I will eventually write a post about the journals that didn’t post corrections). Then they get played by Wansink a second time with these bogus corrections.

“I feel kind of bad for the journals here… Then they get played by Wansink a second time with these bogus corrections.”

      Don’t journals claim their purpose is as a gatekeeper so people don’t get inundated with crap? A journal that can repeatedly “get tricked” like that should just shut down.

      • It seems to me that journals mainly do 2 things:

        1. Identify research they think is interesting enough for their audience.

        2. Correct obvious problems with articles such as plagiarism (although even here it seems most journals just ignore the problem).

        Journals don’t seem interested in, or capable of, determining if the work they publish is scientifically sound. They rely on peer-reviewers for that, and as we’ve seen, that hasn’t worked. So when someone raises concerns the journals have limited resources to investigate the problems. Should they send out the concerns for peer review?

  4. This article makes me think of what Mark Palko has been writing at the West Coast Stats Views blog about how New York Times Magazine articles tend to breathlessly hype the subject in the beginning only to reveal the fatal flaws later on. The punchline of this article is that when Bem does the pre-registered replication the results are null – until he adds new, not pre-registered analyses! Despite this, the tone of most of the article implies that there were no methodological issues – compare the author stating that Bem had “more rigor than anyone had before” and his paper was “methodologically sound”, while E.J. Wagenmakers only “believed the paper had at least one glaring problem”. Or the juxtaposition of how Ross’s research shows that people “cling to their beliefs in the face of any challenge” and Wagenmakers becoming physically unwell just by reading Bem’s paper. Also, after Wagenmakers starts reading the paper and getting sick, we get 8 paragraphs on methodology and the friendship of Bem and Lee Ross before Wagenmakers “finally” manages to finish it – this Wagenmakers character doesn’t sound very trustworthy at all!

  5. I have trouble grasping Bem’s work and that is not a compliment. As I see it, he’s testing a choice function over a defined set where the function is apparently random, meaning all contextual entanglements have been removed except for the set definition, which one can visualize as an array or as a matrix, etc. depending on whether you want to infer directional or existential statements. He seems to have decided that multiple iterations – call that n – sufficiently tests the choice function, but I don’t see how that directly translates into the concept advanced that this choice function is not entirely random despite being labeled as such. That is, the choice function may be within individuals so you’d need many iterations per individual, call that N, just to start an analysis. By contrast n tells you there is a range of results without telling you there’s anything more to variations within any summary mapping or counting of n, like 18 tails in a row and eventually the coin flips approach even. This then connects to N: you can’t just add up n but rather need to consider each set of n as an N, as an individual, so you have both actual individual N and a group N’. At N’ level, N effects may disappear. Or no effect at N may appear in N’, which would indicate that individual group N’ got lucky. If you looked at N, my first question beyond the size of N (per individual since N = trials per individual) would be whether the defined set was varied. That is, does an effect appear only when the defined set is sufficiently restricted, like 1 of 5 versus 1 of 10,000 choices of equal chance, because the former may skew more?

    These are invariance tests. A lot of what you argue for is whether the result is invariant under perspective: is it true only if you tilt your head this way then that, following the exact path you need, or is it only true if you assume all these other things are true, which is I’d say the big sin of so much economic modeling (that you take this value as though it represents a stable outcome for the complex functions you’re controlling for). Bem’s work is not invariant under perspective when you describe basic issues associated with evaluating a choice function over a defined set: there are various forms of ‘n’ groups – where ‘n’ means n, N and N’ – and each needs to be evaluated to accurately describe any variance between the postulated choice function’s predicted and actual results.

    The logic has to go both directions. So if you claim N is true, meaning individuals have ESP, you then have a range for N, and if you claim there is ESP over n, you must isolate whether that ‘effect’ is due to variation in N’ or within N. I have trouble thinking there could be a claim ESP is true for N’ other than by artificially stratifying the winners as though there are no losers. I don’t see that Bem understands these relationships or that there are chains and groups within any ‘n’. I can understand this because choice functions are difficult to understand given their odd existence within the conceptions of set theory, but in this case he’s literally describing a choice function: choose from a defined set with evaluable characteristics related to size or scope and the degree to which set elements are actually fungible (meaning any less than apparent non-randomness), even if the defined set is presumed random. It’s a version of reaching into a sock drawer to get a pair: are the socks together or adequately shuffled, how many colors and patterns, etc.

  6. Engber is talking nonsense when he says ESP is “logically insane”: ESP may exist or not, but there is absolutely nothing “illogical” about the possibility that it might exist. The only way to justify his claim is to think that only materialism makes any logical sense: but there is no logical basis upon which to defend such a claim.

At the time Bem’s article was reviewed, I think that even the editor of JPSP was really not sufficiently appreciative of the power of data analytic flexibility. Back then even I doubted that people could be getting 5 studies supporting affective priming unless it was true. The Bem article proved, beyond reasonable doubt, that data analytic flexibility, along with study reporting bias and determination, was enough to prove anything.

    • Michael:

Yes. What’s stunning in retrospect is how (a) at the time, the Bem paper looked like standard practice, maybe nothing special but nothing horrible either; but (b) in retrospect, Bem’s paper looks just terrible. It’s amazing how the problems just jump out, once we know what to look for. It’s like one of those vision tests the eye doctor gives you, where once you put on the 3-D glasses the images just leap off the page.

      This was something I was trying to convey in my “What has happened down here is the winds have changed” article.

      What’s really weird is reading all the comments by Fritz Strack in this recent thread, in that he’s still stuck in 2010, defending unreplicated studies that are full of forking paths (see, for example, this comment). We’re all woke and this dude’s still sleepwalking. And it seems that screaming in his ear is not gonna wake him up.

      Bem too, I guess, but somehow Bem doesn’t bother people as much, perhaps because he doesn’t really seem to be trying to argue his case; at this point he’s pretty much operating outside science entirely.

      • What I don’t understand is that psychologists of all people should know “how the sausage is made”. Or are the heads of the labs so far removed from the study design, data collection, and analysis that they have no inkling that there could be a problem?

        At least in biology I could see this as being a legitimate excuse–what may become known as the “Trump/Wansink Defense”: I had no knowledge of what was happening in my own campaign/lab. Famous investigators in biology are constantly traveling giving talks, have labs with dozens of members, and as a result have very little knowledge of what is actually going on.

        Even then, it surprises me that people like Fritz can’t seem to grasp the concept of the file drawer problem. Are they really so blinded in their thinking that they are able to look at any study as a successful replication and can’t fathom the possibility that there are failed replications out there that never got published?

        Whatever the case, as a non-psychologist it’s very entertaining for me to watch all this drama unfold. It’s not every day you get to watch people slowly realize their entire life’s work has been a waste of time.

        • “It’s not every day you get to watch people slowly realize their entire life’s work has been a waste of time.”

This is one of my hypotheses as to why (at least it seems) senior researchers are so reluctant to improve matters, and (sometimes) even appear to stand in the way of it (e.g. by simply not teaching students correctly).

          Another hypothesis is that they belong to a secret society aiming to control the general population by sending out press-releases based on their research which always boils down to how everything around them controls them, and how they have no free will, and how they should not think/reason but follow their intuition, etc. :P

        • Anonymous,

          Your first hypothesis makes sense to me as one contributing factor to the problem.

          As for your second hypothesis — I hope you realize you need to vary which side of your cheek you put your tongue in; otherwise your tongue becomes dysfunctional for its primary purposes.

          However, I think that the “follow their intuition” part is something that needs to be taken seriously. I don’t mean that I advocate “follow your intuition” (I don’t advocate it!), but that many people do believe in that maxim. Dealing with them can be really difficult. You either need to do the equivalent of bopping them over the head with a frying pan, or do the persuasion with oodles and oodles of empathy and tact, or sometimes enlist peer pressure, or find someone they idolize to give the criticism.

          Another contributing factor seems to be what the people who care about it call “loyalty” — to a group, or to individuals, or to beliefs (e.g., “trust your intuition” could be one such belief).

        • Jordan:

          What’s scary is that people like Strack, Bem, Bargh, etc., will take unsuccessful replications or even unrelated experiments and act as if they are successful replications. It’s beyond the file drawer.

        • I think we need to be very open to the possibility that these people just aren’t very smart. Here is Susan Fiske showing that she has no idea how statcheck works:

“Some people have set up sort of ‘gotcha’ algorithms that apparently crawl through psychology articles and look for fraudulent p-values [a measure of the likelihood that experimental results weren’t a fluke]. But they’re including rounding errors that don’t change the significance levels of the results, and they’re doing it anonymously.”

Or perhaps she knows how statcheck works but is purposely trying to misrepresent statcheck so that people won’t take it seriously. It’s so hard to tell if people are being devious or if they are just stupid, cf. Wansink.

        • Jordan:

          Regarding the Fiske quote, I have no idea but my guess is not that she does not have the ability to understand what statcheck does, but rather that (a) she didn’t feel like putting in the effort to understand it, but (b) she felt free to criticize it.

          The whole “fraudulent” thing is just a red herring on Fiske’s part, I think. Just for example, consider the erroneous statistics that have been found in some of Fiske’s papers. Nobody’s claiming any knowledge that these errors represent fraud: that would be very hard to know. All anyone’s saying is that they are mistakes. And rounding errors can be a problem too, in that they introduce additional researcher degrees of freedom, but that’s another story, and again it does not imply fraud.

          Stepping back, it does seem that some people work extra hard to avoid trying to understand potential problems with their published work. For example, many years ago I did my best to explain to Satoshi Kanazawa that his studies of sex ratio were hopeless, that he would literally need 100 times the sample size to discover anything at all. This is just math. It’s not trivial math, and it could be beyond Kanazawa’s abilities—that I have no idea—but it was certainly not beyond his abilities to find a colleague who could explain it to him.
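The “100 times the sample size” claim is just the standard power calculation: for a two-sided test at fixed alpha and power, the required n scales as 1/δ², where δ is the standardized effect size, so an effect ten times smaller than you assumed requires a hundred times the sample. A minimal sketch with illustrative numbers (not Kanazawa’s actual data):

```python
# Required per-group n for a two-sided z-test:
# n ≈ ((z_{1-α/2} + z_{power}) / δ)², with δ the standardized effect size.
from statistics import NormalDist

def required_n(delta, alpha=0.05, power=0.80):
    """Approximate per-group sample size via the normal approximation."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    zb = z.inv_cdf(power)          # quantile for desired power
    return ((za + zb) / delta) ** 2

n_big   = required_n(0.50)  # a large effect: ~31.4 per group
n_small = required_n(0.05)  # an effect 10x smaller
print(n_big, n_small, n_small / n_big)  # the ratio is exactly 100
```

The ratio of 100 is exact because n depends on the effect size only through 1/δ²; no amount of data-analytic cleverness changes that arithmetic.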

          Perhaps there’s some tradeoff between stupidity and dishonesty: the smarter you are, the more intellectually dishonest you have to be to avoid confronting your problems. Conversely, if you don’t really understand what you’re doing, you can walk around in a haze, and maybe there’s something more honest about that. That’s been my impression, for example, with newspaper columnist David Brooks, who notoriously refuses to correct mistakes and errors in his columns and books. Brooks seems genuinely clueless, as if he can’t possibly run a correction because he has no ability to evaluate the claims that he so confidently publishes.

In the paper “Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses?” (Moher, D., et al., Lancet, 1998), the authors argued something that is equivalent to claiming that the weather’s temperature is more stable when measured on the Celsius scale than when measured on the Fahrenheit scale.

I spent a lot of time with pretty much all the authors of the paper, in person and by email, and Sander Greenland and I tried to give an obvious explanation in print here https://academic.oup.com/biostatistics/article-lookup/doi/10.1093/biostatistics/2.4.463

          But given the first author has done an awful lot of good work to improve clinical research (e.g. http://metrics.stanford.edu/about-us/bio/david-moher ) it would be reasonable to think they just don’t get it.

Now the paper has almost 3000 citations and is still regularly being cited :-(

        • I guess I found other aspects of Fiske’s quote problematic. First, what is being done “anonymously”? Sure, statcheck was automated and the results were posted to PubPeer, but it’s not like this is some secretive hacker organization, we know who developed statcheck and you can contact them about problems, or comment on the PubPeer report.

          Second, I guess rounding errors will be picked up by statcheck, but they would have to be blatant errors as statcheck allows generous rounding. It’s impossible to know for sure what percentage of statcheck errors are “rounding errors”, as they could just as likely be typos, miscalculations, or even fraud.

          They found that 1 in 8 errors affected the significance of the result, so presumably those weren’t rounding errors. It’s unlikely, but maybe the other 7 out of 8 are indeed due to rounding incompetence. I think you could argue this is still fairly concerning, as it indicates people are rounding by hand instead of using code, and rounding poorly at that. From my experience it seems pretty clear Wansink rounds by hand, so you might say rounding by hand (and doing it poorly) can be seen as a proxy for poor work, and I think it is worth reporting these errors as it may indicate we might want to take a closer look at the other numbers in the paper and perhaps reanalyze the data (if available).

          Third, statcheck isn’t a “gotcha” algorithm. They didn’t honeypot the psychologists into reporting wrong numbers. They aren’t catching people making who/whom errors. These are legit mistakes. But I guess psychologists don’t really view numbers as very important.

          One of the best uses of statcheck is to check your own papers before submitting them to catch any mistakes you may have. I think a common one is copying and pasting the wrong statistic from STATA/SAS/SPSS output, but who knows.
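For what it’s worth, the consistency check statcheck performs can be sketched in a few lines: recompute the p-value from the reported test statistic and flag a mismatch only when no rounding of the recomputed value could yield the reported one. This is a simplified z-statistic version of my own (the real tool parses APA-formatted t, F, chi-square, r, and z results and also allows for rounding of the test statistic itself):

```python
# Simplified sketch of a statcheck-style consistency check for a
# reported z statistic (the real tool also handles t, F, chi2, r,
# and is more generous about rounding of the statistic itself).
from statistics import NormalDist

def recomputed_p(z_stat):
    """Two-sided p-value for a reported z statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

def inconsistent(z_reported, p_reported, decimals=3):
    """Flag only if no rounding of the true p could give the reported p."""
    half = 0.5 * 10 ** (-decimals)
    return abs(recomputed_p(z_reported) - p_reported) > half

print(inconsistent(2.10, 0.036))  # False: true p ≈ .0357 rounds to .036
print(inconsistent(2.10, 0.020))  # True: not explainable as rounding
```

So “rounding errors” in the statcheck sense are, by construction, discrepancies too large to be explained by ordinary rounding of the true p-value.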

        • Another possibility for people like Fiske and Brooks is that they are in the “trust your intuition” camp. (In Brooks’ case, and possibly to a lesser extent Fiske’s, they may also be in the “take criticism like water sliding off a duck’s back” camp.)

        • “p-values [a measure of the likelihood that experimental results weren’t a fluke].”

This definition manages somehow to be both meaningless (what does “measure of the likelihood” mean?) and wrong (a p-value is computed under the assumption that the results were a fluke, not the other way around). Mass Confusion!

“people slowly realize their entire life’s work has been a waste of time.”

          Yep, it is very depressing, then frustrating, then scary to watch this play out. In the end it is usually dealt with by denial and/or kill the messenger.

      • Andrew Gelman:

        As my name is coming up repeatedly, my comments must have left a lasting impression on you.

You may not be aware that in science, the effect of screams is not measured by the noise they create, but by the arguments they may entail. Unfortunately, however, your complete ignorance about the topic of “facial feedback” leaves little to ponder on. The fact that this finding has a long history in the study of emotion (it goes back to Darwin and William James), that it has been demonstrated in different ways and many times before and after the Strack et al. (1988) study, that very recently (reported at a conference in Granada, July 2017) a group of Israeli scientists has convincingly (preregistered, high powered, data submitted to OSF) shown that the failure of the RRR “direct replication” was due to a significant deviation from the original study, all of this does not interest you and does not prevent you from attacking me and my work in slanderous ways.

        Fortunately, I have a thick skin and your attacks don’t disturb my night’s sleep. But I like to argue. So you won’t get rid of me, unless you exclude me from your blog (which would not surprise me).

        • Fritz,

I understand that you might not expect a fair opportunity to reply or even lightly troll (technical internet term) on Andrew’s blog. But I believe Andrew when he says “In nearly 15 years of blogging I think I’ve deleted fewer than 5 comments based on content when people are extremely rude.” http://statmodeling.stat.columbia.edu/2017/07/01/no-im-not-blocking-deleting-comments/

But speaking as a social psychologist, I have to say I’m pretty disappointed by both your and Susan’s responses to criticism, both internal to the field and from those external to it, like Andrew. I expected better of people who have been leading the field. Frankly, I expected something like Dan’s reply on Uli’s blog. To quote Dan:

“The argument is inescapable: Studies that are underpowered for the detection of plausible effects must occasionally return non-significant results even when the research hypothesis is true – the absence of these results is evidence that something is amiss in the published record. Furthermore, the existence of a substantial file-drawer effect undermines the two main tools that psychologists use to accumulate evidence for a broad hypothesis: meta-analysis and conceptual replication. Clearly, the experimental evidence for the ideas I presented in that chapter was significantly weaker than I believed when I wrote it. This was simply an error: I knew all I needed to know to moderate my enthusiasm for the surprising and elegant findings that I cited, but I did not think it through. When questions were later raised about the robustness of priming results I hoped that the authors of this research would rally to bolster their case by stronger evidence, but this did not happen.

          I still believe that actions can be primed, sometimes even by stimuli of which the person is unaware. There is adequate evidence for all the building blocks: semantic priming, significant processing of stimuli that are not consciously perceived, and ideo-motor activation. I see no reason to draw a sharp line between the priming of thoughts and the priming of actions. A case can therefore be made for priming on this indirect evidence. But I have changed my views about the size of behavioral priming effects – they cannot be as large and as robust as my chapter suggested.

          I am still attached to every study that I cited, and have not unbelieved them, to use Daniel Gilbert’s phrase. I would be happy to see each of them replicated in a large sample. The lesson I have learned, however, is that authors who review a field should be wary of using memorable results of underpowered studies as evidence for their claims.”


Dan’s position to me is eminently sensible, pragmatic, and an exemplar of the kind of leadership I expected from other prominent social psychologists.

We can do better and it’s a disappointment when we do not. It’s even more of a disappointment to see leadership fail so consistently in setting norms that make for a better psychological science.

Fritz, given that you like to argue, I hope you will respond somewhere to the many substantive criticisms of “From Data to Truth in Psychological Science”. There are three critical comments below the article on Frontiers. I see you posted the same “keep your gunpowder dry” comment on Neuroskeptic’s blog (http://blogs.discovermagazine.com/neuroskeptic/2017/06/09/data-truth-null-results/), but as of yet have not responded to any of the criticisms made in the original post or in the comments section. There’s also Erickson’s response yesterday to your list of successful demonstrations of the pen procedure: http://statmodeling.stat.columbia.edu/2017/07/08/bigshot-psychologist-unhappy-famous-finding-doesnt-replicate-wont-consider-might-wrong-instead-scrambles-furiously-preserve-theories/#comment-524450

          I doubt many of the people criticizing your article for Frontiers care one way or the other about facial feedback. What I see are criticisms of your approach to using data as evidence, and these are the criticisms that you don’t seem to be engaging with.

        • No response from Gelman. It is on this blog where arguments and empirical evidence don’t count.
          I’m ready for the next slur.

        • Prof. Strack, I appreciate that you are spending some of your time here on Andrew’s blog responding to comments and posts. And I guess it is not an easy position, given that it is your research that is being criticized — it’s easy to adopt a reactive stance.

          But, please, do not use ad hominem arguments, saying that people on this blog don’t give a damn about empirical evidence. As an assiduous reader, I am pretty confident that everyone here takes empirical evidence very seriously. That is the very reason why your defense was criticized: your first reaction in the Frontiers article is to diminish the impact of the replication study. You even evoke Popper to argue that results must be critically evaluated by experts in the field, as if ad hoc evaluations of unstable experimental results should be given primacy over the results of observations derived from risky hypotheses.

          I’m certainly no expert in the field of the facial feedback hypothesis. But I do remember when I first heard about your famous 1988 paper in a Social Psy class. I was pretty impressed by it, but I certainly didn’t have the statistical knowledge at the time to evaluate it more carefully — and neither did my professor. In fact, studying statistics has opened my eyes to the questionable research practices that are so commonplace in our field and accepted as good-enough science.

          In fact, I do believe some liberal use of statistics is already present in your original paper, as I tried to argue in the other post. I also would like to know how to ‘critically evaluate’ such a diverse range of results in your list of 20 studies that ‘demonstrated the predicted effect’. I looked at three at random and found some results that didn’t demonstrate the effect at all! But you seem to conclude that they do, so I would like some enlightenment on how to aggregate them.

          Could you also please share the link to the pre-registration of the study by the Israeli team? And for the camera-is-a-moderator study, too. Even if the results aren’t ready for publication, the pre-registration should be public, I guess.

        • Prof. Erikson,
          I appreciate your concern about ad hominem arguments. However, it is my impression that this predominantly applies to the owner of this blog, who called me a “Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories”. This may be the level of the current interactions, but it does not promote a fair and balanced exchange.

        • Fritz:

          Some relevant discussion of your work comes from Kimmo Eriksson here, also further discussion by me at that link (in particular, discussing the mistake of considering only one phenomenon at once), also this paper by Wagenmakers et al., also this news article by Engber, also this discussion by Neuroskeptic, also various comments on this blog such as this from Erikson who specifically discusses some of the papers you cited.

          From a statistical perspective, the problem is that “p less than .05”—which has conventionally been considered to be strong evidence of an effect—is not so strong at all, as we’ve learned from many many examples over the past ten years, including the work of Daryl Bem, the claim that beautiful parents are more likely to have girls, the claim that elderly-related words cause people to walk more slowly, the claim that women were 20% more likely to vote for Barack Obama during a certain time of the month, and so on and so forth. The problems with overinterpretation of statistically significant p-values have been explained in a series of papers by Simonsohn, Nosek, and others.
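          To make the forking-paths point concrete, here is a minimal simulation of my own (an illustration, not code from the original discussion). Under a true null hypothesis a p-value is uniformly distributed, so a researcher with, say, 20 potential comparisons available will see at least one “p less than .05” most of the time, even when there is no effect at all:

```python
# Illustrative sketch: under a true null, each p-value is Uniform(0,1).
# With k potential comparisons, the chance that at least one comes out
# below .05 is 1 - 0.95**k, which grows quickly with k.
import numpy as np

rng = np.random.default_rng(0)

def chance_of_significance(k, n_sims=100_000):
    """Fraction of simulated null studies in which the smallest of k
    independent p-values falls below .05."""
    p = rng.uniform(size=(n_sims, k))
    return (p.min(axis=1) < 0.05).mean()

print(chance_of_significance(1))   # about 0.05
print(chance_of_significance(20))  # about 0.64, i.e. 1 - 0.95**20
```

          With 20 comparisons in play, “statistical significance” is the expected outcome of a null experiment, not evidence against the null.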

        • Fritz,

          An ad hominem attack is:

          “Refuting an argument by attacking some aspect of the person making it, rather than addressing the content of the argument itself.

          Ad hominem is very often mistakenly claimed in cases where an argument’s opponent attacks its proponent in addition to presenting a valid counterargument. “You’re stupid, therefore your argument is invalid” is an ad hominem; “your argument is invalid, therefore you’re stupid” (or “Your argument is invalid and you’re stupid”) is not.”

          So let’s break this statement that you claim is an ad hominem attack down by the clauses:

          1) Bigshot psychologist,
          2) unhappy when his famous finding doesn’t replicate,
          3) won’t consider that he might have been wrong;
          4) instead he scrambles furiously to preserve his theories.

          None of those clauses separately or together carry the form of “you’re stupid therefore your argument is invalid.” As the quote illustrates, it is not an ad hominem attack to present a cogent argument in a derisive manner, which for the record I don’t believe Andrew did here.

          But you do veer close to Cardinal Aringarosa territory to claim you’re under an ad hominem attack and yet provide no substantive reply to your critics. Facile appeals to authority (like James and Darwin) do not address the points summarized and linked to by Ben in this thread and the points raised in the comment above by Andrew.

        • +1 to AnonAnon

          The highlights of Fritz Strack’s contributions here have been a Latin quip pissing contest, an assertion that he has thick skin followed by a demonstration of thin skin, and a refusal to address any of the actual criticisms of his article (e.g. those of his argument that “one might ask about the determinants that caused nine teams to replicate the original findings and eight teams to obtain results in the opposite direction”, or his citing of Popper when arguing that we should treat a failed critical test as uninformative). The closest we got was a list of studies purportedly demonstrating the pen in mouth effect, but he won’t answer any of the criticisms that followed, instead responding “I hope you apply these criteria to all research that you encounter” and “…and so on” to two methodological criticisms, and completely ignoring Erickson’s detailed reply regarding the content of the papers cited.

          However he does seem willing to hang around and partake in snarky back and forth banter.

        • I was not complaining. It was me who was criticised for ad hominem attacks.
          My response was “Sauce for the goose, sauce for the gander.”

  8. I don’t get this analogy to Doyle. Bem did some research experiments according to customary practices, and published some p-values. If you believed in ESP and p-values, maybe you found the paper interesting. Nothing was faked. Doyle did not do any research, and apparently he was tricked by fakery.

    Bem’s paper seems to have been a useful contribution to the literature. He showed what could be done with generally-accepted practices in the field.

    • Roger:

      The connection is as follows:

      Doyle thought those photos represented real evidence of fairies, but to a modern eye it’s obvious that those photos are no evidence at all.

      Bem thought his experiments represented real evidence of ESP, but to a modern eye his paper is laughably bad and represents no evidence at all.

  9. “I do think that paper is terrible. Just to speak of the analysis, the evidence is entirely from p-values but these p-values are entirely meaningless because of forking paths. The number of potential interactions to be studied is nearly limitless, as we can see from the many many different main effects and interactions mentioned in the paper itself.”

    Of course, Bem says there was no possibility of forking paths, because his hypotheses were fixed in advance. Generally, he claims there was a single hypothesis for each of the experiments reported in his paper. For that reason, he doesn’t report alternative “main effects”, only secondary effects such as gender differences, or correlations with the “stimulus seeking” measure.

    You may think he is lying about this, but my problem with this explanation is that for most of the experiments reported in his paper, there does appear to be a single obvious hypothesis corresponding to a time-reversed version of a well-known conventional effect. Granted, in the case of Experiment 1, there is obviously some scope for alternative hypotheses. But for the others, I find it hard to see these supposed forking paths.

    To take a concrete example, in Experiment 2, the subjects were asked which of a pair of images they preferred, then one of the images was randomly selected as the target. If the subject had chosen the target image, a positive image was displayed subliminally; otherwise, a negative image was displayed subliminally. The psi hypothesis was simply that the subjects would pick the target more often than the non-target.

    The p value Bem obtained from the corresponding null hypothesis – that the target and the non-target were equally likely to be picked – was 0.009. The “multiple hypothesis” explanation would require him to have had a set of essentially independent null hypotheses to choose from. I don’t see what those hypotheses would be. In fact, given the experimental protocol, the only sensible null hypothesis I can think of that could be applied to the whole body of subjects, rather than a subset, is that the target and non-target were equally likely to be picked.

    There may well be a non-psi explanation, or combination of explanations, for Bem’s results. There might be elements of selective reporting or optional stopping, for example. But I think there is limited scope for “multiple hypotheses”, and I find it strange that this idea has been embraced so unquestioningly as the obvious explanation for Bem’s results.
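    To spell out the single test described above: for a forced-choice design like this, the obvious analysis is a one-sided exact binomial test of the hit count against chance (50%). A minimal sketch, with hypothetical counts (Bem’s actual trial numbers are not given in this thread):

```python
# One-sided exact binomial test of the hit rate against chance (50%).
# The counts below are hypothetical, for illustration only.
from math import comb

def binomial_p_one_sided(hits, trials, p0=0.5):
    """P(X >= hits) when X ~ Binomial(trials, p0)."""
    return sum(comb(trials, k) * p0**k * (1 - p0)**(trials - k)
               for k in range(hits, trials + 1))

print(binomial_p_one_sided(117, 200))  # about .01: "significant" at the usual threshold
print(binomial_p_one_sided(104, 200))  # well above .05: consistent with chance
```

    The point being that, as described, this protocol admits essentially one such test of the whole body of subjects, not a garden of alternatives.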

    • Chris:

      See section 2.2 of this paper for discussion of some forking paths in the Bem paper.

      You write, “Of course, Bem says there was no possibility of forking paths, because his hypotheses were fixed in advance.” First, I don’t know why you say “of course,” given that Bem was on record as advising researchers to keep coming up with new hypotheses until something good is found in the data. Second, as we’ve discussed in various places over the years, a scientific hypothesis can correspond to the negation of many different random-number-generator models of the world. Indeed, the whole point of our forking paths paper is that multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Bem’s paper featured large numbers of potential comparisons in addition to a lot of actual comparisons performed on the data. In my judgment his published experimental results provide roughly zero evidence of anything at all. And it’s not a good sign that preregistered replications of Bem’s claims failed, and that when he wrote a paper claiming successful replications, it was full of un-preregistered experiments, sometimes on completely different topics, sometimes published years before the paper they were supposedly replicating.

      • Andrew

        Thanks for replying so quickly.

        I take your general point that if a scientific hypothesis is formulated in a fairly abstract way, then there can still be a choice of different statistical tests (so maybe I should have said “multiple tests” rather than “multiple hypotheses”). I’m just not convinced this applies to most of the experiments Bem presented in that paper (setting aside his Experiment 1). As I said, taking the concrete example of his Experiment 2, I don’t see any obvious alternative statistical test that could be applied to the whole body of subjects, other than testing whether the number of “hits” was significantly different from 50% on the null hypothesis.

        Maybe I’m wrong, but it would help me to understand if the proponents of this explanation could actually give examples of the kind of alternative statistical tests they are thinking of.

        You refer to section 2.2 of your paper. That deals almost entirely with Bem’s Experiment number 1, which as I said previously does allow scope for alternative hypotheses, because the subjects were split into several subgroups for which different classes of images were used. But I think most (if not all) of Bem’s 8 other experiments aren’t open to that criticism. Your other suggestions essentially involve splitting the trials into other subgroups – early versus late, and men versus women. And actually, I think all your suggestions involve multiple scientific hypotheses, not multiple statistical tests of the same hypothesis (except for the question of whether Bem should have used a two-tailed, rather than a one-tailed, test).
