Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories

Kimmo Eriksson writes:

I am a Swedish math professor turned cultural evolutionist and psychologist (and a fan of your blog). I am currently working on a topic that might interest you (why public opinion moves on some issues but not on others), but that’s for another day.

Hey—I’m very interested in why public opinion moves on some issues but not on others. I look forward to hearing about this on that other day.

Anyway, Eriksson continues:

I want to alert you to this paper published yesterday in Frontiers in Psychology by Fritz Strack who published the original finding on the effect on happiness of holding a pen in your mouth to force a smile. This very famous result recently failed to replicate, and Fritz Strack is angry that some people may interpret this as the original effect not being valid. Rather (he argues) people who run replications lack the expertise and motivation to run them as well as the ground-breaking researchers who publish first on the topic. In support of his view he even cites Susan Fiske:

In their introduction to the 2016 volume of the Annual Review of Psychology, Susan Fiske, Dan Schacter, and Shelley Taylor point out that a replication failure is not a scientific problem but an opportunity to find limiting conditions and contextual effects. To allow non-replications to regain this constructive role, they must come with conclusions that enter and stimulate a critical debate. It is even better if replication studies are endowed with a hypothesis that relates to the state of the scientific discourse. To show that an effect occurs only under one but not under another condition is more informative than simply demonstrating non-effects (Stroebe and Strack, 2014). But this may require expertise and effort.

I have two problems with Strack’s argument.

First, he privileges his original, uncontrolled study, over the later, controlled replications. From a scientific or statistical standpoint, this doesn’t make sense, for reasons I explain in my post on the time-reversal heuristic.

Second, he’s making the common mistake of considering only one phenomenon at once. Suppose Strack, Fiske, etc., are correct that we should take all those noisy published studies seriously, that we should forget about type M and type S errors and just consider statistically significant estimates as representing true effects. In that case, every study is a jungle. Sure, Strack did an experiment in which people’s faces were being held in smiling positions—but maybe the results were entirely driven by the power pose that the experimenters were using, or not, when doing their study. Maybe the experimenter was doing a power pose under one smiling condition and not under the other, and that determined everything. What if the results all came about because one of the experimenters was ovulating and was wearing red, which in turn decisively changed the attitudes of the participants in the study? What if everything happened from unconscious priming: perhaps there were happiness-related or sadness-related words in the instructions which put the participants in a mood? Or maybe everyone was happy because their local college football team had won last weekend—or sad because they lost? Perhaps they were busy preparing for a himmicane or upset about their ages ending in 9? You might say that such effects wouldn’t matter if the experiment was randomized, but that’s not correct in a world in which interactions can be as large as main effects, as in the famous papers on fat arms and political attitudes (a reported interaction with parents’ socioeconomic status), ovulation and clothing (interaction with outdoor temperature), ovulation and voting (interaction with relationship status), ESP (interactions with image content), and the collected works of Brian Wansink (interactions with just about everything).

In the full PPNAS world of Strack, Fiske, Bargh, etc., we’re being buffeted by huge effects every moment of the day, which makes any particular experiment essentially uninterpretable and destroys the plan to compare the results of a series of noisy studies as an “opportunity to find limiting conditions and contextual effects.”

So, sorry, but no.

From a political standpoint there could be value in a face-saving “out” which would allow Strack to preserve his belief in the evidentiary value of his originally published statistically significant comparison, even in light of later failed replications and a statistical understanding that very little information is contained in a noisy estimate, even if it happens to have “p less than .05” attached to it. But from a scientific and statistical point: No, you just have to start over.

Here, I’ll call it like I see it. Perhaps others who are more politically savvy can come up with a plan for some more diplomatic way to say it.
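
To put rough numbers on the claim that a noisy estimate carries very little information even when it reaches p less than .05, here is a minimal simulation sketch in Python. The true effect of 0.1 and the standard error of 0.5 are made-up values chosen for illustration, not numbers from Strack's study or from the replications, and the function names are hypothetical. The sketch draws many noisy estimates, keeps only the statistically significant ones, and reports how much they exaggerate the true effect (type M error) and how often they have the wrong sign (type S error).

```python
import random
import statistics

def significant_estimates(true_effect, std_error, n_sims=100_000, seed=1):
    """Draw noisy estimates and keep those with |estimate| > 1.96 standard errors."""
    rng = random.Random(seed)
    threshold = 1.96 * std_error
    draws = [rng.gauss(true_effect, std_error) for _ in range(n_sims)]
    return [est for est in draws if abs(est) > threshold]

def type_m_and_s(true_effect, std_error, n_sims=100_000, seed=1):
    """Return (power, exaggeration ratio, wrong-sign rate) among significant results."""
    kept = significant_estimates(true_effect, std_error, n_sims, seed)
    power = len(kept) / n_sims
    exaggeration = statistics.mean([abs(est) for est in kept]) / abs(true_effect)
    wrong_sign = sum(est * true_effect < 0 for est in kept) / len(kept)
    return power, exaggeration, wrong_sign

# Hypothetical scenario: a small true effect measured with a lot of noise.
power, exaggeration, wrong_sign = type_m_and_s(true_effect=0.1, std_error=0.5)
print(f"share of studies reaching significance: {power:.3f}")
print(f"type M (exaggeration ratio): {exaggeration:.1f}")
print(f"type S (wrong-sign rate): {wrong_sign:.2f}")
```

Under these assumed values, any estimate that clears the significance bar has to be at least 0.98 in absolute value, which is already 9.8 times the true effect. The filtered estimates cannot be close to the truth no matter how the draws come out.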


  1. Jordan Anaya says:

    By listing all of the possible causes of an effect you inadvertently explained the replication crisis in biology. Most results in biology are “real” in the sense that they can be replicated by the person who obtained them, if they use the same reagents and same protocol. Whether another lab, or even a person within the same lab, can get the results is a different question, which is largely due to long, complicated experiments with poorly documented protocols. “Oh, those cells needed to be thawed on ice instead of room temp!”

    I think a cooking analogy works best. I started brewing kombucha. Let’s say I want good tasting kombucha, like that in the store. Kombucha isn’t complicated, it’s just sugar, water, tea, and a SCOBY.

    Ok, so what tea do I use? Where do I get the SCOBY? What brewing temperature is optimal?

    For the tea I’m just using whatever I already have in my cupboard. I’m really interested in if different SCOBYs brew differently, so I got my hands on 3 different ones. For the temperature I don’t have central cooling, so that’s out of my control (this is analogous to a lab with poor equipment).

    Was my first brew exactly like the one in the store? Of course not. But over time hopefully I’ll be able to play with the variables to get the best brew. That’s how biology works. Everyone knows it will take time to troubleshoot experiments, and that the results likely won’t look as good as the paper they are trying to reproduce. Is this a replication crisis? I would say it’s just indicative of a need for open science and thoroughly documenting protocols.

    Note: This is in no way a defense of psychology, which has a severe reproducibility crisis. In biology you play with variables until you get the result you want, a result which is reproducible. In psychology you collect as little data as possible and stop once you hit .05, which can’t be replicated by others, and likely not even the researchers themselves.

    • Martha (Smith) says:

      “In biology you play with variables until you get the result you want, a result which is reproducible.”

      My experience in reading biology papers is that the description of the protocol for running the biological procedure is very precise — but the explanation of the statistical methodology is very vague (often to the point of “We used SAS Proc X”, with no further explanation). So the biological procedure may be reproducible, but the published paper may give no clue as to reproducing the statistical analysis.

  2. > some more diplomatic way

    I’d say that Strack’s reaction confirms the contextual effects of a power pose.

  3. Darf Ferrara says:

    Strack’s response does sound angry. He should put a pen in his mouth.

  4. Jonathan (another one) says:

    Protection of professional reputation is the biggest thing, but don’t discount the fact that many, many scientists are like the sportswriters that Bill James discussed when he noticed that they use statistics not to learn, but to browbeat unbelievers with some numbers. (I always liked that pun… Statistics use numbers, so that, once numb, critics won’t be sharp enough to see the other problems.) Strack had a hypothesis. He thought it was true before he did the experiment (otherwise he wouldn’t have wasted his time.) He got the result he expected. Case closed. What does someone else not getting the effect he got have to do with his brilliance in the original hypothesis? Nothing!

    • Garnett says:

      I guess it makes sense if you contrast the point of view of the statistician to that of an investigator. Most investigators work on a relatively limited variety of topics, especially funded topics, in their career. A statistically significant effect must be viewed as a miracle and pointing to only one possible truth.

      Most statisticians work on a huge variety of topics in their career, including topics covered in graduate school. I imagine that, for them, a significant result is not a career-making phenomenon if particularly important at all!

    • Martha (Smith) says:

      Thanks for expanding my pun vocabulary.

  5. Matt Skaggs says:

    “Second, he’s making the common mistake of considering only one phenomenon at once…”

    Loved this paragraph! But seriously, who does this? Well, I know one answer, engineers do it. It seems to me that the fatal flaw of NHST is that the method cuts obliquely through the knowledge already at hand. We typically can think of multiple potential causes for an observed phenomenon, as your paragraph rather snarkily points out. If a formal cause tree approach is used, evidence gained through experimentation is mapped across the entire cause tree, perhaps supporting multiple root causes, but also refuting others. Knowledge is best advanced after all evidence is mapped against all possible causes.

    Imagine a world in which researchers can only get their work published if they show it in context in a cause tree, including how it influences each and every possible root cause. That is not to say that every researcher has to build their own cause tree, but everyone has to either build a new one or show how their work affects the consensus cause tree on the subject. If you are working on something novel like holding a pen in your mouth to make you happy, you gotta make your own cause tree!

    • Dzhaughn says:

      I like it. But one difference between figuring out why a rocket blew up and why Strack’s subjects felt happier is that there is no need for the engineers to find a ground-breaking, paradigm-shifting, click-baiting new explanation. Because rockets exploding sell themselves.

  6. Anonymous says:

    Stage 1: denial.

    I wonder if Strack will get to the later stages in his lifetime…

  7. Anonymous says:

    One of the big problems is that these psychologists do not understand sampling error and randomness. They may have a large effect with ‘luck’; note that p-hacking only is a problem if sampling error or randomness affects the outcome. As in psychology. Two final and related points:

    1) Boundary conditions or randomness are both explanations for a larger effect in the original study than in the replication study. In psychology, with small effect sizes and small samples, the most likely explanation is this: the published study has been subject to a selection process, selecting larger and statistically significant effects, and this selection process is not present in the replication.

    2) Of the two studies, we should trust the results of the replication study MORE than of the original study. So yes, there is an asymmetry. We should trust the replication study more, because this selection pressure was absent.

    Oh yes, these psychologists do not understand philosophy either. Boundary conditions make a lot of sense, but they should be derived from theory and be stated in the original article, and not after the replication by the original researchers. See Lakatos.

    Afterthought: I hope that in a hundred years the current crisis and ‘these psychologists’ are just famous because of the errors in statistics and philosophy they made, and not because of their research. But this may well not be happening…
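
The asymmetry in points 1) and 2) can be illustrated with a small simulation. This is a sketch under made-up assumptions, not a model of the actual facial-feedback studies: the true effect (0.1), the standard error (0.5, the same for both studies), and the rule that an original study is published only when it is significant in the predicted direction are all hypothetical, as is the function name. Each published original is paired with a replication that is reported no matter how it turns out.

```python
import random
import statistics

def originals_vs_replications(true_effect=0.1, std_error=0.5,
                              n_studies=200_000, seed=2):
    """Pair each published original with an unselected replication."""
    rng = random.Random(seed)
    originals, replications = [], []
    for _ in range(n_studies):
        original = rng.gauss(true_effect, std_error)
        # Assumed selection rule: publish only if significant in the predicted direction.
        if original > 1.96 * std_error:
            originals.append(original)
            # The replication has the same design and is reported whatever it finds.
            replications.append(rng.gauss(true_effect, std_error))
    shrunk = sum(rep < orig for orig, rep in zip(originals, replications))
    return {
        "published": len(originals),
        "mean_original": statistics.mean(originals),
        "mean_replication": statistics.mean(replications),
        "share_replication_smaller": shrunk / len(originals),
    }

summary = originals_vs_replications()
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Both studies here have identical designs and identical noise, so any gap between the two averages comes from the selection step alone. No boundary condition is needed to explain why the replications come out smaller.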

  8. Anon says:

    I think I first heard about this article from one of Retraction Watch’s weekly roundups. I remember that a part of the conclusion really stood out to me:

    > As things stand now, I am not optimistic about the impact of the Registered Replication Reports on the field. Strong effects will be replicated, weak effects not. If this incentivizes researchers to pursue “strong” effects rather than theoretically informative ones, it may shift the field into a more applied direction and away from theoretical innovation. And as long as the outcomes are not embedded in a critical debate, they are seen as final verdicts on an “effect” without a clear message on any underlying process. Moreover, non-replications may spread doubt about the integrity of the original research while the public discussion about the percentage of studies that cannot be replicated does not add to the reputation of our field (e.g., Johnson et al., 2017).

    I find several problems with this.

    1. Strack is against registered reports because he believes it will encourage researchers to pursue replicable effects rather than “theoretically informative” ones. Wow, I can’t help but read this as Strack saying that even if he knew a paper was unreplicable, as long as it was “theoretically informative/innovative,” he would prefer it to a paper that was replicable but somehow not “theoretically informative.” Frankly, I don’t even know what it means for a paper that can’t be replicated to still be “theoretically informative.”

    2. Strack seems to conclude the paragraph by saying that registered reports are bad for the field if they lead more journalists to write negatively about the field. Well, who is responsible for the bad reputation of the field? The professors or the journalists? Obviously we professors are responsible! And the bad reputation exists whether journalists write about it or not!

    • Martha (Smith) says:

      I think he gives insight into what I see as a big problem with the field of psychology: That a lot of psychologists care more about the theories that they devise (and want to believe) than in empirical evidence. That makes their type of psychology like religion or philosophy, but not a science.

      • Kyle MacDonald says:

        That’s possible. A more charitable but still plausible reading (at least of Strack’s comment quoted by Anon) is that many psychologists are more interested in the general structures underlying mental processes than in specific, concrete effects. And that’s fine! At least to a point, general structures are necessary in order to interpret specific findings, and a couple of surprising findings shouldn’t immediately overturn an organizing framework. (This is a Bayesian blog, after all!) Daniel Kahneman said a few times, “Economists have laws, psychologists have lists.” But laws help to organize lists. We need both hedgehogs and foxes, although academic disciplines, even psychology, are far more likely to suffer from an overabundance of hedgehogs than of foxes. Someone committed to a theoretical approach rather than specific effects is going to respond less to empirical evidence.

        Under this reading, the problem isn’t that researchers in this type of psychology are like evangelical preachers, but that they’re like a teenage boy trying to decide whether to do a backflip while drunk at a party. (I speak from experience.) They do try to have an organizing framework, which is flawed but plausible, and rules for adjusting their beliefs based on new evidence. However, the rules are not good, and they aren’t good at applying them even when it would be helpful.

        • Martha (Smith) says:

          Kyle said, “A more charitable but still plausible reading (at least of Strack’s comment quoted by Anon) is that many psychologists are more interested in the general structures underlying mental processes than in specific, concrete effects. And that’s fine! At least to a point, general structures are necessary in order to interpret specific findings,”

          I view this with what I see as a healthy scientific skepticism/agnosticism: Are there indeed “general structures underlying mental processes”? If you believe there are, what is the evidence — or is this just a philosophical/epistemological stance you assume? And are “general structures” indeed necessary to interpret specific findings? If so, what is the evidence? (I’m not saying you’re wrong, just that I need evidence to be convinced that you’re right.)

          • Kyle MacDonald says:


            Thanks for the reply. My comment was extremely speculative, as is anything I say about psychology — I’m a first-year grad student in statistical physics. I was aiming for the most optimistic possible interpretation of the specific Strack quotation in Anon’s comment, and I should have made that clearer. Generalizability, as Doug discusses below, is why I think that at least a bit of general theory is important, but it’s probably less important in psychology than in the physical sciences, which is where my biases and intuitions have been formed. Rocks fall pretty much straight down whereas functioning airplanes generally don’t, and interpreting those observations is much easier if you already have some ideas about air pressure. But I’m perfectly willing to back down from any suggestion that the situation is analogous in psychology.

          • Glen M. Sizemore says:

            Are there indeed “general structures underlying mental processes”?

            GS: What does this even mean?

            • Kyle MacDonald says:


              With “general structures underlying mental processes”, Martha was quoting my very vaguely worded comment, in which I tried to give the most charitable interpretation possible of Strack’s words as quoted by Anon:

              “If [the focus on replication] incentivizes researchers to pursue “strong” effects rather than theoretically informative ones, it may shift the field into a more applied direction and away from theoretical innovation. And as long as the outcomes are not embedded in a critical debate, they are seen as final verdicts on an “effect” without a clear message on any underlying process.”

              My point, badly articulated, was that it could be valid to care more about the “underlying processes” that Strack mentions, under the charitable assumptions that these exist and can be studied productively, than about specific, isolated effects. Under the same charitable assumptions, Strack could be justified in his concern about a focus on specific effects that do replicate but don’t teach us much about how the mind works. He could also be right that small or weak effects, which sometimes will fail to replicate just because they’re small, could provide important insights. All of this would be consistent with a viewpoint far removed from religion.

              Stepping back from the charitable assumptions, though, I’m not defending bad empirical research, and if you don’t actually care whether your empirical work is accurate, you shouldn’t stand by it. As Andrew puts it in his unpublished paper on incrementalism,

              “The work should stand or fall based on its qualitative contributions, with the experimental data explicitly recognized as being little more than a spur to theorizing.”

              (To clarify, Andrew was not talking about Strack’s work in that paper, but about a 2016 publication by Burum et al in particular and about the problem of replicating small effects in general.)

      • Doug Davidson says:


        Speaking as a psychologist, I think the problem is not quite as severe as you theorize, but is still a very big problem. A lot of theories are based on effects that have been demonstrated with experiments optimized for detection, but not generalization. I think that is all you really need to explain a lot of the lunacy that we have been discussing.


        • Stephen Martin says:

          Also speaking as a psychologist [in training], I take issue with this.

          1) Theories [about psychological functioning] are demonstrated with experiments designed for detection, but can’t be generalized? Can they then just not be theories about psychological functioning? If a theory about psychological functioning doesn’t generalize, then it doesn’t seem to be a theory about psychological functioning, but rather context. Which is FINE, if the theory is actually meant to be testing contextual phenomena, but I don’t see any theories stating “We think this psychological effect only exists for X people in Y region when observed by Z researchers”.

          2) Even so, the experiments are optimized for ‘detection’ how? Well, the procedure generates some quantity which is improbable under H0. That quantity, if it doesn’t generalize at all, isn’t really detecting anything then. It’s just a noise quantity.

          So actually, I think it is quite severe. If theories about psych functioning don’t generalize, then the effects aren’t driven by only psych functioning, and we see that in the RRRs. If experiments are designed for detection, but then direct replication attempts don’t detect it, then the original experiment is unlikely to have detected anything at all; it was noise, error, analytic ‘flair’, whatever one wants to call it.

          • Doug Davidson says:

            I agree with you on most of your points – the problem is severe. The thing is, it is often the case you need to have some kind of effect to get started – you need something to work with.

            My point is that this shouldn’t be the end product. It should be the *start*. Too many people treat the simplified lab experiments as the end product, when they are really just the initial evidence.

        • Martha (Smith) says:

          Doug said, “A lot of theories are based on effects that have been demonstrated with experiments optimized for detection, but not generalization.”

          I don’t think this addresses what I was trying to say. My impression (I would call it that rather than a “theory”) is that much research in psychology starts out with a theory, then proceeds to perform and analyze an experiment to try to detect evidence for that theory. In other words, the theory comes first, rather than being “based on” the effects shown in the experiment.

          • Doug Davidson says:

            Thanks for clarifying – I guess I agree with you for the most part – at least in my little corner of the field we don’t have a strong tradition of trying to falsify our own theories (there are usually plenty of helpful colleagues for that!)

      • Bill Jefferys says:

        So, Martha, the psychology of these particular psychologists is part of the problem?

        (And here’s a wave at you from Vermont!)

        • Martha (Smith) says:

          I guess you could put it that way — or, to be more specific, that I believe that a good scientific psychology (mindset) requires a lot of skepticism/agnosticism about theories; an effort to refrain from becoming too emotionally attached to a theory.

          (A wave back from the 99° heat.)

      • Mayo says:

        There is a huge misconception about philosophy. The sophisticated level of argument required to sustain a philosophical theory–in any good philosophical realm–to answer stringent counterexamples and criticisms, is far and away more rigorous than psychology or religion. As some philosophers try to move to questionable empirical testing of philosophical claims (e.g., in ethics), they too lose rigor.

  9. Mark Palko says:

    I hit the tweet button and got an error. The title with the address was 141 characters.

  10. Mark Palko says:

    The replication crisis and the sweeping hypothesis syndrome highlight each other. Researchers like Strack insist that the findings of their small and extremely artificial studies reveal some profound general truth about the world, then insist that the effect is so limited that you have to recreate every condition of the original study exactly to replicate those findings. These claims are not compatible.

  11. Mark Palko says:

    Wait a minute, you never told us about this:

    I’m sure Malcolm Gladwell wouldn’t promote junk science.

  12. Mark Palko says:

    This calls to mind something from Wansink’s cute names for vegetables study.

    For the second version, Strack added a new twist. Now the students would have to answer two questions instead of one: First, how funny was the cartoon, and second, how amused did it make them feel? This was meant to help them separate their objective judgments of the cartoons’ humor from their emotional reactions. When the students answered the first question—“how funny is it?,” the same one that was used for Study 1—it looked as though the effect had disappeared. Now the frowners gave the higher ratings, by 0.17 points. If the facial feedback worked, it was only on the second question, “how amused do you feel?” There, the smilers scored a full point higher. (For the RRR, Wagenmakers and the others paired this latter question with the setup from the first experiment.)

    In effect, Strack had turned up evidence that directly contradicted the earlier result: Using the same pen-in-mouth routine, and asking the same question of the students, he’d arrived at the opposite answer. Wasn’t that a failed replication, or something like it?

    Strack didn’t think so. The paper that he wrote with Martin called it a success: “Study 1’s findings … were replicated in Study 2.” In fact, it was only after Study 2, Strack told me, that they felt confident enough to share their findings with Wyer. He said he’d guessed that the mere presence of a second question—“how amused are you?”—would change the students’ answers to the first, and that’s exactly what happened. In Study 1, the students’ objective judgments and emotional reactions had gotten smushed together into one response. Now, in Study 2, they gave their answers separately—and the true effect of facial-feedback showed up only in response to the second question. “It’s what we predicted,” he said.

    • Jordan Anaya says:

      Wow, that was an interesting article.

      This whole thing reminds me of a quote that I think was attributed to Ray Allen. It goes something like this:
      “I’ve never missed a shot, the ball just doesn’t go in sometimes.”

      What Ray Allen was saying was that he uses the same, perfect form for every one of his shots, and as long as he does that it doesn’t matter if the ball doesn’t go in. Well, of course it matters if the ball goes in, but it doesn’t matter in the sense that Ray Allen has already done everything he can on his end.

      Psychologists seem to be using a similar dogma. They’ve never had an idea that was wrong, experiments just don’t confirm their brilliant ideas sometimes.

      Psychologists are gracing us with their brilliance, it’s just up to us to figure out ways to produce results that accurately reflect this brilliance. PPNAS knows how, Wansink knows how. The rest of us need to get with the program.

      • Andrew says:


        I’ll have to write more about this . . . but just, very quickly, I suspect that one problem with psychology is that people like Bargh, Baumeister, Strack, Fiske, Gilbert, etc., have received a steady stream of adulation for decades. Students hang on their every word, colleagues bring them awards, journalists write adoring profiles of them. Until recently, there’s been nobody on the other side. Even those in the psychology profession who don’t believe their research claims, would pretty much leave them alone.

        We think of Bargh, Baumeister, Strack, Fiske, Gilbert, etc. as being aggressive—but, until recently, they weren’t aggressive at all! They just did their job and basked in the glory. The aggression has all come only in defense of what they must think is their natural and appropriate status.

        I imagine that when they first started encountering criticism of their work, and of the work of their friends, that they were puzzled, and then assumed the critics were confused, and then assumed the critics were motivated by ill will. After all, everyone had been telling them for decades how wonderful they were, so why take seriously the haters who think otherwise?

        Wansink’s a bit different, I think. He comes off in his writing as more of a hustler, and part of his charm is that vibe he gives off, that he’s worked so hard to get where he is today.

        What makes me really sad is the young researchers who’ve jumped on the unreplicable research train, just before it headed off the cliff. On this blog I’ve given them lots of sincere advice (for example, to focus on measurement) but some of them don’t seem to want to hear it. And of course they have the encouragement of some people in the old guard who’d rather not be left completely alone.

        • Glen M. Sizemore says:

          “I imagine that when they first started encountering criticism of their work, and of the work of their friends, that they were puzzled, and then assumed the critics were confused, and then assumed the critics were motivated by ill will.”

          GS: Hmmm…I wonder if we know anyone else like that? Someone, you know, close to home and all.

        • Mayo says:

          Andrew: Your post made me curious to check the article.

          One of the big problems, and I’ve heard researchers in the social sciences admit it, is that the statistical analysis wasn’t ever really important to them, it was just required window-dressing for publications. The idea that facial expressions could alter mood isn’t wildly implausible, it’s the experiments that are problematic.

          A lot of the artificiality of these experiments results from supposing it helps disguise the intended experimental aim, but it often changes the actual manipulation (and students generally know the expected effect anyway).

          Here the experiment “has participants holding a pen either between their protruded lips, which prevents smiling, or between their teeth, which facilitates it. Holding the pen in either of these positions, participants had to perform a series of tasks, among them a rating of the funniness of cartoons. As predicted, the cartoons were rated to be funnier if the pen was held between the teeth than between the lips. The effect was not strong but met the standard criteria of significance.”

          If I try to write something while holding a pen between my teeth (to facilitate smiling) it’s so difficult and aggravating that I can’t imagine my rating of a cartoon (marking the number while holding the pen in my teeth) would reflect how funny I think it is, let alone my mood. If smiling did tend to make me happy, smiling as a result of holding a pen between my teeth, and writing with it, would undo that feeling. This is especially so if I’d already been asked to write other things holding the pen this way before being shown the cartoon, given the pain. Writing while holding it through the “frown” position is easier. (I think Likert scales are fairly meaningless anyway).

          [They should consider interviewing the students discounted from the study because the camera showed they dropped the pen more than once.]

          By the way, the author wonders what “caused nine teams to replicate the original findings and eight teams to obtain results in the opposite direction.”
          I don’t know if this means 9 of 17 were statistically significant?

          One other thing I’ve been seeing lately crops up in his article: the author claims that to regard an effect as a statistical false positive is tantamount to denying causal laws are operative. Since causal laws are operative, there can’t be false positives, or so he avers. But this is wrong. If the cure rate of your drug is the same as placebo, it doesn’t mean the individual cures/no cures aren’t caused—only that your drug hasn’t increased aggregate cures or the like.

          • Andrew says:


            Indeed, Strack’s article makes an interesting argument, which is to deny the relevance of the concept of randomness at all, even for something as paradigmatically random as the rolling of dice. Strack writes:

            Let us assume we are rolling some dice and come to the conclusion that resulting numbers are produced by chance. This conclusion does not imply that the laws of mechanics have not been operating. It just means that the interaction of the various influences has been so complex that it was not possible to generate a prediction. But in principle, a chance outcome is just as causally determined as any other mechanical phenomenon.

            That’s all fine. But when applying this reasoning to his data analysis, Strack has the big, big problem which he doesn’t seem to recognize, that he’s privileging his own particular story (regarding the pen between the teeth, etc.) over all the zillions of possible explanations out there, including ovulation, birth order, choice of clothing, outdoor temperature, sex of the experimenter, key words such as “Florida” or “bingo” that might’ve been in the instructions, and so on.

            • Mayo says:

              On the first point, I don’t see it as an interesting argument, it’s irrelevant. “However, adherents of the theory of Null Hypothesis Significance Testing seem to assume that chance is causally undetermined. But if it is the task of a scientist to identify the causes of things, it seems highly problematic to assume that some effects are “false” in the sense of being causally undetermined.”
              This would mean there are no false positives because things have causes. In fact, there’s no causal underdetermination assumed. We can allow that every single outcome, say whether a person scores a joke funny on a given day, is deterministic. If we knew every factor in the universe, suppose, I could predict how funny Gelman rates a cartoon. It’s irrelevant to the meaning of:
              your treatment made no difference to funniness scores.
              The stat test is trying to learn about the causal factors by at least ruling this “no effect” hypothesis out, and if it can’t reliably do so, he’ll not even learn if facial position makes a difference.
              On the zillions of possible explanations, isn’t that the point of randomly assigning people to treatments? (I know you mentioned this, but do you really think it helps not at all?) Each person already has a funniness number attached to them for each cartoon, say. The RCT is supposed to let us compute the probability of observing different mean scores (between the 2 pen treatments) merely due to random assignment to the group “wrote with pen held in smiley mouth”. If this can’t be done, it’s hard to see that there’s a valid statistical test to begin with.
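The random-assignment argument above can be made concrete with a permutation test. What follows is only a minimal sketch, not the analysis used in the original study or in the replications: the function name `permutation_p_value` and the ratings are made up for illustration. It holds each person's rating fixed, reshuffles the group labels many times, and asks how often relabelling alone produces a difference in means at least as large as the one observed.

```python
import random
from statistics import mean

def permutation_p_value(teeth, lips, n_shuffles=10_000, seed=0):
    """Two-sided p-value for the difference in mean ratings under random relabelling."""
    rng = random.Random(seed)
    observed = mean(teeth) - mean(lips)
    pooled = list(teeth) + list(lips)
    n_teeth = len(teeth)
    extreme = 0
    for _ in range(n_shuffles):
        # Each rating stays fixed; only the group assignment changes.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_teeth]) - mean(pooled[n_teeth:])
        # Small tolerance so ties with the observed difference are counted.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n_shuffles

# Hypothetical ratings on a 0-9 scale, invented for this sketch (not study data).
teeth = [5, 6, 4, 7, 5, 6, 3, 5]
lips = [4, 5, 5, 6, 4, 5, 3, 4]
observed, p = permutation_p_value(teeth, lips)
print(f"observed difference = {observed:.3f}, permutation p = {p:.3f}")
```

If the proportion returned is large, random assignment by itself readily produces differences of the observed size, which is the sense of "your treatment made no difference" that the test is trying to rule out.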

              My more serious problem is, as usual, that the study seems to have little to do with what they want to find out. And I see no point to writing your funniness score with a pen held in your mouth, rather than held in hand. (But, then again, I just heard about this the other day on your blog.)

              • Not having read any of it, I’ve gotta believe that they had people hold one pen in their mouth, and use another one held in their hand to fill out the survey… Please someone tell me this is the correct assumption.

          • Ben Prytherch says:

            To answer this question:

            “By the way, the author wonders what ’caused nine teams to replicate the original findings and eight teams to obtain results in the opposite direction.’ I don’t know if this means 9 of 17 were statistically significant?”

            None of the 17 attempted replications were statistically significant in either direction.
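As a rough back-of-the-envelope check on the "nine versus eight" split, one can treat the direction of each team's estimate as a fair coin flip under the hypothesis of no effect. This is a simplifying assumption made only for this sketch: it ignores effect sizes and sample sizes, and it is not the meta-analysis the replication report used. The helper `binomial_tail` is a hypothetical name.

```python
from math import comb

def binomial_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_teams = 17         # replication teams
same_direction = 9   # teams whose estimate pointed the same way as the original

# Chance of seeing 9 or more of 17 in the original direction if direction were a coin flip.
print(binomial_tail(n_teams, same_direction))  # 0.5
```

Under that assumption, a 9-to-8 split is about as unremarkable as a result can be, so there is nothing in the split itself that needs a causal story.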

  13. Fritz Strack says:

    Thanks for all the attention. But keep your gunpowder dry.
    There is more to come ;-)

    • Andrew says:


      No gunpowder here. Nobody’s shooting anyone. We’re just trying to learn about the world outside of the lab (or the 100 Mechanical Turk volunteers), and we’re trying to explain the limitations of what you can learn from noisy data. No need to fight any battles; just try moving forward. You could start by reading these four papers:

      • Fritz Strack says:


        This is very generous of you.
        Thanks a lot, also for “big shot” which I have already adopted in my Facebook profile.
        But not to worry, I am quite happy and not at all “furious”.


        • Joel Devonshire says:

          Just want to point out something in a “meta-dialogue” that I think both psychologists and statisticians alike should probably find of importance, as it pertains to the relationship between observation and evaluation. Let’s consider the title of this blog post (which, predictably, has elicited a response from its target):

          “Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories”

          The intention of the blog post seems to be to make some observations (or commentary) on some unfolding events in the social psychology literature. But even as I look at the title, I find a mixture of observation and evaluation. This mixture is implicit and never directly addressed in the subsequent post. But I think it’s significant since it likely affects how the message will be received by various readers. Let’s take a look at some of the specific words:

          Bigshot. Unhappy. Famous. “Won’t consider.” Scrambles. Furiously. Preserve.

          Out of 25 words, eight of them (about 30%) seem to contain evaluations. The problem with mixing evaluation and observation is that it sends an obscure message, and one that is likely to evoke defensiveness in the listener. It focuses on right vs. wrong while simultaneously hiding many of the assumptions of the speaker, shifting responsibility (of views, feelings, etc.) from the speaker to target. Evaluations often assume intent. They often have an essentialistic quality (e.g., “Bigshot,” “Methodological Terrorist,” etc.). I have become convinced, largely through the work of clinical psychologist Marshall Rosenberg, that the failure to distinguish and separate our observations and evaluations is a highly problematic feature of our everyday discourse, and contributes to a kind of “violent” language. Others might call this kind of thing “snark” or “sarcasm” and be mildly amused by it. But my evaluation would be that it represents a kind of “lazy” use of language. I think someone else in these comments alluded to this by referring to the mutual feeling that the “other side” is being aggressive.

          I think keeping this in mind when publishing in public forums is probably a good idea, since doing so can contribute to more effective, civil discourse. And I say this as a graduate student who is heartened and glad by the sea change in the field of social psychology toward more robust methods and sophisticated statistical analysis. But ALSO as one concerned about the profusion of unproductive arguing and bickering in society today.

          One suggestion is that instead of continuing the use of aggressive language, improve the level of discourse by using methods in which one’s thoughts, feelings, observations, and requests are as carefully parsed out as possible (or at least by acknowledging the failure to do so, as I’m sure I fail to completely do in this comment).

          • Andrew says:


            I’ll take that as a response to the request in the last paragraph of my post. So thank you. I believe in division of labor. I find it hard enough to figure out what I think; it takes all my effort to express my own views clearly. So I’m glad there are people such as you who are interested in putting in the effort to re-express things in a constructive way.

            One difficulty I had with Strack’s article is that from a scientific and statistics perspective it’s so bad, so uninformed, that in some way the most interesting question about it is how it got published in the first place.

            • Joel Devonshire says:

              But the natural give-and-take in this interaction would be for Dr. Strack to indeed read and consider your statistical publications and arguments you’ve shared, and for you in turn to acknowledge that you were indeed “shooting someone” by calling him a “Bigshot psychologist.” In this way, each party can own a part of the responsibility for a productive way forward. Unfortunately, as in most arguments, most of the words will probably fly over the heads of all parties concerned.

              There are some areas of life in which division of labor is simply not a mature perspective.

              • Andrew says:


                No, I won’t “acknowledge” that calling someone a bigshot is the same as shooting someone. I hate this kind of verbal escalation. If you would like to write that way, it’s your call, but you’ll find it difficult to communicate with me: to me, all this talk of terrorism, gunpowder, etc., is a distraction from the more interesting scientific issues. But, if you want to write that way, it’s your call.

                I’m a statistician and have done a lot of work in recent years on statistical issues related to the replication crisis. I don’t think I would’ve been able to do this, nor do I think I’d be able to continue to do this, if I were not free to say directly what I think, and to engage with people such as Nick Brown, Jordan Anaya, Anna Dreber, Brian Nosek, and the many other people who have put politeness aside in order to directly convey their findings.

                In any case, I encourage you to pursue the course of action which you think will be most useful.

              • Martha (Smith) says:


                I hope you don’t take this the wrong way, but I think it would be helpful if you would try to parse out your reply to Andrew’s reply to your previous comment in terms of what you were talking about in your previous post — namely, what in your reply do you consider observation and what do you consider evaluation?


                1. “But the natural give-and-take in this interaction would be …” I would call this an opinion. To you would it be an observation, an evaluation, an opinion or some combination of those, or something else entirely?

                2. “There are some areas of life in which division of labor is simply not a mature perspective.” Same comment/question.

              • There’s also the distinct possibility of say misunderstanding words due to idiomatic usage that non-native speakers might misunderstand for literal usage. The word “bigshot” or “big shot” is idiom, the definition given by googling “define bigshot” is “(informal) an important or influential person” which accords with exactly the way I read it.

                so, there’s nothing about gunpowder or shooting involved in the modern usage of this word (though historically it may have come from calling someone a very good marksman, or comparing them to a high caliber large, influential type of gun.. I don’t know).

                There is, perhaps, a connotation that a “big shot” is someone who is influential through intimidation or intentionally using their own reputation to give themselves power over those without similar reputations, but it’s an ambiguous kind of connotation.

              • Joel Devonshire says:

                I’m not sure how nested replies work here, so I’ll reply to both Andrew’s and Martha (Smith’s) replies.

                First, Martha (Smith): Both of the examples you raise in my writing clearly contain evaluative language. Words like “natural” and “mature” are dead giveaways. And, yes, being explicit about it all the time is very challenging and can make for difficult communication (which is why I fail often). But, in my opinion, the difficulty of obtaining the ideal does not detract from its usefulness or alignment with what I discern as principles of peaceful dialogue. These are the same kinds of principles that I try to instill in my five-year-old son. And, yes, they are clearly my own evaluations.

                Then, Andrew: I’m not trying to tell you how to speak/write, but rather point out a principle that is both pragmatic and psychological. On the pragmatic side, I think you would find your arguments hitting home more successfully among the audience that needs it the most if you were sensitive to the ad hominem fallacies contained within them. Second, on the psychological side, I don’t think any reasonable person is trying to equate “impoliteness” with physical violence. But having just returned from a silent 10-day meditation retreat, I can certainly attest from personal experience about the deep connection between the way we speak and the relative peace/violence within us. It’s certainly true in my life and I thought you might appreciate my attempting to point it out. But you are obviously free to continue working however you choose, and I will comment no further on this issue.

              • Keith O'Rourke says:


                I found when I worded things in a non-confrontational manner when asking questions of speakers they often used this to avoid addressing the concerns I was trying to raise.

                So “politeness aside in order to directly convey their findings” does ring true to me – some evaluation of the speaker’s arguments may often be required to get a response.

  14. EJ Wagenmakers says:

    There is an interesting new development here. I recently read on Twitter that Tom Noah and his advisor have conducted an experiment that purports to show that the effect is present when there are no cameras to monitor whether the pen is held correctly. I have not seen the data and I find the hypothesis (i.e., that the cameras somehow make the effect disappear) implausible. I’m also not sure how many other experiments have been conducted that do not find the result (and we don’t learn about). But if I can’t find a serious flaw in the design I’ll certainly propose an adversarial collaboration to sort this out. The effect is in most psych textbooks so it deserves to be looked at from all angles. I have been told that Tom shared his materials on the OSF, and I’m excited about such a high level of transparency.

    • Anonymous says:

      “I’m also not sure how many other experiments have been conducted that do not find the result (and we don’t learn about)”

      This perhaps also applies to all the follow-up studies looking for moderators as a result of high profile RRR’s. Perhaps there have been dozens of studies performed already aiming to find moderators. Perhaps the “no camera”-study that you mention has been performed countless times. We simply do not know…

      If the study you mention was performed as a Registered Report where the study is published regardless of how the results turn out (and hopefully with public access to pre-registration information!!) then i would put some faith in it. If this is not the case, i can only hope that someone will replicate it very soon.

      To me, this could potentially be a very useful case-study as to why performing research where journals and/or researchers can decide which results to publish after they are known is useless and (contributes to) a waste of resources…

      • Alon Goldstein says:

        I happened to hear Tom Noah’s talk and see the results.
        Unlike what is implied here (and on Twitter), it’s not just a random guess that led her (it’s a she, btw) to run these 200 participants in a 2×2 study.
        Her group had recently shown other findings about how people act with and without cameras, mirrors, etc. Their main practice is studying the changes in behaviors that arise from being the observer vs. being the observed, so this specific study (that was preregistered and will be available on the OSF) stemmed from grounded theory (that they presented at the previous EASP general meeting, a couple of years before the 2016 replication was published).

        I think it’s also important to mention that their study replicated BOTH findings – they didn’t just hypothesize regarding the camera, it was an IV: Half the Ps had a camera in the room and the exact instructions given by the replication team (including diverting attention to the camera), and the other half did not have a camera, and the instructions lacked that part (all instructions were on-screen, so no experimenter was involved).
        The first half did not show any effect, but the ones without the camera had shown the original one.

        So it’s unfair to claim that “one side gives more weight to the original findings and ignore the replications” when we have both the theory and the data to support the claim that these 2 exps differed from one another.

        • Fritz Strack says:

          I am curious to see Gelman’s response.

        • Erikson says:

          The replication study doesn’t address this, but the original study has some flaws that should have raised a red flag. The most obvious is that, in Study 1, the statistical significance filter is only passed via an undisclosed linear contrast that made use of all subjects (judging by the number of degrees of freedom), tested by a single-tailed t statistic. I have no problems with linear contrasts or single-tailed tests, but it has become more and more obvious how those researcher degrees of freedom allow for (deliberate or naive) manipulations of the results, such as a non-significant ANOVA turned significant by applying a contrast and using a single-tailed p-value. And there’s the issue of the reliability of a single-item scale.

          Study 2 relies on the discriminant validity of both questions used to infer the two proposed aspects of humor. What if the observed effect was similar to the first study, i.e., that ‘funny’ ratings were greater in the ‘teeth’ condition? Would it count as a replication, since the words in the scale were similar? Or should we trust that both items indeed are able to discriminate between affective and cognitive aspects? So, no exact replication anyway, but even if we can accept that what was ‘funny’ before now means ‘amused’, the ‘effect’ is confirmed by a p-value of 0.08, halved by the single-tailed trick. And how do we interpret the effect reversal in the ‘before rating’ condition? I’m pretty sure a smart psychologist can come up with a myriad of explanations.
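The "halving" mentioned above is simple arithmetic for a symmetric test statistic: when the estimate falls in the predicted direction, the one-tailed p-value is half the two-tailed one. The sketch below uses a standard normal z statistic purely for illustration (the original analysis used t statistics), and the value z = 1.75 is a hypothetical number chosen so that the two-tailed p-value lands near 0.08.

```python
from statistics import NormalDist

def p_values(z):
    """Return (one_tailed, two_tailed) p-values for a standard normal z statistic."""
    std = NormalDist()
    one_tailed = 1 - std.cdf(z)             # only the predicted direction counts
    two_tailed = 2 * (1 - std.cdf(abs(z)))  # either direction counts
    return one_tailed, two_tailed

z = 1.75  # hypothetical test statistic, not taken from the paper
one, two = p_values(z)
print(f"one-tailed p = {one:.3f}, two-tailed p = {two:.3f}")
```

With these assumed numbers the same data sit on one side of the conventional 0.05 line or the other depending only on which version of the test is reported, which is why an undisclosed choice between them matters.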

          The replication study, on the other hand, has a pre-registered analysis plan and 17 independent teams replicating the first study with usually larger samples. All effects were smaller than in the original study and the meta-analysis suggests an almost null effect. Should we trust an analysis with many researcher degrees of freedom or 17 replications with pre-registered analyses? But no, maybe it was really the camera all along.

          But what makes me truly sad about it all is the scientific aspect. The causal mechanism behind the facial feedback hypothesis sounds quite sketchy to me, like most psychological hypotheses. The Strack et al. study was important because it seemed to provide evidence in favor of the hypothesis without the possible contaminations of the earlier studies. The effect was almost non-existent in the original study; it turned to dust in the replication. Now, making Lakatos turn in his grave, all sorts of mediators and interactions are posited to save a really poor result from the original study. The camera? Fine, it makes sense. We can make up enough theory to justify that. But we could also argue that self-perception is the key behind the mechanism, as Laird proposed, and given that self-awareness is usually increased as one is observed, the effect should increase in the camera situation. Why not? All sorts of mediators, as Andrew points out in his post. Or we can just argue that this precise experiment has no relation whatsoever with the proposed hypothesis and it can’t really test it at all. Or maybe the facial feedback is not such an important mechanism, if it exists, even though Darwin and James sketched it a long time ago.

          How is psychology supposed to move forward? As a psychologist, I really think the replication efforts are the way forward, to clean the field and demand higher quality evidence before corroborating some shady hypothesis. But without better measurement, really well-defined causal mechanisms and experimental designs that put a proposed hypothesis to severe testing, I fear we might end up with a clear field with nothing to plant in it.

          • Martha (Smith) says:

            +1 to Erikson’s last paragraph.

          • Andrew says:

            Yes, I agree too. That’s why when criticizing various studies such as in that ovulation-and-clothing paper, I typically don’t recommend preregistered replication. Instead I recommend more carefully thinking about what is being studied, and taking more direct measurements of the object of study.

          • Fritz Strack says:

            During the past 6 years, at least 20 studies have been published demonstrating the predicted effect of the pen procedure on evaluative judgments (for a selection of relevant publications since the year 2000, see
            I know, they were not “preregistered” and can therefore be neglected.

            • Anonymous says:

              As a junior researcher following recent events in psychology, i have come to the (provisional) conclusion that the fact that 20 studies have been published demonstrating the effect of the pen procedure tells us close to nothing.

              This is because, as i understand it, psychologists and/or journals have set up a system which lets them decide which studies to publish based on the results, which distorts any attempt at correctly summarizing the evidentiary value of studies related to a certain effect/phenomenon/theory.

              If this reasoning is (largely) correct, i find it fascinating that you seem to not find this situation a problem in interpreting research results. Am i wrong in this?

              p.s. As i understand it, merely “pre-registering” doesn’t solve this problem as pre-registered studies can also be hidden. Only with “Registered Reports” there is no hiding of studies based on the results.
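The selection argument above can be illustrated with a small simulation. This is a sketch under loudly stated assumptions, not a model of any actual literature: the true effect is set to exactly zero, ratings are unit-variance normal draws, each study uses a simple known-variance z-test, and a study is "published" only if it is significant in the predicted direction. The function name `simulate_literature` and all numbers are hypothetical.

```python
import random
from statistics import NormalDist, mean

def simulate_literature(n_studies, n_per_group=30, true_effect=0.0, seed=1):
    """Run many two-group studies; return the estimates that pass the publication filter."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.95)   # one-tailed 5% threshold
    se = (2 / n_per_group) ** 0.5         # SE of a difference in means, unit variance assumed
    published = []
    for _ in range(n_studies):
        teeth = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        lips = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = mean(teeth) - mean(lips)
        # Filter: only significant results in the predicted direction get published.
        if diff / se > z_crit:
            published.append(diff)
    return published

published = simulate_literature(n_studies=400)
print(f"studies run: 400, published: {len(published)}")
if published:
    print(f"mean published effect: {mean(published):.2f} (true effect is 0)")
```

In this toy setup every published study "demonstrates the predicted effect" even though no effect exists, so a count of published successes carries little information unless one also knows how many studies were run.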

              • Fritz Strack says:

                I hope you apply these criteria to all research that you encounter.

              • Keith O'Rourke says:

                > If this reasoning is (largely) correct
                Given my work in meta-analysis of clinical trials (187 to 2007) I believe it is correct in academic research.

                In various areas of regulatory (e.g. FDA) I do believe they can prevent studies from becoming jointly un-interpret-able by usually disregarding studies that were not previously discussed and agreed upon before they were conducted, demanding all the raw data they wish to review, demanding all protocols, demanding alternative analyses be performed, request multiple studies and having legal penalties for not reporting problems that arose in the research.

                Not to suggest that this is always done and legal penalties prevent sloppy or bad work – but I think it makes most of the research evaluated credibly interpretable.

            • Keith O'Rourke says:


              If one could find about how many studies were done and retrieve their protocols and raw data – then it should definitely not be neglected.

              Given those unknowns, any attempt to make sense of the write ups of those 20 studies is overly hopeful.

            • Ben Prytherch says:

              “During the past 6 years, at least 20 studies have been published demonstrating the predicted effect of the pen procedure on evaluative judgments.”

              At least 29, if you count Wagenmakers et al (2016).

            • Erikson says:

              I would like to know what does it mean to ‘demonstrate the predicted effect’. I certainly wasn’t able to read all 20 studies, but some which I read didn’t really replicate the main effect from the original study.

              Andréasson (2010) has 4 experiments; in experiment 2, the main effect of experimental condition (sulky vs. smiling) was not significant and the funniness ratings were in the opposite direction. There is an interaction to save the day, though. But how do you make up a theory to explain why people with lower empathy scores have the reversed-facial-hypothesis effect? In experiment 3, the main effect is in the opposite direction! But experiment 4 saves it all by using a different strategy to induce frowning or smiling. Do those experiments count as ‘wins’ in demonstrating the effect?

              Lobmaier et al. (2015) have a different experiment – no funny ratings at all. The pen condition main effect is not statistically significant, but the interaction between condition and sequence is. When the analysis compared the conditions among themselves directly, though, only one comparison was statistically significant, which seems well demonstrated in Figure 1. How do we deal with those plethoras of p-values that could be interpreted in both directions?

              In Sel et al. (2015), the instructions were clear about the emotions conveyed in each experimental condition, which again might lead to bias in each situation. But the happiness results were not significant at all for self-emotion and there was no interaction to save the day this time. The results from EEG are a little more positive, but there was no main effect for self-emotion and only a very tenuous interaction between self- and other- condition. What counts as a demonstration in this case? Should no effect in the behavioral assessment, but weak effects (with forked-path analyses) in the EEG results, count as a ‘win’?

              Really, how do we interpret all those crazy results? The ‘camera is a moderator’ kind of psychologist might argue that the theory has enough successful predictions in the science bank and question the experimental results; those wishing for more evidence might argue that we have a lot of noise in those experiments and we can’t really conclude much from each one of them, and demand better evidence. But can we still count it as ‘demonstration of the predicted effect’?

          • Glen M. Sizemore says:

            This is what psychology needs to be a natural science:

            1.) Take conceptual analysis seriously, thus paving the way to overhaul the ridiculous assumptions underlying mainstream psychology (i.e., representationalism, “information processing” – same thing…).

            2.) Promote experiments that can uncover facts that can be demonstrated in individual subjects – you know, since behavior is something relevant only to individual subjects. Single-subject designs would also eliminate the replication crisis.

  15. Jonathan says:

    Let’s look for some value in these ‘studies’. I can come up with a few. They all devolve to versions of ‘the power of positive thinking’: if you need to smile, stick a pen in your mouth (that autocorrected to penis, btw, but I’m not gay so I changed it to pen). If you need to stroke that putt smoothly, visualize stroking that putt smoothly. Etcetera. If this were science fiction, we could imagine a ‘mechanism’ – call it an Asimov machine – that would select the ‘positive’ reinforcement you need at that moment. (On a darker note, Philip Dick – not another penis reference – might imagine a drug doing that.) Thing is, does anyone not actually use the Asimov machine? We all have ways of putting ourselves in a better frame of mind, of calming nerves, etc. I can accept that these exist without believing they are replicable because they occur. But they don’t occur like Asimov might imagine; they are not regularly effective and they don’t occur when needed. One can look at the former and see a replication issue but the latter is one too: an ‘effect’ that occurs sometimes needs a specific pathway, as in biology (although anyone who understands a little about referred pain knows the pathways are often hidden!) and the pathway can’t be that it shows up ‘whenever’ or if it requires a magical step (as in, you not only super-dilute a solution but then you activate its memory by whacking it 6 times against a specific leather surface).

  16. David18 says:

    The “Moneyballization” of fields where engineering rigor supersedes intuition and “feelings” is hugely disruptive. Recall Michael Lewis’s book/movie Moneyball where the baseball scouts and managers were very angry about the use of data which proved to have much better results than their entire careers of intuition. Many experiments are not well constructed and should have qualified statisticians involved at the design phase.

    • Anoneuoid says:

      Many experiments are not well constructed and should have qualified statisticians involved at the design phase.

      I don’t think this is so simple. Most statisticians right now will design things so researchers get the answer to something they don’t actually care about (is there any difference between two groups/scenarios?)

  17. Steve Read says:

    I’m afraid I don’t see why the later “controlled” replication is necessarily privileged over the earlier study. Later “controlled” replications can also be done badly or with the addition of factors that have unforeseen effects. For example, given what we know about self-awareness, it’s plausible that adding cameras of which the subjects were aware could change the results.

    • Ben Prytherch says:

      This is an important point, and of course if a replication is poorly done, then its results are likely uninformative.

      Regarding why the replication should be privileged, it is because the statistical analysis is not affected by the common biases that typically affect non pre-registered analyses. If the original was pre-registered, I’d give them equal weight. If the original was performed in an environment that allowed for flexibility (as most are), any claim supported by impressive statistical results (e.g. small p-value, large difference in means) is suspect.

      This doesn’t mean that there aren’t other important considerations, such as the ones you point out. It just means that, all else equal, tests of significance are much more credible when researchers pre-register their analyses than when they do not.

  18. Dave Meyer says:

    Great Long-Lasting Science is based on Grand Fundamental Empirical Regularities that are Invariant across space, time, and other mundane changes of context. In the absence of such discovered regularities, you don’t really have a science; you just have a bunch of casual incoherent ‘messing around’. Sadly, right now, it seems like much of Social Psychology and its close off-shoots are just messing around, not long-lasting Science…


    • Martha (Smith) says:

      Your first sentence may apply well to most physics (quantum mechanics being an exception), but biology and its offshoot psychology are more complex, since they depend so much on random events. In those fields, I think we need to frame things in more probabilistic terms.

      • Glen M. Sizemore says:

        If a scientist (or “scientist” as the case may be) finds her or himself feeling like a victim of “randomness,” then it is likely that he or she is not controlling variables. That is, not exerting experimental control over the subject matter. That pretty much characterizes mainstream psychology which is why such psychologists spend so much time with statistical inference. They are generally unable to exert experimental control which would allow them to directly demonstrate that they “have their hands” on potent independent-variables. Needless to say, the natural science of behavior (as opposed to mainstream psychology) doesn’t have this problem.

        • Allan Cousins says:


          But I do believe Martha is correct in asserting that the fields she mentions (and others as well) should frame discussions in terms of probabilistic reasoning. At least while causal models are underdeveloped.

          • Glen M. Sizemore says:

            What do you mean by “probabilistic reasoning”? For example, my opinion is that rate of a response (class) is an important – maybe the most important – dependent-variable for a science of behavior. It certainly seems to get at the issue of “what is the probability that this animal will do X?” Is that what you mean by “probabilistic reasoning” – I sort of doubt it. But, if so, what is special about rate? In the experiments I used to do, rate was simply the compound dimensional quantity count/time. And I can tell you a lot about what affects rate of response, especially in the laboratory looking at “small animals in small spaces.” Isn’t that “causal”? But, I guess it isn’t a “model.” I’m guessing that you mean something specific by “causal model” that isn’t necessarily apparent prima facie in the English words.

      • Martha (Smith) says:

        When I said that biology and psychology depend so much on random events, I did not have in mind anything like a scientist being a victim of randomness; I was referring to the randomness that enters into the development of each living organism, which leads to the heterogeneity within any species. For example, in human beings, there is randomness in which of the mother’s ova is fertilized by which of the father’s sperm; there is randomness in every cell division leading to the resultant human being (the evidence is strong that the two daughter cells are usually not perfect copies of the cell that divided); there is also randomness in the environmental factors that influence the development of the individual.

        • Glen M. Sizemore says:

          Well…there is no question that everything except reversible physics “has a history” that could affect the impact of variables but I’m not sure what your claim is concerning the conduct of science. I think, for example, that a science of behavior need never concern itself with inferential statistics. Isn’t that what you mean by: “…frame things in more probabilistic terms”? I’m asking…maybe I don’t know what you are saying…

          • Martha (Smith) says:

            No, my statement “frame things in more probabilistic terms” refers to the context of Dave Meyer’s statement, “Great Long-Lasting Science is based on Grand Fundamental Empirical Regularities that are Invariant across space, time, and other mundane changes of context,” at the beginning of this thread.

            To elaborate: If one is looking for “Grand Fundamental Empirical Regularities” in biological creatures (e.g., humans), unless one is considering only a single individual, any such regularities need to be framed in probabilistic terms, because there is almost always variation from individual to individual within any given species or other biological taxonomic group.

            • Glen M. Sizemore says:

              Martha: To elaborate: If one is looking for “Grand Fundamental Empirical Regularities” in biological creatures (e.g., humans), unless one is considering only a single individual, any such regularities need to be framed in probabilistic terms, because there is almost always variation from individual to individual within any given species or other biological taxonomic group.

              GS: Let’s look at a particular example. Say someone wants to know something about the effects of a particular drug on behavior maintained under some schedule of food delivery in, say, rats. So, you take five rats, train them and maintain behavior under the schedule until the rate of response is stable then, from time-to-time, you inject each rat with different doses of the drug until each has received a range of doses each one often multiply determined. Now…say that rate of response is a bitonic inverted U-shaped function of dose for each subject. So…each subject’s D-E function is the same general shape, but the “height” of the function and the location of the max etc. etc. are different for each subject. So…a regularity is “the rate of response X dose function is bitonic for rats under schedule Y.” Now…of course, it may just be these five rats – unlikely, true, but you haven’t looked at every rat. Is that where the probability that you are talking about comes in? Or, is it the quantitative variation in baseline and drug effect across subjects (even though the D-E function is the same shape)? My point is that, if you are looking for regularities in behavior, “probabilistic terms” don’t necessarily come up. But, perhaps this is really talking about probabilities – it’s just that the probability of Drug X producing a bitonic D-E function in a rat is suspected to be 1.0? Just trying to be sure I know where you’re coming from.

              • Martha (Smith) says:

                GS said: “Say someone wants to know something about the effects of a particular drug on behavior maintained under some schedule of food delivery in, say, rats”

                This does not sound like something that could be called a “Grand Fundamental Empirical Regularity”, so it doesn’t fit the context that Dave Meyer was talking about and that I was responding to.

              • Glen M. Sizemore says:

                GS said: “Say someone wants to know something about the effects of a particular drug on behavior maintained under some schedule of food delivery in, say, rats”

                Martha: This does not sound like something that could be called a “Grand Fundamental Empirical Regularity”, so it doesn’t fit the context that Dave Meyer was talking about and that I was responding to.

                GS (new): But what you quoted from me (i.e., “Say someone wants to know something about the effects of a particular drug on behavior maintained under some schedule of food delivery in, say, rats”) doesn’t mention any regularity, grand, fundamental, empirical or otherwise. Are you saying that no demonstrated regularity regarding the effects of drugs on behavior could ever be a “Grand Fundamental Empirical Regularity”? Why might that be? But what you are really saying is that you aren’t going to defend or clarify your vague statements. Right?



        • Fritz Strack says:

          I may be a bit naive (Gelman would certainly agree) but for me, “random” has two meanings:

          1) an event is causally undetermined
          2) an event is causally determined but contaminated by uncontrolled / uncontrollable influences (aka error).

          In the case of human behavior, I go for the 2nd interpretation.
          And as a psychologist, I try to reduce the error and gain a deeper causal understanding.
          Perhaps, that’s the difference between statisticians and other scientists.

          P.S.: Don’t worry about the “big shot”. The other insinuations were a bit more offensive.

          • Andrew says:


            Several things here.

            First, there’s nothing wrong with you being naive. None of us can be experts in all things. Statistics is not your area of expertise, and that’s fine. What’s important is to respect the limitations of one’s expertise.

            Second, regarding randomness, see the relevant paragraph of my post above and also this comment.

            Third, I don’t think there’s the distinction that you describe between statisticians and other scientists. The replication movement in psychology is led by . . . psychologists. So you might want to take it up with Eva Ranehill, Anna Dreber, Bobbie Spellman, Uri Simonsohn, Brian Nosek, etc etc. Speaking more generally, of course statisticians want to “reduce the error and gain a deeper causal understanding.” That’s what the field of experimental design is all about.

            Fourth, that’s too bad that you’re offended by my opinion that you have made scientific errors. I’m sure it’s too late, but once again I’d suggest that, instead of nursing your resentment, you consider the possibility that your original statistical analysis was in error and that your study did not imply what you thought it did. I understand that it’s hard to let go, especially after decades of recognition. But, remember, it’s not just me! See here. It’s never too late to learn.

        • Alex Gamma says:

          >(the evidence is strong that the two daughter cells are usually not perfect copies of the cell that divided)

          I’d be highly interested in reading up on this. Could you give me some references?

          • Anoneuoid says:

            >(the evidence is strong that the two daughter cells are usually not perfect copies of the cell that divided)

            I’d be highly interested in reading up on this. Could you give me some references?

            Supposedly you’d expect ~200 point mutations per division and a chromosomal missegregation every 100 or so divisions. These rates increase if a cell becomes “genetically unstable”. Also, perhaps when a cell divides one of the daughters retains the “original” DNA, while the other gets the newly synthesized DNA strands. Those are only the genetic differences; there can of course be others.

            “The average mutation rate was estimated to be ~2.5 x 10^-8 mutations per nucleotide site or 175 mutations per diploid genome per generation.”


            “The overall mutation rate in somatic human cells has been estimated at 1.4 x 10^-10 nucleotides/cell/division or 2.0 x 10^-7 mutations/gene/cell division”


            “Twelve mutations were confirmed in ~10.15 Mb; eight of these had occurred in vitro and four in vivo. The latter could be placed in different positions on the pedigree and led to a mutation-rate measurement of 3.0 x 10^-8 mutations/nucleotide/generation (95% CI: 8.9 x 10^-9 – 7.0 x 10^-8), consistent with estimates of 2.3 x 10^-8 – 6.3 x 10^-8 mutations/nucleotide/generation for the same Y-chromosomal region from published human-chimpanzee comparisons [5] depending on the generation and split times assumed.”


            “Chromosome missegregation rates determined by this method probably underestimate the total chromosome missegregation rate because single chromosomes contained in micronuclei are missed in the FISH analyses. Nevertheless, the rate of chromosome missegregation in untreated RPE-1 and HCT116 cells is 0.025% per chromosome and increases to 0.6 – 0.8% per chromosome upon the induction of merotely through mitotic recovery from either monastrol or nocodazole treatment (Fig. 3 C). These basal and induced rates of chromosome missegregation are similar to those previously measured in primary human fibroblasts (Cimini et al., 1999).

            Assuming all chromosomes behave equivalently, RPE-1 and HCT116 cells missegregate a chromosome every 100 cell divisions unless merotely is experimentally elevated, whereupon they missegregate a chromosome every third cell division. Chromosome missegregation rates in three aneuploid tumor cell lines with CIN range from 0.3 to 1.0% per chromosome (Fig. 3 C). Depending on the modal chromosome number in each cell line, these cells missegregate a chromosome every cell division (Caco2), every other cell division (MCF-7), or every fifth cell division (HT29).”
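
The step from a per-chromosome rate to "every 100 cell divisions" in the passage quoted above can be checked with a little arithmetic. This is a rough sketch, assuming 46 chromosomes per cell (the quoted text does not state the number it used) and treating the per-division rate as simply the per-chromosome rate times the chromosome count; the function name is hypothetical.

```python
def divisions_per_missegregation(rate_per_chromosome_pct, n_chromosomes=46):
    """Expected number of cell divisions per missegregation event.

    rate_per_chromosome_pct: missegregation rate per chromosome, in percent.
    n_chromosomes: assumed chromosome count (46 is an assumption, not from the source).
    """
    per_division = rate_per_chromosome_pct / 100.0 * n_chromosomes
    return 1.0 / per_division


basal = divisions_per_missegregation(0.025)        # untreated cells
induced_low = divisions_per_missegregation(0.6)    # lower end after induced merotely
induced_high = divisions_per_missegregation(0.8)   # upper end after induced merotely

print(f"basal: one missegregation every {basal:.0f} divisions")
print(f"induced: one every {induced_high:.1f} to {induced_low:.1f} divisions")
```

With these assumptions the basal figure comes out a little under 90 divisions, which is in line with the quoted "every 100", and the induced range brackets "every third cell division". The tumor cell lines are left out because their modal chromosome numbers are not given in the quoted text.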


    • Fritz Strack says:

      The “facial (or bodily) feedback hypothesis” was first proposed by Charles Darwin (1809 – 1882, British naturalist and biologist)

      • Glen M. Sizemore says:

        As ol’ Fred Skinner said, what is “spected” when we introspect is our own behavior (or perhaps one should say “stimuli generated by our own behavior”). That doesn’t mean that stuffin’ a pen in your yap is gonna make you happy. In fairness, sometimes it is difficult to tell whether Gelman is criticizing a notion a priori or is attacking the experiment per se. He frequently seems to be attacking “embodied cognition” as ridiculous a priori – unfortunate since EC has a chance (a slim one albeit) to help mainstream psychology become a science (there already is a natural science of behavior – it just isn’t mainstream). The good thing about EC is that it is closet behaviorism. The bad thing is that it’s in the closet.

  19. Milton Strauss says:

    Anyone here old enough (other than me) to remember the extensive literature on experimenter expectancy effects (Robert Rosenthal in the 60s and early 70s) and (to bring in “soft psychology”) allegiance effects in clinical trials in psychology and psychiatry? Strack’s position is anti-scientific. If replication is so difficult, how much generalizability might one expect? We are still a pretty soft science and ease of replication seems necessary for confidence that there is something theoretically interesting to explore. Too much is just a house of cards.

  20. Jordan Anaya says:

    Oh, I forgot to point out that Wansink used to be at the University of Illinois. Maybe he got all his statistical advice from Fritz.

    • Carol says:

      Hi Jordan,

      No. Fritz Strack was at the University of Illinois as a post doc in the late 1980s. Brian Wansink was at the University of Illinois as a professor from 1997 to 2005. So, they did not overlap.


  21. psyoskeptic says:

    Strack’s responses here are primarily in defence of the truth of the finding and do not address the flaws in the paper. It’s really pretty embarrassing. Just because you happen upon something that might even be replicable in a limited case (that you didn’t know about) using questionable practices doesn’t all of a sudden make them OK.

    Getting back to the earlier posts about why people get defensive, something that is rarely discussed is the degree of motivation. I don’t know about Strack, and don’t care to look it up, but people like Cuddy and Bialystok aren’t just defending an idea. They receive criticism that attacks not just a hypothesis but the entire foundation to large amounts of funding from multiple sources, their salaries, careers, employees, everything, in some cases their entire world as they know it would change would it be true that they were wrong. They don’t have diversified research interests and everything relies on the one finding being true. If it’s not then they’re toast. I’m betting more diversified researchers tend to be less defensive overall. I think of Kahneman’s response to the whole priming thing, or the acceptance by Klein (one of Bialystok’s early co-authors) of the non-replicability of the bilingual advantage. Neither of them needs the original findings to be true. Klein has many irons in many fires. In Kahneman’s case he had alternative, more solid support of his own for many of his ideas. Even Cuddy’s co-author who retracted support has completely moved on, so it was of little cost to her.

    • Martha (Smith) says:

      “I’m betting more diversified researchers tend to be less defensive overall.”

      Interesting possibility. Quite plausible.

    • Anonymous says:

      “I don’t know about Strack, and don’t care to look it up, but people like Cuddy and Bialystok aren’t just defending an idea. They receive criticism that attacks not just a hypothesis but the entire foundation to large amounts of funding from multiple sources, their salaries, careers, employees, everything, in some cases their entire world as they know it would change would it be true that they were wrong. They don’t have diversified research interests and everything relies on the one finding being true. If it’s not then they’re toast.”

      In the replies to the following post Strack (assuming it is him) states the following:

      “During my career as a social psychologist, I have been predominantly focussing on judgments and social cognition. Together with a colleague, I have developed a dual-systems models that has been frequently cited, much more often than the “classic” pen study, that is certainly not my main identification.”

  22. Fritz Strack says:

    You may not like it, but that’s science
    (current metaanalysis on facial feedback)

    Note: Science is more than statistics.

    • Statistical meta analysis of literature full of meaningless NHST finds publication history supports researchers’ favorite hypothesis (p < 0.05).
      Stay tuned for details at 11:00

      • Andrew says:


        I clicked on the link. It included this bit: “p < .000000005.” I guess their data really really didn’t come from that particular random number generator!

        Fritz: I agree that science is more than statistics. In particular, I think that scientific understanding has very little to do with learning that your data did not come from a particular random number generator.

        • Psychology Researcher: Some people are claiming that our entire literature is full of unreplicable meaningless cargo cult studies in which people find whatever it is they want to find and figure out a way to put a p value on it. Some of these people are even well respected leaders in our field who have been saying it for decades such as Meehl. People say this invalidates process X which we’ve been studying for decades.

          Second Psychology Researcher: Nah, that can’t be right, let’s meta-analyze the literature to see if we can find consistent support for process X.

      • Keith O'Rourke says:

        Apart from the likely near impossibility of doing a credible meta-analysis of published studies (as if one can control for publication bias) – there does seem to be the usual – the primary claim was not significant but hey look at this other outcome that was significant!

        “However, after controlling for publication bias, there is no evidence that facial feedback influences perceptions of affective quality, but evidence that it does influences emotional experience.”

    • Anoneuoid says:

      Note: Science is more than statistics.

      I don’t know anything about your work, but I have seen/heard a similar argument before. It is like when errors are found in a paper but they seem to rarely affect the conclusions; the same goes for statistical technique. People will claim the statistics just aren’t very important to their actual understanding (despite the fact that “p less than 0.05” is everywhere in their papers and spreadsheets…why?).

      I have even been told that replications are all being done without publication, that the results are being spread by word of mouth at conferences, via email, etc., and that this is the real reason to believe or not believe something is going on. The published literature is just irrelevant, so it is impossible for an outsider to even have an opinion on the topic.

    • Ben Prytherch says:

      It seems Prof. Strack still thinks that all of the criticism leveled at him here and over at Neuroskeptic and in Nature and in Slate is primarily coming from people who want to show that the facial feedback hypothesis is false.

      No, this criticism is coming from people who couldn’t give a damn either way about facial feedback. The criticisms are of Strack’s assertion that failed pre-registered replications don’t count for much and that these whole “reproducibility” and “publication bias” worries are overblown and that the next course of action following a 17-study replication paper that turned up nothing is to figure out why half delivered a negative sign and the other half a positive one, because after all when you think about it there’s no such thing as chance.

      The validity of the criticisms leveled here and elsewhere is independent of whether or not putting a pencil in your mouth really makes jokes seem funnier. But Strack has returned after a month and a half – not to answer anything of substance, but to flaunt a preprint of a meta-analysis that claims to find evidence that there is maybe something to facial feedback after all.
