Facial feedback: “These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.”

Fritz Strack points us to this article, “When Both the Original Study and Its Failed Replication Are Correct: Feeling Observed Eliminates the Facial-Feedback Effect,” by Tom Noah, Yaacov Schul, and Ruth Mayo, who write:

According to the facial-feedback hypothesis, the facial activity associated with particular emotional expressions can influence people’s affective experiences. Recently, a replication attempt of this effect in 17 laboratories around the world failed to find any support for the effect. We hypothesize that the reason for the failure of replication is that the replication protocol deviated from that of the original experiment in a critical factor. In all of the replication studies, participants were alerted that they would be monitored by a video camera, whereas the participants in the original study were not monitored, observed, or recorded. . . . we replicated the facial-feedback experiment in 2 conditions: one with a video-camera and one without it. The results revealed a significant facial-feedback effect in the absence of a camera, which was eliminated in the camera’s presence. These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.

We’ve discussed the failed replications of facial feedback before, so it seemed worth following up with this new paper that provides an explanation for the failed replication that preserves the original effect.

Here are my thoughts.

1. The experiments in this new paper are preregistered. I haven’t looked at the preregistration plan, but even if not every step was followed exactly, preregistration does seem like a good step.

2. The main finding is that facial feedback worked in the no-camera condition but not in the camera condition:

3. As you can almost see in the graph, the difference between these results is not itself statistically significant—not at the conventional p=0.05 level for a two-sided test. The result has a p-value of 0.102, which the authors describe as “marginally significant in the expected direction . . . . p=.051, one-tailed . . .” Whatever. It is what it is.
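For what it's worth, the two numbers are mutually consistent: for a symmetric test statistic, the one-tailed p-value is half the two-tailed one when the estimate points in the predicted direction. A quick sketch (the z-score here is my back-calculation from the reported p-values, not a number taken from the paper):

```python
from math import erfc, sqrt

def upper_tail_p(z):
    # One-sided (upper-tail) p-value for a standard-normal test statistic
    return 0.5 * erfc(z / sqrt(2))

z = 1.635  # roughly the z implied by the reported p-values (back-calculated, not from the paper)
p_one = upper_tail_p(z)      # one-tailed p, effect in the predicted direction
p_two = 2 * upper_tail_p(z)  # two-tailed p
print(round(p_one, 3), round(p_two, 3))  # 0.051 0.102
```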

4. The authors are playing a dangerous game when it comes to statistical power. From one direction, I’m concerned that the studies are way too noisy: it says that their sample size was chosen “based on an estimate of the effect size of Experiment 1 by Strack et al. (1988),” but for the usual reasons we can expect that to be a huge overestimate of effect size, hence the real study has nothing like 80% power. From the other direction, the authors use low power to explain away non-statistically-significant results (“Although the test . . . was greatly underpowered, the preregistered analysis concerning the interaction . . . was marginally significant . . .”).
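To make the power worry concrete, here's a rough normal-approximation calculation, with illustrative effect sizes and sample sizes of my own choosing (not the paper's actual numbers): a sample size planned for 80% power at an inflated published effect can leave you with power around 20% for a more realistic true effect.

```python
from math import erfc, sqrt

def normal_sf(x):
    # Upper-tail probability of the standard normal distribution
    return 0.5 * erfc(x / sqrt(2))

def two_sample_power(d, n_per_group, z_crit=1.96):
    # Normal-approximation power of a two-sided, alpha = .05 two-sample
    # comparison when the true standardized effect is d
    return normal_sf(z_crit - d * sqrt(n_per_group / 2))

# Illustrative numbers only: 26 per group is roughly what you'd plan
# for 80% power if you believed a published effect of d = 0.8.
print(round(two_sample_power(0.8, 26), 2))  # ~0.82: the planned power
print(round(two_sample_power(0.3, 26), 2))  # ~0.19: actual power if the true effect is d = 0.3
```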

5. I’m concerned that the study is too noisy, and I’d prefer a within-person experiment.

6. In their discussion section, the authors write:

Psychology is a cumulative science. As such, no single study can provide the ultimate, final word on any hypothesis or phenomenon. As researchers, we should strive to replicate and/or explicate, and any one study should be considered one step in a long path. In this spirit, let us discuss several possible ways to explain the role that the presence of a camera can have on the facial-feedback effect.

That’s all reasonable. I think the authors should also consider the hypothesis that what they’re seeing is more noise. Their theory could be correct, but another possibility is that they’re chasing another dead end. This sort of thing can happen when you stare really hard at noisy data.

7. The authors write, “These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.” I have no idea, but if this is true, it would definitely be good to know.

8. The treatments are for people to hold a pen in their lips or their teeth in some specified ways. It’s not clear to me why any effects of these treatments (assuming the effects end up being reproducible) should be attributed to facial feedback rather than some other aspect of the treatment such as priming or implicit association. I’m not saying there isn’t facial feedback going on; I just have no idea. I agree with the authors that their results are consistent with the facial-feedback model.

P.S. Strack also points us to this further discussion by E. J. Wagenmakers and Quentin Gronau, which I largely find reasonable, but I disagree with their statement regarding “the urgent need to preregister one’s hypotheses carefully and comprehensively, and then religiously stick to the plan.” Preregistration is fine, and I agree with their statement that generating fake data is a good way to test it out (one can also preregister using alternative data sets, as here), but I hardly see it as “urgent.” It’s just one part of the picture.


  1. Not my field. So, I’ve tried holding a pen between my teeth and between my lips. I find the latter is harder to control, so I have to concentrate more on what I am doing. I’m not convinced that smiling has much to do with it at all. I have to wonder if this line of research would pass Raghuveer’s proposal in the prior post about being a “big” idea.

  2. “In all of the replication studies, participants were alerted that they would be monitored by a video camera, whereas the participants in the original study were not monitored, observed, or recorded.”

    Hmm, are we sure about this latter part?

    From Strack’s original paper:

    “Subjects were run in groups of 4. After their arrival, the subjects were each assigned a cubicle that prevented communication between the subjects but allowed them to communicate with the experimenter through an open space.”

    I read further through Strack’s paper, and I couldn’t find anything about whether the experimenter left the room or not. Based on the outline of the procedure in Strack’s paper, my guess is that the experimenter may very well have stayed in the room, and thus could very well have “monitored” or “observed” the participants in the original study!

    (Side note: if the experimenter did leave the room, how could Strack et al. know the participants actually followed instructions regarding holding pens in their mouths?)

    If the experimenter did not leave the room, I would argue that the subjects could very well 1) have been “monitored” or “observed” in the original study (contrary to what this new study claims in its abstract, see quote above), and 2) may have felt “being watched” in the original study, which may have had “theoretically meaningful changes in the outcome”.

    • Fritz Strack posted a (self-congratulatory-seeming) link to this post on the Facebook Psych Methods Discussion group, and I asked him there if the experimenter(s) left the room, and, if not, if the participants were monitored and/or observed. No response yet.

        • “No, the experimenter did not leave the room and the participants were not monitored.”

          1) If the experimenter did not put on a blindfold or something like that, I am assuming the experimenter may have “observed” the participants in the original study.

          2) I reason “monitored” is a valid synonym for “observed” (https://www.thesaurus.com/browse/observe)

          If 1) and 2) are correct, I repeat my point from above: the participants in the original study may very well have been, and felt, “observed” and “monitored,” which could have had “theoretically meaningful changes in the outcome” of the original study.

  3. H0: There are no group differences.

    H1: “These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.”

    H2: “Any series of statistically significant coefficients can be interpreted within the broad theory.”

  4. “I’d prefer a within-person experiment”. Andrew, could you elaborate a bit on this? I’ve seen you discuss it before on a few separate blog posts as a means to address the noise and variability between groups, but then wouldn’t you lose the benefits of contrasts in clinical trials and things they address (regression to the mean, placebo effects etc)?

      • In social psychology, within-subjects designs always carry the risk of demand effects that may operate both in favor of and against the experimenter’s presumed hypothesis. It’s an easy way to increase the power, but it may undermine the validity of the results.

    • I’ve seen many pictures of how one is supposed to look when holding the pen in their mouth, and I’ve now spent way too much time in front of the mirror with a pen in my mouth, but I don’t know what’s wrong with my face since my expressions look nothing like in those pictures (e.g., in Fig. 1 of that study). I don’t get why one should be cringing so much when holding the pen between their teeth (while not letting it touch lips).

      Also, like Dale, I don’t really get what any of this has got to do with smiling. Holding the pen between teeth felt a bit easier compared to between lips, which was a bit more straining, and I felt like an idiot sucking on a pen, then started to get some dirty thoughts after which I got paranoid and started wondering if this is some large-scale meta-psychological study about suckers suckin’ on pens for no reason. Perhaps holding the pen between the teeth is just slightly more comfortable and less humiliating/paranoia-inducing. I don’t know, uh, I need a cup of tea to clear my mind.

      Also, I was thinking…. how about using a device similar to the one used in the re-neducation center in an episode of The Simpsons. That’d be somethin’!

      • Quite a possibility! Lots of research has shown that experienced effort (e.g., by contracting the eyebrows) enters into all kinds of judgments.

        I very much appreciate this discussion (along with Andrew’s comments) because it is not about “real” or “false positive” but about the arguments that are supported or not supported by the data and the procedure that has created them.

        As I have frequently argued: There is no direct route from data to truth.

    • Quote from above: “Here is a more recent study (Emotion, in press) replicating the original pen procedure with greater significance and power”

      If it is true that “minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes”, it could be important (if not crucial) to mention that (if I am not mistaken) the more recent study referred to in the above post did not replicate the exact procedure of the original study (as the authors of this recent study themselves mention in their paper). To name just 3 differences between this recent study and the original Strack et al. study:

      1) This recent study seems to have tested participants in a classroom setting, whereas in the original study subjects were run in groups of 4 and placed in a cubicle that prevented communication between the subjects but allowed them to communicate with the experimenter through an open space.

      2) Furthermore, this recent study seems to have used a 7-point scale and not a 10-point scale as used in the original study to rate the cartoons.

      3) And if I am not mistaken, it seems that in this recent study participants took the pen out of their mouths at the moment of, and to be able to, rate the cartoons. Note that this did not happen in the original paper, where (in study 2) “Half of the subjects held the pen with their lips (or teeth) both when they were presented with the humorous stimuli and when they rated them. The remaining subjects were instructed to hold the pen in the appropriate position only when they gave their ratings.” (see http://datacolada.org/wp-content/uploads/2014/03/Strack-et-al-1988-cartoons.pdf)

      I guess only certain “minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes”. Or perhaps only at a certain time and/or place. Is it still possible to make anything of all these “findings” in a “theoretically meaningful manner”…

      • Basic science is about testing theories and not about generalizing experimental findings. Minute differences under highly controlled conditions may be theoretically diagnostic while the effects may not be robust enough to be generalized to natural contexts. That’s the difference between basic and applied science.

        • It needs to be clear what is actually predicted by the theory, and what is post-hoc altered in the theory to match the measurements.

          The biggest advantage in pre-registering is really if the theory can be made to predict precise quantitative results and they are borne out. In particular, if the theory ahead of time predicts how the “minute differences under highly controlled conditions” should come out, then you really do have a claim to science. I fear too often we have post-hoc post-dictions to explain why minute differences occurred. Even worse, if the minute differences didn’t really occur but instead it’s all down to measurement difficulties.

        • It seems to me that a good theory should, along with any necessary auxiliary hypotheses, indicate how and when results generalize. There’s no necessary trade off between theory testing and generalization.

        • A good theory should at least be able to say when it would predict generalization, and when it would predict unknown results (ie. might or might not generalize). If the theory fails to generalize to conditions it predicts should generalize that itself is a kind of theory test.

          In a Bayesian context this could be in terms of a massive widening of the likelihood in regions where it doesn’t necessarily expect to predict, and relative narrow likelihood in regions where it does expect to predict.

        • Psychiatrists are currently treating depression by suppressing corrugator contractions (via Botox). Although this is a direct application of facial-feedback theory, it was neither predicted by Charles Darwin (who is the theory’s author) nor preregistered (as far as I know) in any study.

        • You replied to my comment: “Minute differences under highly controlled conditions may be theoretically diagnostic while the effects may not be robust enough to be genralized to natural contexts”

          When i wrote my comment, i wasn’t even thinking about “applied science”, just about “basic science”.

          From my point of view we have 3 recent findings presented in this discussion here: 1) the large scale Registered Replication Report with the video camera not showing the effect, 2) the Noah et al. study with and without the video camera, respectively not showing and showing the effect, and 3) the “classroom”-study without camera showing the effect.

          Now, what could we learn from these recent findings concerning “basic science”? Did the video and no video-Noah et al. study test anything concerning “feeling observed” or any other “theoretically” based hypothesis? They just used a video and no-video condition (if i understood things correctly). They even speculate on page 661 of their paper that these findings could be due to “feeling observed” and/or “accountability” and refer to lots of papers, but nothing is really tested concerning any of these possible mechanisms.

          From my perspective, due to several problematic issues in psychological science, you don’t have a valid and informational “body of work” to even base “theories” or “hypotheses” on. Furthermore, the “theories” are so broad and vague that they can be used to explain or predict just about anything. It seems to me you can just go through the literature and pick and choose what findings and/or “theory” you want to use to build whatever case you want.

          Just like some people could use the Noah et al. paper to build the case for how “observing” participants influences their facial feedback process, others can build the case for how this isn’t the case because in the original study, and the other “classroom”-study referred to here in the comment section, participants were also “observed” (by the experimenter).

          Just like some people could use the Noah et al. study to build the case for how “minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes”, others can build the case for how this isn’t the case because of all the other “minute differences in the experimental protocol” that did not lead to changes in the outcome of some studies.

  5. I just noticed the full title of the paper is “When Both the Original Study and Its Failed Replication Are Correct: Feeling Observed Eliminates the Facial-Feedback Effect”. Please correct me if I am wrong, but I cannot find anything about testing “feeling observed” in the paper.

    I think they may not even have tested “feeling observed” in this study, which is in line with their own writing, which states that “The presence (vs. absence) of the camera might have induced the feeling of being observed, (…)” (p. 661) (emphasis on the “might” here).

    Altogether, I think the following title may have been more suitable: “When Both the Original Study and Its Failed Replication Are Correct: Recording the Participants During the Experimental Task With a Video Camera Eliminates the Facial-Feedback Effect”

    • I’m not sure what you aim to achieve by citing this article – it makes me less convinced. 24 runners? Was it randomized and how many confounding variables might possibly have impacted the results? This is sounding more and more like the power pose studies. In fact, I’m not sure there is really a lot of difference between these lines of research.

      • It may be the beginning of an interesting research program. Like the treatment of depression by suppressing corrugator (eyebrow) contractions. Who knows?

        • If you can improve my golf game by teaching me to smile and feel powerful while swinging, then I’m in. Otherwise, I’ll pass on this research program.

  6. From the Noah et al. paper: “As researchers, we should strive to replicate and/or explicate, and any one study should be considered one step in a long path”

    I wonder how researchers could stop potentially going round in circles.

    If you only look at the road a few steps ahead, and/or never turn around to see where you have been, you could end up going round in circles.

    It may then look like you are participating in a “cumulative science”, and making “slow progress”, but you may in fact not be…

    In light of this pondering, I’d like to refer to a paper (cited by the original Strack et al. paper, but not by this new Noah et al. paper) by Buck (1980) about the “facial feedback hypothesis”. In 1980 (almost 40 years ago!), based on many papers, findings, and theories, Buck wrote the following in the paper’s conclusion section:

    “This article argues that the evidence for theories positing that facial feedback has a major causal role in emotional processing is unconvincing.”

    “At present there is insufficient evidence to conclude that facial feedback is either necessary or sufficient for the occurrence of emotion, and the evidence for any contribution of facial feedback to emotional experience is less convincing than the evidence for visceral feedback.”

    “It is argued instead that facial expression has evolved in humans as a means of affective communication and that facial expressions and other nonverbal behaviors provide a controlled “readout” of central affective processes.”

    (Side note 1: if these conclusions by Buck from 1980 make sense, couldn’t that imply that “feeling observed” might be a crucial “theoretically important” aspect of an experiment concerning eliciting “the facial-feedback effect” but in just about the exact opposite way that Noah et al. seem to reason?

    If facial expressions may have evolved in humans as a means of affective communication, and provide a controlled readout of central affective processes, wouldn’t it then make (“theoretical”) sense that feeling observed in a facial-feedback experiment might be crucial to elicit and/or enhance the facial-feedback effect?)

    (Side note 2: here’s “Bad Company” with “Crazy Circles”: https://www.youtube.com/watch?v=g0tb9Tp1nVA)

    • “If facial expressions may have evolved in humans as a means of affective communication, and provide a controlled readout of central affective processes, wouldn’t it then make (“theoretical”) sense that feeling observed in a facial-feedback experiment might be crucial to elicit and/or enhance the facial-feedback effect?”

      It may well be the case that people’s experience is not affected, but research suggests that the use of this experience as a basis of judgments (e.g., of the funniness of cartoons) is undermined if a camera is directed on them.

      • Quote from above: “It may well be the case that people’s experience is not affected, but research suggests that the use of this experience as a basis of judgments (e.g., of the funniness of cartoons) is undermined if a camera is directed on them.”

        I am trying to make sense of this all on a “basic science” and “theoretical” level. Please note that “facial feedback affecting experience”, if I understood things correctly, is just about the absolute core of the “facial feedback hypothesis” according to Strack et al.’s (1988) paper, where the following is written: “Although distinctions were made among several variants of this hypothesis (e.g., Buck, 1980; Winton, 1986), its core is the ‘causal assertion that feedback from facial expressions affects emotional experience and behavior’ (Buck, 1980, p. 813).”

        Your comment raises 3 questions:

        1) Am I understanding you correctly here that you are now possibly agreeing with (the gist of) Buck’s 1980 conclusions that “facial feedback” might not (substantially at least) affect people’s emotional experience?

        2) Does that mean you (no longer?) think the “facial feedback hypothesis” is worthy of further study? (because if “facial feedback” does not (substantially at least) affect people’s emotional experience after all, why should anyone possibly waste resources to investigate it further)

        3) Does this mean you would agree that a study that, for instance, uses cameras that (it says) record people who hold a pen in their mouth in strange ways while rating the funniness of cartoons is actually possibly investigating the effect of cameras on judgments (and not, for instance, something like “facial feedback”)?
