Turbulent Studies, Rocky Statistics: Publicational Consequences of Experiencing Inferential Instability

Someone writes in:

In the most recent absurd embodiment paper on wobbly stools leading to wobbly relationship beliefs, Psych Sci asked for a second study with a large N. The authors performed it, found no effect and then performed a mediation analysis to recover the effect. It’s a good example for garden of forking paths given that the mediation analysis is decided on post hoc and there are a number of ways to approach the problem.

No kidding! The article, by Amanda Forest, David Kille, Joanne Wood, and Lindsay Stehouwer, is called, “Turbulent Times, Rocky Relationships: Relational Consequences of Experiencing Physical Instability.” Almost a parody of a Psych Science tabloid-style paper. From the abstract:

Drawing on the embodiment literature, we propose that experiencing physical instability can undermine perceptions of relationship stability. Participants who experienced physical instability by sitting at a wobbly workstation rather than a stable workstation (Study 1), standing on one foot rather than two (Study 2), or sitting on an inflatable seat cushion rather than a rigid one (Study 3) perceived their romantic relationships to be less likely to last. . . . These findings indicate that benign physical experiences can influence perceptions of relationship stability, exerting downstream effects on consequential relationship processes.

This is no joke. Here’s how the paper begins:

The earthquake that struck Sichuan, China, in 2008 made headlines not only because of the tremendous loss of life it caused, but also because after the quake, Sichuan came to lead the country in number of divorces (Zhiling, 2010). Experts and popular media outlets made causal claims (e.g., “Earthquake Boosts Divorce Rate,” 2010). If the earthquake truly caused changes in Sichuan’s divorce rate, why might this be? Emotional distress, financial hardship, and mortality salience may well contribute. Sociologist Guang Wei speculated that Sichuan residents “decided to live each day to the fullest . . . if they do not get along with their spouses, they decide to part ways” (Zhiling, 2010, paras. 9–10). We examine a different feature of earthquakes that may affect relationships: physical instability.

Don’t get them wrong, though. They very graciously admit to not having the whole story:

Certainly, the shaking ground was not solely responsible for the change in the divorce rate in Sichuan.

That’s a relief!

Getting to my correspondent’s criticisms, yes, lots of forking paths in preparing the dataset:

Data for 54 participants were collected. Because our main measure was perceived stability of a person’s relationship with a particular partner, only data from participants in exclusive romantic relationships were included in the analyses (36 exclusively dating, 8 cohabiting, 2 engaged, and 1 married) . . . Data from 3 participants—1 who stood instead of sitting and 2 who communicated with friends while completing the questionnaire—were omitted from analyses. . . . Participants responded to six items (α = .96) regarding the stability of their current romantic relationship. These included the four items from Study 1 as well as similar items querying confidence in still being together in 10 and 20 years. . . . Data from 4 participants—1 who reported not having followed the posture instructions and 3 who correctly guessed the study hypothesis—were omitted from analyses. . . .

And forking paths in the analysis:

We averaged participants’ ratings of their beliefs that they would remain with their partners over each of the four time periods in the items on relationship stability. . . . Mediation analysis using PROCESS Model 4 revealed a significant indirect effect of condition on reports of relationship quality via perceived relationship stability . . .

Lots and lots of mediation analyses. But what happened to the main effect in the replications?

The physical-stability manipulation used in this study did not produce significant condition differences in negative affect. . . . Contrary to prediction, a one-way ANOVA yielded no evidence of a direct effect of condition on perceived relationship stability.

And our old friend, the difference between significant and non-significant:

Participants’ experience of negative affect did not differ between the two conditions, F < 1, which suggests that mood is not a viable alternative explanation for the observed condition differences. . . . A model in which the order of perceived relationship stability and relationship quality was reversed did not yield a significant indirect effect . . .

And good old “marginally significant”:

Although we observed only a marginally significant effect of posture condition on perceived relationship stability, it is widely accepted that indirect (i.e., mediated) effects can be examined even in the absence of any direct link between a predictor and outcome.


The research team found a newsworthy result which did not appear in the replications. But that didn’t stop them from doing some mediation analyses and finding some statistical significance and some non-significance in various places along their forking paths. They wove this together and wrote it up as if they’d discovered something important.

Let’s check the score. Again, from the abstract:

Participants who experienced physical instability by sitting at a wobbly workstation rather than a stable workstation (Study 1), standing on one foot rather than two (Study 2), or sitting on an inflatable seat cushion rather than a rigid one (Study 3) perceived their romantic relationships to be less likely to last.

Is this true? For study 1, yes, after all their choices in data construction and data analysis, they achieved “p less than .05” (p=.034, to be precise). For study 2, despite their flexibility in excluding people and defining the outcome, they were only able to get p down to .069. For study 3, nothing at all, “F less than 1,” as they put it. And they really did have lots of things to win—they bought lots of tickets for the “p less than .05” lottery. For example:

For our behavioral measure, participants were asked to select and send a “thinking of you” electronic greeting card (e-card) to their romantic partners. Each participant chose an e-card design from six choices that had been prerated and selected to vary in intimacy (for details, see Supplemental Material). The intimacy of the card design selected was one outcome of interest. However, we observed no direct or indirect effects of stability condition on card design intimacy, so we do not discuss it further.

The authors get full credit for reporting this—but no credit for realizing what this sort of thing does to their analysis! They consistently report their successes in detail and downplay the null findings. That’s called capitalizing on chance.
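
To get a feel for how many tickets it takes, here is a minimal simulation sketch. It is not a reanalysis of the paper: the sample size, the number of outcomes, and the function names are all made up for illustration, and the outcomes are treated as independent pure noise, which real measures taken on the same participants would not be. Under those assumptions, it estimates how often a two-condition study with no true effect on anything still yields at least one “p less than .05.”

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def null_study_pvalues(n_per_group, n_outcomes, rng):
    """Simulate one two-condition study with no true effect on any outcome.

    Every outcome is unit-variance Gaussian noise (an assumption of this
    sketch), so a z-test with known variance has exactly the nominal error
    rate for any single comparison.
    """
    se = math.sqrt(2.0 / n_per_group)  # std. error of a difference in means
    pvals = []
    for _ in range(n_outcomes):
        mean_a = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        pvals.append(two_sided_p((mean_a - mean_b) / se))
    return pvals


def any_hit_rate(n_outcomes, n_sims=4000, n_per_group=25, alpha=0.05, seed=1):
    """Share of null studies in which at least one outcome reaches p < alpha."""
    rng = random.Random(seed)
    hits = sum(
        min(null_study_pvalues(n_per_group, n_outcomes, rng)) < alpha
        for _ in range(n_sims)
    )
    return hits / n_sims


def analytic_any_hit_rate(n_outcomes, alpha=0.05):
    """Same quantity in closed form, valid only for independent tests."""
    return 1 - (1 - alpha) ** n_outcomes


if __name__ == "__main__":
    for k in (1, 3, 6, 10):
        print(k, round(analytic_any_hit_rate(k), 3), round(any_hit_rate(k), 3))
```

With six independent null outcomes the closed-form rate is 1 − 0.95^6, roughly 26%; positively correlated outcomes would typically give something lower, but still above 5%. Data-dependent exclusions and choices among mediation models would add further tickets that this sketch does not model.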

Published in Psychological Science: if we reward this sort of research behavior, I see no reason we won’t get lots more of it. I have no reason to think the authors and journal editor are trying to mislead anyone; rather, I’m guessing they’re true believers. They did their own replication and it failed. But they did not do the next step and place their theory and methods under criticism. Too bad.

30 thoughts on “Turbulent Studies, Rocky Statistics: Publicational Consequences of Experiencing Inferential Instability”

    • Pre-registered hypothesis: no-one from there will post anything here. I think they live in their own little world there, where recommending your friends as reviewers and all that good stuff still reigns supreme.

      Perhaps you can’t blame the authors of the study. Maybe they weren’t taught any other way. I know i wasn’t.

      It is also often confusing to hear the different thoughts on, for example, data analysis flexibility from different scholars. For example, Simmons, Nelson, & Simonsohn (http://pss.sagepub.com/content/22/11/1359.full.pdf) show you can make just about anything become “significant” with enough use of little tricks. “That’s bad” i thought when reading that, i should watch out for that in my own studies. But a few years later i read “Encourage playing with data and discourage questionable reporting practices” by Wigboldus & Dotsch, which (it seems to me) makes the point that everything is fine when it comes to data analysis flexibility as long as you report what you did. So, as a young researcher, what do i make of this? Do these two articles contradict each other concerning the take-home message? I think they do, but i’m not even sure. More importantly: i thought Simmons, Nelson, & Simonsohn showed the danger of data analysis flexibility for a credible science, but apparently when reading Wigboldus & Dotsch this isn’t really a problem as long as you report what you did.

      What to make of this?

      • Agreed; the data analysis education in psychology leaves a lot to be desired. But I still think that authors can (and should) be blamed for making these kinds of errors.

        As you noted, these two articles are not incompatible; data exploration is totally fine when it is reported as such, so that scientific peers can adjust their level of confidence in the results. But data analytic decisions that are in any way data-dependent lead to invalidated p-values and false evidence when presented as “confirmatory.”

      • They’re not contradictory articles. Simmons et al don’t endorse strict regimented analysis that precludes looking at potentially interesting effects you hadn’t planned on. W&D encourage you to report everything you did. There is no conflict. As long as you report everything honestly and talk about it reasonably you’re OK. I’m not saying that they entirely agree but I don’t see much conflict.

        Look at what Gelman is criticizing. It’s not that there exists, and was analyzed, an effect here and no effect there. It’s the way those things are discussed and the hidden rationale for many of the decisions on what to report or analyze. Both of the articles you cite discourage these things.

        • AJ and psyoskeptic are right: W&D was not meant to be incompatible with Simmons et al.; it relies on the distinction between confirmation and exploration. Only the first tests a hypothesis. As we argue in our paper, if you want to perform a confirmative experiment, there should be no flexibility: ideally you pre-register your hypotheses, method, and analysis plan, and then carry out your work exactly as you planned. If you want to further explore your data, that’s okay, but you need to be fully transparent about it and strictly separate confirmative from explorative analyses in your reporting. Anything you hit upon in data exploration will simply not constitute confirmation. What value your explorative findings have, is something for the reader to decide (there’s no real agreement on whether exploratory analyses are informative at all, although I personally believe there is merit to data-driven research). In any case, explorative findings need to be replicated to be confirmed.

          Just to be very explicit, our paper was not meant to be justification for those who walk the garden of forking paths without a map (i.e., for those who exploit their researcher degrees of freedom): even when you are transparent about it, the reported findings will always be exploratory. The exploratory findings will not count as confirmation. Instead, the paper was meant to help researchers realise that data exploration is still permissible, as long as you are honest about it.

          With regard to the wobbly chair findings, you wrote: “no-one from there will post anything here. I think they live in their own little world there, where recommending your friends as reviewers and all that good stuff still reigns supreme”. Please don’t generalise. This description may apply to some, but not to all. I am a member of that particular social psych Facebook page. I am very skeptical about these sorts of papers and I know that I am not the only social psychologist. Things are changing in our field.

        • Thank you all for the information! It is much appreciated, as it allows me to try and get a grip on things i view as contradictory/ confusing.

          I think my confusion comes from reasoning beyond the basic exploratory, confirmatory distinction. I (think i) understand that reporting that you explored your data allows others to judge the evidence more appropriately (compared to presenting it as confirmatory when it was not), but (as i understand it) that still leaves the main point i took from the Simmons et al. paper open: the chance of the results presented being false-positives is high when you compare/investigate a lot of variables in an exploratory way…

          I reason that exploratory data analysis severely enhances the chances of false-positives (what i took as the main message from Simmons et al.), i don’t care if you report what you did or not (as i reason that has no effect on the enhancement of the chances of false-positives).

          I still reason that the literature will be cluttered with false-positives when everyone will explore their data in multiple ways and report it as exploratory work. The reporting part will not change that, if i understood things correctly. That’s why i still don’t understand the point of the W & D paper, and that’s why i think the Simmons et al. paper could have just as well been named “False-positive psychology: flexibility in data collection and analysis allows presenting anything as significant” (so without “undisclosed” in the title).

          Should i be wrong about any of this, please correct me, so others can learn from it/ read it. I will not take up more of any of your time. I really appreciate your willingness to try and provide me with information i can possibly learn from. All the best with your work: i hope it will contribute to improving research practices and provide us all with useful/informative evidence.

        • Hi Q, the idea is that when you are transparent about what you did (full disclosure) the gatekeepers (editors/reviewers) should be able to judge whether your approach inflates the false positive error rate. When these gatekeepers want confirmatory evidence, exploration will not do and a transparently reported exploration should be rejected. There are also ways to explore data while controlling the false positive rate. We do not argue that all exploration should be published if it’s transparent. In almost all cases an additional confirmatory study would be needed.

        • Ron and Q,

          Unfortunately, the “gatekeepers” do not always do their job well.

          Also, there are degrees of transparency. For example, featuring an exploratory “finding” in an abstract is quite different from saying (perhaps in a section called something like “Questions for further research”), “We also did some unplanned testing and found statistical significance of ___. However, this might be a spurious finding, so further data need to be collected to check out whether the effect is confirmed and whether it is practically significant.”

    • I’m not on the FB group, but this paper was met with incredulity among the social psychologists at my university. There seem to be some editors/reviewers who are more likely to let things slide than others. I hope they knock it off soon. I’m sick of reading about these studies in the NYT.

  1. What’s most weird to me is the idea that living in a city that had a massive earthquake that caused tremendous loss of life and destruction of communities would somehow be the same thing as sitting on an inflatable pillow. Who thinks that?

  2. There is an interesting classic study about the misattribution of arousal showing that physical arousal leads to MORE sexual closeness? I am not sure who made it and when the study was conducted but I think they could have found effects in the opposite direction if they had wanted to… :-)

  3. Andrew – proposal for journal quality metric: Is a paper in Journal X more likely to be motivated by Google search or Google Scholar search?

    Google results for “Sichuan, China earthquake 2008 divorce”:


    Example: “The wave of marriages and divorces after the earthquake seem to be an indication that the disaster has influenced and changed peoples’ mentality and attitude toward life.” ( http://www.theepochtimes.com/n3/1732422-marriage-and-divorce-rates-rise-in-sichuan-region-after-quake/ )

    Google Scholar results for “Sichuan, China earthquake 2008 divorce”:


    Example: “The study investigated the symptoms of posttraumatic stress disorder (PTSD) and associated risk factors among adult survivors 3 months after the 2008 Sichuan earthquake in China. One thousand five hundred sixty-three earthquake survivors in two communities participated in the study. The prevalence of probable PTSD was 37.8% and 13.0%, respectively, in the two communities that were affected differently by the earthquake. The significant predictive factors for the severity of PTSD symptoms were female gender, subnationality, lower educational level, lower social support, and higher initial exposure level. The results indicate that PTSD is also a common mental health problem among earthquake survivors in China.” ( http://onlinelibrary.wiley.com/doi/10.1002/jts.20439/abstract )

  4. The first few paragraphs seem to be an argument from personal incredulity. The “forking paths” argument seems to be that the (reasonable) exclusion criteria are somehow suspect – that we should be suspicious of the authors who took the time and space to present DVs that weren’t significant. I’m not entirely sure why mediation analyses should not be allowed when you have an a priori reason to look for mediation; statistical suppression means that you can have a clear causal link without any significant main effects (in much the same way that gravity continues to operate without us falling through the ground).

    We are entering a new era of open science – it is a challenging time, with many fields and journals slowly changing their ways. We can either have a more complete picture of the data with all the messiness it entails, or we can have people selectively reporting the “clean” story. Reporting marginal effects, dependent variables that weren’t consistent with predictions and the full exclusion criteria that were used are a part of the movement toward more complete reporting, and should be applauded rather than shunned. Reality is almost always messy; open science requires that people have a greater tolerance for the subtleties and ambiguities that more complete reporting produces.

    • Daniel:

      1. Personal incredulity is indeed relevant. Let me put it another way. These sorts of studies get published, and publicized, in part because of their surprise value. (Elderly-related words can prime slow walking! Wow, what a surprise!) But when a claim is surprising, that indeed can imply that a reasonable prior will give that claim a low probability.

      2. I think it’s wonderful for people to analyze all outcomes. That’s just great. But then the researchers should use an appropriate analysis. Computing separate p-values is not an appropriate analysis when there are forking paths. Instead I recommend hierarchical modeling, as discussed in my paper with Hill and Yajima. My problem with the paper discussed above is not that they were honest about their research decisions, it’s that they used inappropriate analyses and took post-selection p-values as strong evidence in favor of their theory.

      I have lots of tolerance for the subtleties and ambiguities that more complete reporting produces. But I don’t have a lot of tolerance for making strong claims based on p-values obtained from this forking-paths environment. Maybe the researchers were trying their best and didn’t know any better. I’m not saying they’re bad people. But they made a statistical mistake. Statistics is tough. Unfortunately, as long as Psychological Science keeps publishing this sort of paper, it provides an incentive for future researchers to fly blind in this way. And I don’t see this sort of noise-mining as doing the science of psychology any favors.

      • Hi Andrew,

        Thank you for responding to my comments. I have a few follow-up questions/points.

        1) Whose incredulity are we supposed to use to judge? With findings about priming and embodied cognition, it doesn’t seem particularly outrageous to me that a response to physical instability could easily influence our perceptions of other things in the moment as well, including romantic relationships. Part of the reason we use the peer-review process, as flawed as it is, is because experts in the field have the judgment to decide not only if a study is worthy, but if its claims are reasonably supported in the context of the theories and findings in a particular field. The hypotheses in this paper are supported by a vast amount of work on priming and embodied cognition research; if you want to question the field in general on these findings that’s one thing, but throwing one particular paper under the bus for having surprising results that really aren’t that surprising in the context of the field doesn’t seem to be fitting with adjusting personal estimates of probability in the face of evidence.

        2) Thank you for the paper recommendation, I’m looking forward to reading it. It seems like you’re assuming that the authors have analyzed all of these paths and selectively reported the ones that supported their analyses without any evidence, or rather even in the face of contradictory evidence: as Simmons et al. demonstrated, almost anything can be made to be significant at p < .05, so the fact that these results are not traditionally perfect should call into question the idea that this was a demonstration of forking paths analyses. From what I understand of the term, it's when alternative analyses were equally plausible given the hypothesis, leading to the selective reporting of analyses that worked. Is it really plausible to include participants who didn't follow the instructions or who correctly guessed the hypothesis? Is it really a forking path to consider exclusive romantic relationships? The choices that you've highlighted here don't seem arbitrary, which I thought was a defining characteristic of the idea of forking paths, though I likely have a very different set of priors for what seems like a reasonable decision. As a side note, I strongly believe data should be made publicly available, which would really help in determining the extent to which particular results are due to the forking paths problem.

        I think the heart of my disagreement is that you've chosen to believe the worst about the data and the researchers, far beyond what's supported by the paper – there are lots of positive signs here, and you seem to take your own personal incredulity (which flies in the face of the current evidence in the field) as grounds to believe that all kinds of invisible mistakes have been made in the analyses. I don't think it should be possible for us to disagree about this kind of thing without a very flawed model of disseminating research, but it seems like your conclusions are speaking beyond what the available information shows.

        • “With findings about priming and embodied cognition, it doesn’t seem particularly outrageous to me that a response to physical instability could easily influence our perceptions of other things in the moment as well, including romantic relationships.”

          I was under the impression that most of the findings about priming and embodied cognition had been discredited (e.g., failure to replicate, flaws in analysis).

        • Priming itself is an extremely well-replicated phenomenon. With social primes (a subset of priming research), there have been a large number of studies demonstrating the effect with a few notable failures to replicate (Doyen et al. being one of the most notable due to the online discussion that ensued). Many of the replications haven’t taken into account advances in moderators that flip around the behavioural effects of social priming research. That being said, there have been some failures to replicate, as well as successes. My understanding is the general consensus is that some of the initial effects were type I errors, that many were not, and that the ideas and effects of social priming in particular and priming in general still have substantial support.

        • Daniel:

          1. I’m not throwing one particular paper under the bus. I’m saying that many papers, including this one, have what is essentially zero evidence and take this as strong support for their model. If they want to theorize, fine. Regarding plausibility, one might equally argue, for example, that being primed with elderly-related words will make college students walk faster, as this would prime their self-identities as young people. Whatever. Theories are a dime a dozen. These papers get published because they purport to have definitive evidence, and they don’t.

          2. You write, “It seems like you’re assuming that the authors have analyzed all of these paths and selectively reported the ones that supported their analyses.” No no no NO NONONONONONONONONO!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! The whole point of the garden of forking paths is that these p-values are invalid, even if the researchers only applied a single test to their particular data. Please see the title of this paper. This is one reason I have come to dislike the terms p-hacking, fishing, etc.: they (a) imply that people whose studies have these problems have cheated in some way, and (b) imply that if a researcher has not cheated, that his or her p-values are valid. No.

          In any case, in this study the researchers explicitly admit to having forking paths. They write, “Although we observed only a marginally significant effect of posture condition on perceived relationship stability, it is widely accepted that indirect (i.e., mediated) effects can be examined even in the absence of any direct link between a predictor and outcome” (it seems pretty clear that, had they observed a statistically significant main effect, they would’ve reported it as a success). And they write, “we observed no direct or indirect effects of stability condition on card design intimacy, so we do not discuss it further” (it seems pretty clear that, had they observed a statistically significant difference here, they would’ve reported it as a success).

          You write, “you’ve chosen to believe the worst about the data and the researchers.” No I haven’t! I wrote, “Maybe the researchers were trying their best and didn’t know any better. I’m not saying they’re bad people.” I’m explicitly not choosing to believe the worst about these researchers. What I am saying is that they’re doing bad statistics. That’s just the way it is, it’s not believing the worst about anybody. If you were to see me out on the basketball court, you could correctly describe me as playing bad basketball. That’s not believing the worst about me, it’s just a description.

          You write that my “conclusions are speaking beyond what the available information shows.” No. I’m not making any conclusions about whether wobbly stools lead to wobbly relationship beliefs. Not at all. My only conclusion is that the authors in this paper are not presenting strong evidence for their claim. Low p-values, in the presence of the garden of forking paths, are not strong evidence for anything.

        • Hi Andrew,

          Thanks again for taking the time to respond.

          1) Regarding priming, I’m assuming you’re referring to Bargh’s paper and the failed replication attempt by Doyen et al. One of the most frustrating things about that exchange is that the field had already moved on to a deeper understanding of social category priming, with the Cesario, Higgins and Plaks 2006 “motivated preparation to interact account”. They found that people who like the elderly walk slower when primed, while people who dislike the elderly walk faster (among other findings). I don’t really care whether Bargh got lucky with his sample or what was going on, but a replication attempt should take into account advances in the field, particularly with a moderator that can flip the effect around.

          These papers are not purporting to have definitive evidence for their theory – all (most?) theories are likely wrong or at least incomplete. Moreover, the role of this particular journal is to provide cutting-edge research – there are other journals where far more work into supporting a particular theory or understanding mechanism is required. This journal is more for a couple of short flashy studies that show something interesting and that other researchers may want to follow up on. They provide evidence, but any particular paper in any journal shouldn’t be held as definitive due to the nature of inferential statistics, the limitations with generalizability and our continually growing knowledge.

          This paper presents a great deal more than zero evidence – again, you are allowing your personal incredulity to colour your interpretation of the entire paper.

          2) I’ve read up a little on your idea of forking paths, so please feel free to correct me if I’m wrong. From what I understand, you are suggesting that the researchers were influenced by the data to conduct particular statistical tests which influence the likelihood of them finding what they’re looking for. I still don’t see how what they reported is an example of this happening, or how what they reported invalidates all of their evidence. They reported results that were inconsistent with their hypothesis (such as the greeting card DV). Yes, the results would have been stronger if that DV had matched their expectations, but as they reported that DV, you as a reader are able to evaluate the full strength of the evidence. If they hadn’t reported it, you would be misled as to the strength of the evidence, but that’s not what they did. What exactly is the alternative here? It seems like they could either hide the results that didn’t match the analysis, or mention that they were there so that you’re not misled. You suggested preregistering hypothesis, but these data were collected before that trend widely existed, and it seems to be only your personal intuition that the researchers were subtly influenced by their data set to get the significant findings that they did (despite a priori expectations regarding the findings and analyses, but again alas, this was before the trend of preregistration had picked up).

        • Daniel:

          1. I see no strong evidence. What I see is noise-mining. Each new study finds a new interaction, meanwhile the original claimed effects disappear upon closer examination.

          2. In this particular example, the researchers report many tests on the data, they discard the results that aren’t statistically significant and focus on the statistically significant results. The resulting p-values are close to meaningless; they aren’t strong evidence of anything.

          You write that I “suggested preregistering hypothesis.” Where did I suggest that? I did a quick search of the above thread and did not find where I suggested preregistering. But maybe I missed it? Anyway no I don’t suggest preregistering. What I suggest is doing all comparisons and using a hierarchical model. Not taking the most prominent interactions and using unadjusted p-values. I don’t think it makes sense to summarize inference with p-values in any case, but if you are going to do it, I think you should do it right.

          Again, I can understand how the researchers came to do this—it’s standard in many fields, including psychology, and this is the sort of thing that gets published in top psychology journals. But, no, it’s a recipe for chasing one’s tail.
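
One way to read the suggestion above (“doing all comparisons and using a hierarchical model”) is as partial pooling: estimate every comparison, then shrink the noisy estimates toward their common mean by an amount that depends on how much they vary beyond what their standard errors would predict. The sketch below is a crude empirical-Bayes version under an assumed normal-normal model with a method-of-moments variance estimate. It is not the method of the paper with Hill and Yajima, the function name `partial_pool` is hypothetical, and the effect estimates in the demo are invented numbers, not values from the study.

```python
def partial_pool(estimates, std_errors):
    """Shrink several noisy effect estimates toward their common mean.

    Assumed model (illustrative only):
        estimate_j ~ Normal(theta_j, se_j^2),  theta_j ~ Normal(mu, tau^2)
    mu is estimated by the unweighted mean, and tau^2 by a method of moments:
    the sample variance of the estimates minus the average sampling variance,
    truncated at zero. Standard errors are assumed to be strictly positive.
    """
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two estimates, each with a standard error")

    mu = sum(estimates) / k
    sample_var = sum((y - mu) ** 2 for y in estimates) / (k - 1)
    mean_noise_var = sum(se ** 2 for se in std_errors) / k
    tau2 = max(0.0, sample_var - mean_noise_var)

    pooled = []
    for y, se in zip(estimates, std_errors):
        # Weight on the raw estimate: 0 means pool completely, 1 means no pooling.
        weight = tau2 / (tau2 + se ** 2)
        pooled.append(mu + weight * (y - mu))
    return mu, tau2, pooled


if __name__ == "__main__":
    # Hypothetical standardized differences for six outcomes, each with SE 0.25.
    raw = [0.45, 0.10, -0.05, 0.20, 0.00, -0.15]
    mu, tau2, pooled = partial_pool(raw, [0.25] * 6)
    print("mean:", round(mu, 3), "tau^2:", round(tau2, 3))
    print("pooled:", [round(p, 3) for p in pooled])
```

In the made-up demo the estimates vary no more than their standard errors alone would produce, so the estimated between-outcome variance is zero and every estimate, including the largest one, is pulled all the way to the common mean. A full Bayesian fit would also carry uncertainty about mu and tau, which this plug-in version ignores.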

  5. Andrew, after covering some of the concepts introduced to me by your blog to my students I had them read and critique this very paper for an assignment. You’ll be pleased to know that about half of the class actually used the term “garden of forking paths”.

  6. I guess Canada has enough money to burn on this kind of stuff. Or funding this might be a clever way of generating inflation and making obnoxious social scientists shut up at the same time.

  7. Andrew,
    Have you ever considered a “Towards a Transformative Hermeneutics of Quantum Gravity”-like submission to Psychological Science? I know nothing of the journal beyond what I read here but it sure sounds ripe for a contribution of that sort.

Comments are closed.