How to think about a Psychological Science paper that seems iffy but is not obviously flawed?

So I open the email one day and see this:

hi, Andrew – FYI, here’s another paper from the Annals of Small-N Correlational Studies, also known as Psychological Science:

hope all is well!

The research paper he was referring to is called “The Ergonomics of Dishonesty: The Effect of Incidental Posture on Stealing, Cheating, and Traffic Violations,” and it’s by Andy Yap, Abbie Waslawek, Brian Lucas, Amy Cuddy, and Dana Carney.

Regular readers will know I’ve been ragging on Psych Science recently, and I just looove the name “Andy Yap” (I don’t know the person, I just like the name because I’m named Andy and I like to yap; I guess in the same way that someone named Andrew who loves jello might find my name amusing), and the paper itself seems eminently mockworthy, with goofy images like this:

[Screenshot of a figure from the paper]

My correspondent continued:

Please don’t quote me on this, but in study 4, they report p < .001 (ooh!)...then, oh yeah, they figure out that a bigger seat = bigger car (gee, why didn’t they think of that until *after* they reported p < .001?), so then they control for car size...and the seat effect has an un-sexy p=.087 (yes, they report their super-exact p to 3 decimal places). I need a drink!

I took a quick look at the paper and here’s what I thought:

Study #1 looks pretty clean, although it’s possible that the experimenters unwittingly manipulated the money-passing condition, as they knew which participants had open stances and which had closed stances. Other than that possible problem, that experiment seems clean, no?

With experiment #2, I wonder whether it was just easier to cheat in the expansive condition. But, again, I don’t know if that’s a legitimate criticism or me just searching for a possible problem with the study.

But experiment #3 seems ok, no?

I agree with my correspondent on study #4, that it is supportive of the general claims in the paper but nothing more than that. I wouldn’t really take study #4 as any kind of independent evidence.

Overall, though, this all looks much stronger than the arm-size and political attitudes study, or the ovulation and political attitudes study.

So: this whole “embodied cognition” thing is pretty controversial, and I’m not a fan of the style in which the paper is written (lots of presentation of positive results and not so much on the warnings; for example, in study #4 it is explicitly stated that the measurement is taken by “hypothesis-blind research assistants,” but in the other studies there is no mention of this, which I take to mean that the measurements were not hypothesis-blind but the researchers didn’t want to emphasize that point). But, at least from a statistical standpoint, most of the results seem clean, without a lot of mucking around looking for subsets and interactions and controls.

It was easier for me to get a handle on the political studies (for example, not trusting at all the huge effects claimed in the ovulating-and-voting analysis), but I don’t know enough about plain old psychology to have a sense of how to think about these embodied cognition experiments. I wish the study was crappy and I could just comfortably sit here and mock.

Maybe some of you can help me out on this?

24 thoughts on “How to think about a Psychological Science paper that seems iffy but is not obviously flawed?”

  1. The easiest way to tell whether a finding in a journal like Psych Science is dubious is to examine the sample size. In this paper, Studies 1-3 have samples of fewer than 100 (Study 2 is particularly small with 34). The average effect size in social psychology is r = .21, which requires upwards of around 150 participants to have sufficient statistical power to detect even the average effect of social psychology. These studies fall well short of this benchmark. Add in the fact that these manipulations are relatively subtle, and one would expect smaller-than-average, not larger-than-average, effects (unless we believe that our honesty is based mostly on subtle contextual cues). This is how I reached the conclusion that these findings are likely too good to be true.
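    That sample-size benchmark is easy to check. A minimal sketch using the standard Fisher-z approximation for the power of a correlation test (this yields roughly 176 participants at the conventional 80% power, which is consistent with the “upwards of around 150” figure; the exact number depends on the power target assumed):

    ```python
    from math import atanh, ceil
    from statistics import NormalDist

    def n_for_correlation(r, alpha=0.05, power=0.80):
        """Approximate sample size needed to detect a correlation of r
        in a two-sided test, via the Fisher z transformation."""
        nd = NormalDist()
        z_alpha = nd.inv_cdf(1 - alpha / 2)  # critical value for the test
        z_beta = nd.inv_cdf(power)           # quantile for the power target
        return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

    print(n_for_correlation(0.21))  # 176
    ```

    So even the field’s *average* effect needs a sample several times larger than Study 2’s n=34.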

  2. I don’t have a problem with the idea of the study overall. It looks at small scale, fairly small effects of induced context. It can’t be extended to discuss honesty generally, but it’s an offshoot of the sort of work used, for example, in designing office or public spaces to minimize theft and increase sociability, etc. But I don’t accept the level of the findings – like 78% of the expansive posture people kept the money versus 38% of the contractive posture people – because of the obvious limitations.

      • Of course they do. It’s no fun if you just describe a bunch of people either taking or not taking $4 extra. But they really refer to situational honesty and there is work on that, from how store design affects shoplifting to how contextual clues alter your perception of how much you ate. And spend a few minutes talking with a sleight of hand artist – or a genuine con man – and you understand how they use contextual clues to shape responses. So I would continue to say there’s nothing wrong with the idea of the study. I think the limitations of the data and presentation are fairly blunt.

  3. I am a cognitive psychologist, so my expertise only somewhat overlaps with the content of this article. I don’t know if this is really a crappy study, but it’s far from being excellent work.

    Experiment 1 has solid statistics (post hoc power is 0.95), but footnote 1 indicates that they combined data sets from two different samples. This is not necessarily wrong, but it is odd and open to many other possibilities (would they have combined the data sets if doing so rendered a significant result non-significant?).

    One of the key findings in Experiment 2 is reported as just at the criterion for statistical significance (p=.05). Recalculating the p value gives .054, so I guess the authors rounded down, even though technically this is a non-significant finding. Computing post hoc power (assuming the .05 criterion) gives 0.47.

    One of the key findings for Experiment 3 is just below the criterion for significance (p=.046; post hoc power 0.51). This is one of several hypothesis tests, including a mediation analysis that requires both a significant effect (p=.03) and a non-significant effect (p=.09) to make its inference. The probability of the mediation analysis working out as desired must be rather low, but I’m not sure how to compute it.

    For experiment 4 a key result is described as being marginally significant (p=.087). If we assume that the criterion for being marginally significant is .1, then post hoc power is around 0.50.
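    These post hoc power figures can be roughly reproduced from the reported p values alone. A minimal sketch, treating each reported two-sided p as if it came from a z test and taking the observed z as the true effect (the values quoted above were presumably computed from the exact test statistics, so this approximation lands close to, not exactly on, them):

    ```python
    from statistics import NormalDist

    def post_hoc_power(p, alpha=0.05):
        """Approximate post hoc power: convert the observed two-sided p
        value to a z score, assume that z is the true effect, and ask how
        often a replication would clear the two-sided alpha criterion."""
        nd = NormalDist()
        z_obs = nd.inv_cdf(1 - p / 2)       # observed z from the reported p
        z_crit = nd.inv_cdf(1 - alpha / 2)  # critical z for the criterion
        return nd.cdf(z_obs - z_crit)

    print(round(post_hoc_power(0.054, alpha=0.05), 2))  # ~0.49
    print(round(post_hoc_power(0.046, alpha=0.05), 2))  # ~0.51
    print(round(post_hoc_power(0.087, alpha=0.10), 2))  # ~0.53
    ```

    Note the pattern: a result just at the significance criterion always implies post hoc power near 0.5, since about half of exact replications would land on the wrong side of the cutoff.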

    Applying the Test for Excess Significance (Ioannidis & Trikalinos, 2007), we multiply the estimated power values to get 0.11. This can be considered an estimate of the probability that repeating these experiments and analyzing them in the same way would produce samples that show results at least as good as these. I find the probability to be disconcertingly low.
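    The arithmetic behind that 0.11 figure is just the product of the four per-experiment power estimates, i.e., the chance that four independent replications would *all* clear their respective criteria:

    ```python
    # Estimated post hoc power for Experiments 1-4, as given above.
    powers = [0.95, 0.47, 0.51, 0.50]

    p_all_significant = 1.0
    for power in powers:
        p_all_significant *= power

    print(round(p_all_significant, 2))  # 0.11
    ```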

    So, across the experiments there is: merging of data sets, rounding down of p values, changing criteria for statistical significance, and overall low power.

    Given the information in the footnotes, it seems to me that the authors are trying to provide full information about sampling and analysis methods. This is good, but I think this disclosure largely displays how confused some researchers are about statistical analysis/inference and how fragile some of the results are. Take footnote 8, which links back to a significant F test (F=4.81, p=.032). The footnote explains that two participants (out of 71) who had trouble with the experimental task were removed in an “a priori” decision. Keeping those participants’ data radically changes the result: F drops to .5. Either the removed participants had incredibly extreme scores, or the original results were due to some extreme scores. Participant removal may be justified, but it implies that the findings are rather fragile.

    • Thanks for this analysis. It is interesting how the (quite reasonable) complexity involved in some of these sample sizes makes such power calculations more difficult.

      • I prefer to use the term “design analysis” rather than “power calculation” to emphasize that I’m interested in replication, but I’m not particularly interested in the probability that a comparison is statistically significant.

        • I agree about the (lack of) interest in statistical significance. On the other hand, I think it is pretty clear that statistical significance is what mattered to the original authors. It is in that spirit that I find the power analysis interesting. I think we (psychologists) need to use better model development and analysis methods so that we can better evaluate the connection between data and theory.

  4. Criterion number 1: Did they register a study protocol before recruitment of experimental subjects?

    If not, triple the p-values (ok, I made the blow-up factor up, but you get the point)

  5. I’m inclined to agree with Richard Gill who yesterday commented on my blog:
    “If you believe his field (social psychology) is “science”. Think up a cute social observation or guess. Eg vegetarians are nicer people than non-vegetarians. Do a little experiment with 50 psychology students who gain credit from participating, confirm your theory, and publish. Make sure the results are reported in the newspapers! The next generation of psychology students earned their degrees by merely answering questionnaires for “scientific” purposes!”
    I also offer a speculation, in my reply to him, why such “research” might nevertheless continue to be with us.
    I’d be glad to have any reactions.

  6. Any handy rule of thumb for making a point harder to refute than to prove? [think… of the -1 / +1 button’s popularity]

  7. I’m not inherently skeptical of psychology papers about “priming,” because that’s what a lot of marketing and entertainment is. Say there’s a movie scene about a naive hero who goes into an office and gets cheated by a crook behind a desk, and the filmmakers want to imply to the audience that there’s something not quite right about what’s going on. The combined talents of the screenwriter, director, casting director, actors, set designer, costumer, editor, composer, etc. are usually quite effective at priming the audience to intuit what they are supposed to intuit. They’ll take care over things like the size of the desk and the posture of the person at the desk. In general, I think psychologists could generate more and better hypotheses for testing by sitting down with professional entertainers and advertisers to get them to divulge some tricks of the trade for priming audiences.

  8. As I read the abstract, I thought “sounds a lot like the research Diederik Stapel was doing in the last few years” (before he got caught making up quite a lot of his data). Then I noticed that two Stapel-co-authored papers were cited. I checked, but they don’t seem to be among the 50+ of Stapel’s articles that have been retracted. This doesn’t, of course, tell us anything else about this study.

  9. Lots of evidence for non-conscious effects on decision making.

    That said, we can be confident that effects such as those reported in this study are muted. Otherwise, they’d be subtly exploited, for example, by drill sergeants, therapists, and self-help authors.

    Or have they been exploited?

    There’s a batty book by Felicitas Goodman called “Where the Spirits Ride the Wind: Trance Journeys and Other Ecstatic Experiences,” which argues that specific body postures, including open body postures, recur throughout religious traditions because of their effects on consciousness. Any serious consideration of this question would need to account for Galton’s problem, and much else. However, maybe there’s something to these ideas, despite Goodman’s bad reasons. Maybe not…

    As for this study, I agree the results are over-sold. Psychological Science nearly requires that sort of hype as a precondition of acceptance.

    • Suppose experimental design, implementation, and analysis are ok. That still leaves the subtle question of interpretation:



  10. Does anyone else think that the effect size in Experiment 1 is too good to be true? Was the person handing out the money blind to the hypotheses and to the respective conditions of the participants?

    • Nick:

      I’m guessing that the experimenter in study 1 was not blind, for the following reason: In study #4 it is explicitly stated that the measurement is taken by “hypothesis-blind research assistants,” but in the other studies there is no mention of this, which I take to mean that the measurements were not hypothesis-blind but the researchers didn’t want to emphasize that point

      • Yes, that was part of my motivation for asking the question. It seems to me that there are lots of subtle ways to “accidentally” include a $5 in a set of four bills, with substantial difference in /a/ how noticeable it is and /b/ how your demeanour conveys the acceptability or otherwise of keeping the excess.

        • Nick,

          Yes, that’s what I was thinking in my post, when I wrote that it’s possible that the experimenters unwittingly manipulated the money-passing condition, as they knew which participants had open stances and which had closed stances.

          Also there’s Greg Francis’s comment that the authors had various degrees of freedom in combining their data to choose what comparison to test.

    • Yes, that was a disturbingly non-skeptical news article. As an extra benefit, it featured a reference to a TED talk! And I just loove the surprising bit that big cars are more likely to take up multiple parking spots…
