Comments on “Improving the Dependability of Research in Personality and Social Psychology: Recommendations for Research and Educational Practice,” the report of the SPSP Task Force on Publication and Research Practices

On the sister blog we’re supposed to switch to catchy headlines to match our Big Media presence. I actually think that’s a good idea. It’s not easy to write good titles, and it’s worth trying to do better in that regard.

But coming up with catchy titles is hard work, and it’s a bit of a relief to be back here, free to do whatever I want. So, just for fun, I tried to come up with the most boring (while remaining accurate) title for this post. Above, you can see what I came up with!

OK, now to the real story. Dan Wisneski sent me a copy of this document [updated version here] by David Funder, John Levine, Diane Mackie, Carolyn Morf, Carol Sansone, Simine Vazire, and Stephen West, that’s been “starting to make the rounds in social psych circles.” The paper in question addresses the now-familiar topic of unreplicable research, the sort of linkbait which often seems to get published in “Psychological Science” nowadays.

As an outsider to the field, my impression is that concern over this issue started to increase following John Ioannidis’s famous 2005 paper, “Why most published research findings are false” (duly cited in Funder’s report) and then got kicked into high gear after the 2011 publication in JPSP of Daryl Bem’s ridiculous paper on ESP.

As Mickey Kaus might say, the media treatment of Bem’s paper was just like that of any other bit of science hype, only faster: the entire cycle, from release of the preprint to announced publication to credulous news reports (after all, JPSP is a top journal, and science writers know that) to skepticism to controversy to debunked, all seemed to happen in just a few days.

To say it in a way that you will all understand: Daryl Bem's ESP paper is the "The Rutles" of science reporting, brilliantly and concisely capturing, in deadpan style, all the tropes in just the right order, with no distracting details to get in the way. (It was, for example, convenient that there was no associated scandal and no political valence to the research. That way, people had to react to the story itself and couldn't simply rely on preprogrammed ideological reactions, as can happen with all those evolutionary psychology stories.)

In its stark, almost parodic series of steps, the Bem episode made a lot of people realize: Hey, this happens all the time! Maybe we can short-circuit the process next time and just jump straight to the last step of not believing the outrageous claim?

But that's not so easy. Bem's elusive ESP finding may be easy to discard, and maybe we can also laugh at the "dentists named Dennis" study (although I believed that one when it came out, and I remain sympathetic to the hypothesis). And what about the "think outside the box" study, or "stereotype threat"? It might be that none of these are real, but most of us don't feel so comfortable rejecting all of them. Recall that even Brian Nosek, a central player on Team Skeptic, had his own pet hypothesis ("50 shades of gray") which he had to abandon only after it failed to replicate.

To step back for a moment, here's what a lot of thoughtful people were saying after the Bem debacle: In politics, they sometimes say that the scandal is not what's illegal, but what's legal. Similarly, the scandalous aspect of the Bem study was not that he was some sort of badass Mark Hauser who broke all the rules and just didn't care, but rather that he followed all the rules. He did what everyone said you should do, and look where that left him! So, many thoughtful people said, instead of letting Bem twist slowly in the wind, we should figure out what went wrong so that this sort of paper doesn't get routinely published.

That is, we need to change the rules. And that’s what this discussion is all about.

Here’s a key part of the story (from page 20 of the Funder et al. report):

The late meta-analyst John Hunter wryly offered his observations on the progress of research in many areas of psychology given that researchers often ignore considerations of effect size and statistical power. According to Hunter, a research area begins with the proposal of an interesting hypothesis and the excitement of a first demonstration study that finds a large effect size. Subsequent research tries to clarify the phenomenon by designing studies to rule out alternative explanations, thereby making the effect size smaller. This stage is followed by a generation of studies investigating mediation and moderation, which further reduce the effect size.

Or maybe there is no effect there at all. Or, perhaps closer to the truth in many cases, maybe the effect is positive in some settings and negative in others, with variation being high enough that substantially different effects will be found in different populations at different times under different experimental conditions.

Here’s what I like about the Funder et al. report: It is thoughtfully addressing important and real questions, and in my opinion its recommendations are generally going in the right direction.

Here's what I don't like about the report: It remains tied to what I see as an old-fashioned statistical approach based on power analysis (that is, statistical significance and "p less than .05"), which in turn relies on a conception of science that turns on the discovery of nonzero effects. As alluded to above, in the human-science settings with which I am familiar, just about nothing is zero, but effects and comparisons can be highly variable, so much so that in many cases the idea of "the effect" of something does not make much sense.

In addition, I don’t really buy the Funder committee’s acceptance of the standard paradigm in which a researcher can specify a hypothesis ahead of time and then simply test it. I do believe that pre-registration of research hypotheses is both possible and a good idea, but I think this preregistration makes most sense as the second half of a study (as in the above-noted Nosek et al. paper), following up on a more traditional exploratory (even if theory-driven) part.

I am devoting more space to what I don’t like about the report than what I do like, so before going on I should probably emphasize that, overall, I think the report is a big step forward.

Let me say this another way. The report is completely reasonable. Indeed, it gives the sort of advice that I might have recommended to practitioners three, five, or more years ago. But, given what I know now, I don't think it goes far enough.

Here are the report’s recommendations for research practice:

1. Describe and address choice of N and consequent issues of statistical power.
2. Report effect sizes and 95% confidence intervals for reported findings.
3. Avoid “questionable research practices.”
4. Include in an appendix the verbatim wording (translated if necessary) of all independent and dependent variable instructions, manipulations and measures. If the manuscript is published, this appendix can be made available as an on-line supplement to the article.
5. Adhere to SPSP’s “Data Sharing Policy” which states that: “The corresponding author of every empirically-based publication is responsible for providing the raw data and related coding information . . .
6. Encourage, and improve the availability of publication outlets for replication studies.
7. Maintain flexibility and openness to alternative standards and methods when evaluating research.

I'm 100% with them on items 4, 5, 6, 7. I recently had an experience in which I had some difficulty commenting on a paper that had been published in a top journal: neither the article nor the supplemental material gave the survey questions used in the study, there was no clear statement of the data-exclusion rules, and the raw data themselves were not provided. This information should always be available, but if sharing it is neither a requirement nor a norm, we can't expect people to provide it. Not because of secrecy, but simply because it takes effort to put it all in there, and the #1 goal in writing a paper is typically to get it accepted, not to provide information for later researchers.

As for items 1, 2, 3, I think they represent a good start but we can do better:

1. The choice of N is important, and I’m completely in favor of design calculations; I just think they should be decoupled from “statistical power,” which is a very specific idea that is tied to statistical significance. Design calculations are relevant whether or not statistical significance is going to be part of the story.
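To give a sense of what a design calculation decoupled from significance might look like, here is a minimal sketch framed in terms of Type S (sign) and Type M (magnitude) errors rather than power alone. The function name and the input numbers are mine, invented purely for illustration.

```python
# A minimal sketch of a design calculation that does not treat statistical
# significance as the end goal: given a guessed-at true effect size and the
# standard error a design would produce, it reports classical power, the
# probability of a sign (Type S) error among significant estimates, and the
# expected exaggeration (Type M) of those estimates. Illustrative only.
import numpy as np
from scipy import stats

def design_calculation(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    z = stats.norm.ppf(1 - alpha / 2)
    # Classical power: chance the estimate comes out statistically significant.
    power = stats.norm.cdf(-z - true_effect / se) + 1 - stats.norm.cdf(z - true_effect / se)
    # Simulate replications to see what the significant estimates look like.
    est = np.random.default_rng(seed).normal(true_effect, se, n_sims)
    signif = est[np.abs(est) > z * se]
    type_s = np.mean(np.sign(signif) != np.sign(true_effect))  # wrong-sign rate
    type_m = np.mean(np.abs(signif)) / abs(true_effect)        # exaggeration ratio
    return {"power": power, "type_s": type_s, "type_m": type_m}

# A small assumed effect measured noisily: power is low, and the estimates
# that do reach significance are exaggerated and sometimes have the wrong sign.
print(design_calculation(true_effect=0.1, se=0.25))
```

The point of a calculation like this is that it remains informative even if statistical significance never enters the final analysis: it tells you what your design can and cannot resolve, given a realistic guess at the effect size.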

2. Effect sizes and 95% intervals are fine, but they don't really solve the key problem of the statistical significance filter. When we focus on statistically significant results, we will systematically overestimate the magnitude of effects, sometimes by a huge amount. So the corresponding effect sizes will be misleading. Yes, in some settings an expert can see the unrealistically high effect-size estimates and scream that there is a problem, but what seems more common is people just accepting these ridiculous numbers at face value (e.g., more beautiful people being 8 percentage points more likely to have girl babies, ovulating women being 20 percentage points more likely to vote in a certain way, women at more fertile days in their menstrual cycle being 3 times (!) more likely to wear certain colors of clothing). And confidence intervals can be even worse, in that the extreme endpoint of the confidence interval can well be out in never-never land. You'll see this a lot in epidemiology studies, where the 95% interval for the risk ratio is something like [1.1, 8.5], and realistically we don't believe it could be much higher than 1.5 or 2. The only point of the interval is that it excludes the null value.

I'm not saying that interval estimates are useless. It's important to have a sense of inferential uncertainty. But with small sample sizes (or, more generally, sparse data), classical confidence intervals don't look so good. They'll include all sorts of unreasonable values.

Funder et al. write, “Confidence intervals can be easily constructed for most types of effects, but sometimes complications arise. Some confidence intervals (e.g., for the Pearson r) are not symmetric and require a normalizing transformation (e.g., the Fisher r to z transformation); others do not have a known mathematical solution and can only be constructed empirically through repeated sampling procedures (e.g., bootstrapping)”—but all that has nothing to do with anything. The problem is much more fundamental than that.
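To show what I mean by "fundamental," here is a toy simulation of the significance filter, with all numbers invented for illustration: the true effect is a modest 2 percentage points, the standard error is 8, and we look at what a literature built only on the p < .05 survivors would report.

```python
# Toy simulation of the statistical significance filter; the true effect
# (2 percentage points) and standard error (8) are invented for illustration.
import numpy as np

rng = np.random.default_rng(2013)
true_effect, se = 2.0, 8.0
est = rng.normal(true_effect, se, 100_000)       # estimates from many replications
signif = est[np.abs(est) > 1.96 * se]            # the ones that pass p < .05

print("share reaching significance:", len(signif) / len(est))
print("average estimate, all replications:", est.mean())                # close to 2
print("average |estimate| among significant results:", np.abs(signif).mean())  # wildly inflated
print("share of significant results with the wrong sign:", (signif < 0).mean())
# A typical significant estimate of ~18 comes with a 95% CI of roughly
# (3, 35): it excludes zero, but its upper end is out in never-never land.
```

The average of all the estimates is honest; the average of the published ones is not, and the intervals attached to them inherit the same exaggeration. No transformation of the interval fixes that.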

3. Funder et al. want researchers to avoid questionable research practices and avoid “procedures that look at the results and then tweak the data post hoc to achieve statistical significance,” including:

(1) conducting multiple tests of significance on a data set without statistical correction; (2) running participants until significant results are obtained (i.e., data-peeking to determine the stopping point for data collection); (3) dropping observations, measures, items, experimental conditions, or participants after looking at the effects on the outcomes of interest; and (4) running multiple experiments with similar procedures and only reporting those yielding significant results. These practices may not be equally problematic; both (3) and (4) have particularly great potential to lead to serious inflation of the Type 1 error rate and yet not be recognized in the review process.

I agree, but I'm afraid this is close to useless advice, because the people who follow these research practices don't in general realize they're doing it! Eric Loken and I have a whole paper on this topic, but, very briefly: what researchers are doing is making data-analysis choices contingent on the data. This includes all of (1), (2), and (3) above, and also (4), in the sense that they are choosing when to pool and when to compare-and-contrast when they run multiple experiments. But because these decisions are contingent on data, and each researcher sees only one data set, they don't realize their multiplicity problems. You see researcher after researcher (Bem included) insisting that their data selection and data analysis are completely theory-driven, and thus not subject to multiple comparisons problems, but when you look carefully you see lots and lots of decisions that were not prespecified.

The point here is that different data sets would lead to different analyses, hence there’s a multiplicity problem even if only one analysis was ever considered for the particular dataset that was observed. The closely related point is that I fear the Funder committee recommendations will be misleading because there are lots and lots of researchers who don’t do “questionable” analyses as described above, but still have multiplicity problems because their data manipulations are contingent on the data they saw.
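To make the "contingent on the data" point concrete, here is a toy simulation of my own construction (not from the Loken and Gelman paper, and not from the report): there are no true effects anywhere, and each simulated researcher runs exactly one significance test per dataset, but the choice of which of two outcomes to test is made after looking at the data.

```python
# Toy "forking paths" demonstration (illustrative construction): no true
# effects, one t-test per dataset, but which outcome gets tested is decided
# after seeing the data. The lone reported p-value is nominally at the 5% level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n_per_group = 10_000, 50
false_positives = 0

for _ in range(n_sims):
    # Two outcomes measured in a treatment and a control group; all pure noise.
    treat = rng.normal(size=(n_per_group, 2))
    control = rng.normal(size=(n_per_group, 2))
    # The data-contingent choice: test whichever outcome shows the bigger gap.
    gaps = np.abs(treat.mean(axis=0) - control.mean(axis=0))
    k = int(np.argmax(gaps))
    p_value = stats.ttest_ind(treat[:, k], control[:, k]).pvalue
    false_positives += p_value < 0.05

# Roughly double the nominal 5%, even though only one test was ever run
# on each dataset.
print("false-positive rate:", false_positives / n_sims)
```

Each simulated researcher here reports a single, honestly computed p-value and would see nothing "questionable" in the process; the multiplicity lives in the choices that other data would have prompted.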

Finally, here are the report’s recommendations for educational practice:

1. Encourage a culture of “getting it right” rather than “finding significant results.”
2. Teach and encourage transparency of data reporting, including “imperfect” results.
3. Improve methodological instruction on topics such as effect size, confidence intervals, statistical power, meta-analysis, replication, and the effects of questionable research practices.
4. Model sound science and support junior researchers who seek to “get it right.”

I’m happy with all these (as long as item 3 is interpreted in light of my comments above regarding the problems with effect size, confidence intervals, and statistical power as general research tools).

P.S. I agree with footnote 5 of the Funder et al. report! And I suspect Hal Stern agrees with it as well.

19 thoughts on "Comments on 'Improving the Dependability of Research in Personality and Social Psychology: Recommendations for Research and Educational Practice,' the report of the SPSP Task Force on Publication and Research Practices"

  1. Can you recommend some further reading with regard to your comment on the choice of N and that it should be decoupled from “statistical power”? I see a lot of rules of thumb in my surroundings regarding the choice of N. But I don’t get further than the classical power analyses to justify my own decisions.

  2. Andrew: “… but all that has nothing to do with anything. The problem is much more fundamental than that.”

    Yes, that is how I feel about much of the teaching and practice of research nowadays.

      • Andrew:

        I am not saying all research is useless, not at all. In fact I find almost all research is useful in some way including conceptual, qualitative, quantitative, etc… I am very open minded in that respect, and not a nihilist.

        My point is simply that we seem to devote 90% of the time to 10% problems, and 10% of the time to 90% problems like fishing, reproducibility, unreliability, human error, checklists, the accumulation of knowledge, its practical use, etc… Which is to say, we ought to focus more on being somewhat right than precisely wrong; on practical implications than on theoretical niceties.

        Indeed, I would argue your work on Type S errors goes in precisely this direction: Focus on a lower inferential resolution to highlight what is really important. For example, smoking causes cancer, but who can tell me off the top of their head what the precise estimate is, and how it adapts to specific people? All most of us (including the regulators) know is that smoking is an important cause of cancer for most people. On this basis we have built a massive regulatory system for smoking prevention and cessation (a good thing in my view).

        I doubt human organizations and institutions are currently in a position to use more precise inferential language for better optimization and policy making. Such optimization will require more delegation to computers to store, manage, and act on scientific knowledge. In areas like personalized medicine this is already happening. But I doubt it will be coming any time soon for macroeconomic management, the design of political institutions, and most things social science.

        In conclusion, I see a tension between low resolution inferential needs, on the one hand, and the increasingly high resolution of our scientific methods on the other. And in some ways these fancy tools — though incredibly useful when appropriate — are hiding important fundamental problems that, at least until recently, might have been regarded as “pedestrian”.

        P.S. I can show at a low level of resolution (e.g., using only the qualitative causal knowledge of graphs) situations where MRP will go completely wrong, and how to test for such situations.

  3. Thanks for this very thoughtful and helpful review. About statistical power, I think Geoff Cumming said it best, in the introduction to the chapter on power in his excellent new book. And I quote:

    “I’m ambivalent about statistical power. On the one hand, if we’re using NHST, power is a vital part of research planning… On the other hand, power is defined in terms of NHST, so if we don’t use NHST we can ignore power and instead use precision for research planning… However, I feel it’s still necessary to understand power…partly to understand NHST and its weaknesses. I have therefore included this chapter, although I hope that, sometime in the future, power will need only a small historical mention.” Geoff Cumming, Understanding the New Statistics, p. 321

  4. With respect to your points 2 and 3, this is fundamentally a researcher incentive problem, not a methods problem. It wouldn’t solve all of these issues, especially in observational work with well known datasets, but it seems like moving to a peer review model based on submission of theory and design (but no results) would go a long way towards eliminating the “statistical significance filter.” Researchers could still game a system like this (similarly, it would not be impossible to register a study that you had already completed…) but they’d have little incentive to, at least for publication, although post-publication concerns would still play a role (so, non-null results would probably be cited more, get more press, lead to more prestige, etc). This would make life much better for researchers (no more desk drawer problem, although I’m sure we’d just update tenure standards to compensate for this improvement), get null results (but, given good review, null results based on solid designs) into good journals, make meta-analysis easier, and so on. I know I’m not the only person who has ever suggested this idea but it seems to get surprisingly little play in this whole debate.

  5. I don’t want to get too far off topic from Dr. Gelman’s good work trying to take the Cult of Statistical Significance down a peg, but let me point out that just in the abstract, we should expect a high percentage of fashionable “priming” experiments to be non-replicable, even if they were perfectly executed according to perfect rules.

    Why? Because "priming" is basically marketing or advertising. Priming experiments are tests to see if people can be induced to do something through various stimuli. Thus, reading about these kinds of experiments is popular among people in the marketing industry, which is one reason why Malcolm Gladwell-like books reporting on priming experiments sell well at airport bookstores, where a large share of customers are involved with marketing.

    As an old marketing researcher, let me point out something fundamental about marketing: it isn’t physics. Marketing effects wear off over time. TV commercials that you found arresting and persuasive in the past would now strike you as stilted and tired. College students, who make up many of the subjects of priming experiments, are particularly sensitive to being influenced by new marketing, but they are also particularly sensitive to becoming bored by old marketing.

    Thus, it would be hardly surprising that some classic priming experiment from the 1990s — e.g., you can prime students to walk slightly slower to the elevator — might not work in the 2000s. Back in the 1990s, somebody managed to prime college students to wear lumberjack shirts and dance the macarena. That doesn’t mean you could as easily prime them to do that today.

    Of course, much of the appeal of Gladwell-type books to marketers is that he’s telling them things work this way because Science. And we all know that Science never changes, so if the marketers could only figure out the rules of Marketing Science from reading about priming experiments, then they wouldn’t have to work so hard chasing trends.

    But, that’s a pipe dream.

    • Steve:

      If marketing treatments wear off, then the interesting scientific question is not replicating an old treatment but developing a theory of what makes a stimulus catchy, and what determines its half-life. Then we want to test the theory, and see if that replicates. E.g., can we produce a song as catchy as the Macarena? And can we make it last longer in the charts? I am sure this is amenable to scientific inquiry.

      In the food industry, for example, it is not how you rhyme your fast food (in a burrito, a bun, or a bowl) so much as the harmony of sugar, salts, and fats that gets people addicted and coming back. You can predict how successful a certain food will be from its signature harmony.

      Of course, in the long run tastes change, bacteria evolve, and so on but that just calls for another layer of abstraction.

      • “Taco Bell’s Five Ingredients Combined In Totally New Way
        “Oct 14, 1998

        “LOUISVILLE, KY–With great fanfare Monday, Taco Bell unveiled the Grandito, an exciting new permutation of refried beans, ground beef, cheddar cheese, lettuce, and a corn tortilla. “You’ve never tasted Taco Bell’s five ingredients combined quite like this,” Taco Bell CEO Walter Berenyi said. “The revolutionary new Grandito, with its ground beef on top of the cheese but under the beans, is configured unlike anything you’ve ever eaten here at Taco Bell.” The fast-food chain made waves earlier this year with its introduction of the Zestito, in which the beans are on top of the lettuce, and the Mexiwrap, in which the tortilla is slightly more oblong.”

        http://www.theonion.com/articles/taco-bells-five-ingredients-combined-in-totally-ne,3781/

        Seriously, corporations know an enormous amount about how to provide the public with what it likes now. What they don’t know as much about is what the public will like next.

        • Well I don’t know much about marketing but my understanding is that it is the public that does not know what it will like next…

    • Steve:

      I agree with your point about effects not being persistent (or, more generally, that effect sizes vary by scenario: a given effect can be positive for some people in some settings and negative for other people in other settings). This is a point discussed in detail in this paper to be published in the Journal of Management.

      The framing of scientific findings as eternal truths is misleading, and I see it as related to the traditional approach in statistics textbooks (including my own) of analyzing individual studies as stand-alone entities.

      • Nice paper. I like the discussion of variation and uncertainty. Specifically a problem with much systematic variation (e.g. heterogeneity) is that effects will only be “stabilized” within narrow settings (e.g. specific people and contexts). But then we face a problem of uncertainty as those strata are typically sparse. Hence the allure of hierarchical Bayesian modelling.

        This is all great, but even here one needs to be wary of causal identification, and interpretation.

      • > framing of scientific findings as eternal truths is misleading
        How grue-esque or bleen-ish is that ;-)

        Telling folks not to pay too much attention to whether the null is in the CI may be very much like telling them to abstain from sex other than with long term partners. Arguably it is much better to inform them of the risks (Type S and M errors), especially before the act.

        By the way, when I was involved in doing cost-effect analysis, I can't remember anyone complaining about using a prior for the unknown effect size, even if they were very anti-Bayesian. The base case, which did not include getting the direct of the effect wrong, was briefly written up (Detsky AS. JAMA. 1989;262:1795-1800).

        • Given the typos above, maybe I am not fully wake yet.

          cost-effectiveness analysis of funding RCTs

          get the direction of the effect wrong

  6. Two of the biggest rock guitarists of the 1970s, Brian May of Queen and Tom Scholz of Boston, are men of STEM personalities. May recently acquired a Ph.D. in astrophysics and Scholz had two degrees in mechanical engineering from MIT and was an important figure at Polaroid.

    It would be interesting to ask them about replicability in popular music. Judging from their career arcs — for example, when I moved in 1982, I found that while I could get 50 cents or a dollar for most of my unwanted records, I couldn’t even give away my Queen albums — I don’t think May or Scholz had all that much more insight into how to replicate the effects of priming audiences than their less scientific rivals.

    • To be more precise, I’m sure that May or Scholz could make up a highly insightful list of the (more or less) necessary conditions for 1970s rock stardom, but I doubt if they could tell you the sufficient conditions. The latter has more to do with what teenagers thought was cool at the moment, and thus is highly contingent.
