“P-hacking” and the intention-to-cheat effect

I’m a big fan of the work of Uri Simonsohn and his collaborators, but I don’t like the term “p-hacking” because it can be taken to imply an intention to cheat.

The image of p-hacking is of a researcher trying test after test on the data until reaching the magic “p less than .05.” But, as Eric Loken and I discuss in our paper on the garden of forking paths, multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.
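To put a rough number on the "test after test" image, here is a minimal sketch. It assumes that every null hypothesis is true and that the k tests are independent, which is a simplification (real tests on one dataset are usually correlated). The function name `familywise_error_rate` is made up for this illustration.

```python
def familywise_error_rate(k, alpha=0.05):
    """Chance that at least one of k independent tests of true nulls gives p < alpha.

    Assumes independence across tests; correlated tests would give a smaller rate.
    """
    return 1 - (1 - alpha) ** k


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(familywise_error_rate(k), 3))
```

Under these assumptions the chance of at least one "significant" result is about 23% with five tests and about 64% with twenty, even though nothing is going on.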

I worry that the widespread use of the term “p-hacking” gives two wrong impressions: first, it implies that the many researchers who use p-values incorrectly are cheating or “hacking,” even though I suspect they’re mostly just misinformed; and, second, it can lead honest but confused researchers to think that these p-value problems don’t concern them, since they don’t “p-hack.”

I prefer the term “garden of forking paths” because (a) it doesn’t sound like cheating is necessarily involved, and (b) it conveys the idea that the paths are all out there, which is essential to reasoning about p-values, which are explicit statements about what would’ve been done, had the data been different.
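The point that a single analysis can still be a multiple-comparisons problem can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are pure noise, the outcome has known unit variance (so a z-test applies), and the hypothetical researcher follows a deliberately simple rule, testing whichever of three comparisons (pooled, men only, women only) looks most striking in the data at hand. The names `simulate`, `z_stat`, and `two_sided_p` are invented for this example.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_stat(treated, control):
    """z statistic for a difference in means, assuming known unit variance."""
    diff = sum(treated) / len(treated) - sum(control) / len(control)
    return diff / math.sqrt(1 / len(treated) + 1 / len(control))


def simulate(n_sims=10000, n_per_cell=25, alpha=0.05, seed=1):
    """Return false-positive rates for a fixed plan vs. a data-dependent one."""
    rng = random.Random(seed)
    hits = {"preregistered": 0, "forking": 0}
    for _ in range(n_sims):
        # Pure noise: the treatment does nothing for anyone.
        men_t, men_c, women_t, women_c = (
            [rng.gauss(0, 1) for _ in range(n_per_cell)] for _ in range(4)
        )
        z_pooled = z_stat(men_t + women_t, men_c + women_c)
        z_men = z_stat(men_t, men_c)
        z_women = z_stat(women_t, women_c)

        # Fixed plan: always test the pooled comparison.
        if two_sided_p(z_pooled) < alpha:
            hits["preregistered"] += 1

        # Forking paths: run ONE test, but on whichever comparison
        # looked most striking in this particular dataset.
        z_chosen = max((z_pooled, z_men, z_women), key=abs)
        if two_sided_p(z_chosen) < alpha:
            hits["forking"] += 1
    return {name: count / n_sims for name, count in hits.items()}


if __name__ == "__main__":
    print(simulate())
```

In each simulated dataset the "forking" researcher runs exactly one test and could truthfully say so, yet the false-positive rate comes out well above the nominal 5%, while the fixed plan stays close to 5%. The excess comes entirely from the analysis that would have been done had the data been different.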

In the ideal world we wouldn’t be talking about any of this stuff; but, given that we are talking about it, I’d prefer we keep the insights of Simmons, Nelson, and Simonsohn but get rid of the term “p-hacking.”

21 thoughts on ““P-hacking” and the intention-to-cheat effect”

  1. I can assure you, there are labs/PIs that do the p-hacking you’re afraid of calling out.
    The process I was taught: Run a few subjects, look at a few different dependent variables, find the most promising, tweak the methodology till it’s “right,” then run the required number of subjects with the new methodology till you’re significant.
    It wasn’t till I did my own reading in the subject that I learned how wrong that was.

    • Agreed. This was popular in my psychology department. Find a significant effect and then work backwards in the theory and sprinkle a little bullshit on it to make the results and theory jibe.

  2. I see the value in both terms but I agree that they should not be used interchangeably. P-hacking to me denotes the kind of “deep dive” into the garden of forking paths that the Wansinks of this world (and also Dana Carney) have basically admitted to doing. The garden of forking paths, on the other hand, is an issue for all researchers. I am already trying to think of alternative analytic specifications in all my current work and am at least aiming to present a multiverse analysis in the supplemental materials of all future submissions. How reviewers will feel when any inference is not perfectly robust to data analytic specifications remains to be seen. It’s already led to some very interesting discussions with graduate students about what the full range of plausible alternative specifications is for any analysis.

  3. I like the term p-hacking because it has some teeth to it. It sounds like something you want to avoid. Garden of forking paths doesn’t sound like a big deal.

  4. I would keep the terms “p-hacking” and “forking paths” separate for exactly the reasons that you prefer the latter term over the former. Forking paths does not involve researchers *deliberately* hiding the contradictory nonsignificant results, whereas p-hacking does involve this intentionally selective reporting. So, like you say, “p-hacking” does give the impression of cheating, and its distinction from “forking paths” demonstrates that there can still be a problem even in the absence of this cheating, which is the key message I take from your forking paths paper.

  5. What would you call it when single-subject designs are appropriate, but a group design is used anyway? Perhaps it should depend on the “motives” of the researcher…if misinformed, call it “stupid.” If the goal is to get away with much less work, call it “Freakin’ lazy.”

    • Glen:

      I discussed the issue here. Researchers often use between-subject designs when they should use within-subject designs. I think this comes from a combination of four factors: (a) misguided concerns about bias, (b) lack of understanding of the importance of variance, (c) lack of realization that measurement itself is a challenge, and (d) research practices that allow the declaration of statistical significance from almost any possible dataset.

      I’ve spoken about this topic many times. I agree it’s important and I plan to write more about it and to do some research on the quantitative tradeoffs.

      • Yes, yes…but it doesn’t do a good job promoting SSDs (it is, though, just a brief comment), and it endorses Normand’s poor scholarship when he says: “Psychology has been embroiled in a professional crisis as of late. . . one problem has received little or no attention: the reliance on between-subjects research designs…”

        Err…excuse me? Beginning in the early ’30s, perhaps the most influential experimental psychologist of all time, B. F. Skinner, started the experimental analysis of behavior in which the methodology was always single-subject (“within-subject designs” is not, IMO, a good term as it conjures up the vision of RMANOVAs). Skinner’s work exemplified SSDs and he discussed “the other way to do behavioral research” dismissively and not in much depth. But behavior analysts quickly treated the issues of measurement, variance, reliability and generality systematically, and famously (big fish, small pond) as in Sidman’s Tactics of Scientific Research (1960). So, the issue of SSDs has received considerable attention and forms the basis of the natural science of behavior which is approaching 90 years old. And Sidman’s book is only the most famous – thousands of pages have been written on this topic. The *fact* is that it has received little attention from mainstream psychology which has, speaking colloquially, willfully ignored, and actively misrepresented, virtually everything about the science and its philosophy. Normand’s quote reflects this tradition and saying he “wasn’t aware” says nothing – it is the scholar’s duty to “be aware.” There are other things upon which I could comment concerning the brief piece and the post to which this is a response, but I’ll leave it there for now.

  6. I’m not thrilled with any of these terms, mainly because they all misguidedly diminish the importance of exploratory data analysis, incorrectly making it seem wrong and bad. Instead, exploratory data analysis is utterly essential to science — it’s hard to learn anything unexpected without it! Andrew does it all the time on this blog (e.g., the mortality data). An alternative view of the replication crisis is that, due to perverse incentives, folks are trying to disguise exploratory analyses as confirmatory. Recognizing the importance of exploratory analyses, teaching how to do them right, writing textbooks about them, rewarding them (with academic credit) and maybe even establishing a norm that most confirmatory studies should include results of a preceding exploratory study would, in my view, go a long way toward resolving the replication crisis (at least in much of the social sciences).

    As for the term “garden of forking paths”, I had no idea what it meant, and had to think through all the issues myself. I suspect it is clear only to those who already know what it means.

    • Ed:

      I think exploratory work is great. I do it all the time, not just on this blog. Just about all the applied work I’ve ever done is exploratory. I’ve published exactly one preregistered replication in my entire life. So I think we’re basically in agreement.

      I’d love to stop talking about the garden of forking paths. But we need this concept to get past this sort of discussion:

      Scholar X: We discovered some amazing fact (for example, single women were 20 percentage points more likely to support Barack Obama during a certain time of the month).

      Me: No waaaay! This is ridiculous, it contradicts everything we know about public opinion. And patterns like this arise from noise allllll the time.

      X: You have no choice but to accept that the major conclusions of these studies are true. The p-value is less than .05.

      Me: “p less than .05” tells me nothing. You have tons of researcher degrees of freedom. You’re p-hacking!

      X: No, I only did this single analysis on my data, and that’s what I published. I didn’t hack.

      Me: That doesn’t matter. The p-value is a statement about what you would’ve seen had the data come out different. That’s the garden of forking paths. I cannot take your data analysis seriously, because I have every reason to think that, had the data been different, you would’ve done different analyses. You didn’t choose your data processing and analysis plan beforehand.

      If people would stop justifying their claims using null hypothesis significance testing, I could shut up about the garden of forking paths.

      • “If people would stop justifying their claims using null hypothesis significance testing, I could shut up about the garden of forking paths.”

        Yes we could get back to real actual science, where they specify a model for how public opinion comes about, and then fit its unknown parameters using Bayes and I can come along and say “I think your model *of what happens* is totally unrealistic, here’s an alternative, and by the way it fits reality better”

        Instead we’re stuck with millions of researchers essentially saying “ooh look at this shiny slot machine, it produces random numbers that couldn’t possibly be boring, I have a far superior slot machine to your null slot machine!”

      • Yes, I now understand the logic of forking paths, but the term (and your paper) didn’t really help me get there (except that, importantly, they forced me to think about the problem). Far from saying shut up about forking paths, I think the forking path problem is far more pervasive than sexy claims like voting patterns or the power pose. I, and most of my colleagues I would guess, would see a pattern in our data, e.g., females have higher levels of X than males, and wonder, is that “significant”? and do a standard NHST, a fork, as we did not think to test that prior to seeing the data. This happens ALL THE TIME. It is routine. The irony is, noticing that females are unexpectedly higher than males on X is a legitimate exploratory observation, one that the researchers might want to test in a future (preregistered) confirmatory study.

        My question is: is “forking paths” the best way to teach this concept? It wasn’t for me.

  7. I would add that hacking has both positive and negative connotations, but mostly positive if you look at how the word has been used over the past 50 years. Look at the way “hacks” and “hacking” have replaced “tips” in Internet media vernacular, conveying a sense of technical insight into otherwise ineffective systems. Google “cleaning hacks” or “life hacks.” “Tip” is just so… analogue.

    Even “hack” as applied to computing has a positive meaning, viz, “An inelegant yet effective solution to a computing problem; a workaround, a short cut, a modification.” (OED). And “a computer hack” denoted someone skilled, unlike the disappearing journalistic noun. Hacking has been extended to “unauthorized access” in its most obvious negative use; but that’s mostly it. I doubt whether most people outside of statistics would assume that “p-hacking” is something bad because it uses the word hacking; there is no semantic clue to suggest that the solution is false. Indeed, one could argue that p-hacking is a contradiction in terms unless you define “p” as publication in an academic journal.

    The “garden of forking paths” is also problematic in that it is counterintuitive: a garden is usually enclosed—and even when not enclosed, there is always an implied boundary, spatially, or in terms of property ownership. A garden is definable and knowable—and often symmetric and planned. A forking path will take you back to where you began—or a terminus where you have to retrace your steps. To take one fork over another is not to enter a different garden, incommensurable with the garden you would have entered if you’d chosen the other fork. Garden theory also occupies an important role in the history of art and architecture with arguments over formality vs openness. Wild landscape was Nature; the Garden was Art. Then Capability Brown, Edmund Burke and others advocated against regularity in garden design. And so on. My point is that for most people, all these meanings are more easily accessed and make more sense than Borges’ garden of forking paths.

    One can argue whether this is a successful or unsuccessful metaphor for Borges’ metaphysical explorations within the story. It is possible that it is meant to be intentionally subversive—an ironic comment on the fictionality of fiction in that for all the possible worlds imaginable in the story text, the text itself is a “bounded” garden, which is to say, a narrative without forking paths. The garden of forking paths is a linguistic illusion.

    But I think it is a challenge to interpret statistical practice through Borges’ use of the phrase and not just because Borges’ meaning is slippery, or because you are engaging in data analysis and not metaphysics. Few people will have read his story; therefore, they will fall back on making sense of the phrase through other uses of these terms.

    As for “researcher degrees of freedom,” how many people will understand the idea of delimitation versus a panoramic perspective? It too hobbles on a common “sense” reading.
