Ethan Ludwin-Peery writes:

I finally got around to reading The Nurture Assumption and I was surprised to find Judith Rich Harris quite lucidly describing the garden of forking paths / p-hacking on pages 17 and 18 of the book. The edition I have is from 2009, so it predates most of the discussion of these topics, and for all I know this section was in the first edition as well. I’ve never heard this mentioned about JRH before, and I thought you might be interested.

Here’s the passage from Harris’s book:

It is unusual for a socialization study to have as many as 374 subjects. On the other hand, most socialization studies gather a good deal more data from their subjects than we did in our IQ-and-books study: there are usually several measurements of the home environment and several measurements of each child. It’s a bit more work but well worth the trouble. If we collect, say, five different measurements of each home and five different measurements of the child’s intelligence, we can pair them up in twenty-five ways, yielding twenty-five possible correlations. Just by chance alone, it is likely that one or two of them will be statistically significant. What, none of them are? Never fear, all is not lost: we can split up the data and look again, just as we did in our broccoli study. Looking separately at girls and boys immediately doubles the number of correlations, giving us fifty possibilities for success instead of just twenty-five. Looking separately at fathers and mothers is also worth a try. “Divide and conquer” is my name for this method. It works like buying lottery tickets: buy twice as many and you have twice as many chances to win.

And that’s not even the whole story, as she hasn’t even brought up choices in data coding and exclusion, and choices in how to analyze the data.
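Harris’s arithmetic checks out in simulation: run a pile of “socialization studies” on pure noise, with 374 subjects, five home measures, and five child measures each, and on average about 25 × 0.05 = 1.25 of the twenty-five correlations come out “significant.” A minimal sketch (the Fisher z-test here is just a stand-in for whatever test a real study would use):

```python
import math
import random

random.seed(1)

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def significant(r, n):
    # Fisher z-transform test of rho = 0 at the two-sided 5% level
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    return abs(z) > 1.96

n_subjects, n_trials = 374, 100
counts = []
for _ in range(n_trials):
    # 5 home measures and 5 child measures, all pure noise
    homes = [[random.gauss(0, 1) for _ in range(n_subjects)] for _ in range(5)]
    kids = [[random.gauss(0, 1) for _ in range(n_subjects)] for _ in range(5)]
    counts.append(sum(significant(pearson_r(h, k), n_subjects)
                      for h in homes for k in kids))

# On average about 25 * 0.05 = 1.25 spurious "findings" per study
print(sum(counts) / n_trials)
```

The printed average should land near 1.25, and many individual runs have two or more spurious hits, exactly the “one or two” Harris mentions.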

I replied that we’ve been aware forever of the problem of multiple comparisons but we didn’t realize how huge a problem it was in practice, and Ludwin-Peery replied:

Indeed! The most surprising thing was that she seems to have been aware of how widespread it was (at least in socialization research).

So is this “forking paths” or “p-hacking”? As she writes about it, it seems to me to be the latter.

I have been influenced by your “forking paths” work, although I don’t think you’ve ever provided a clear definition. In a recent talk (in which I presented a multiverse analysis), I gave my own definition (I’ll be curious what you think): data-analytic decisions that are not theoretically compelled, and that we stick with because they work out.

The difference between what Harris writes and my definition is that the former involves searching for statistical significance, whereas the latter is less about searching and more about ignoring alternative paths.

Michael:

“Forking paths” is a general term referring to the idea that, had the data been different, the analysis could’ve been different as well. When speaking of forking paths, there is no assumption one way or the other about whether the different analysis choices were planned, or whether multiple analyses were done on the existing data. In our paper we focus on the setting where only one analysis was done on the existing data, but the principles hold more generally.

Alternatively: The Smorgasbord of Analysis

Alternatively: the Garden of Forked Paths is the reductio ad absurdum of Frequentist philosophy.

To clarify: you don’t actually need to eat a little of every possible analysis, you can just choose the ones that look good, and claim that they were the lunch you always intended to have in the first place.

Sometimes it’s all pickled herring.

Or smoked herring

Or red herring

The same physical reality, same data, and same calculated numbers mean something different if you would have calculated something else in a different universe with different data.

There are two options here. Either Frequentism makes sense and so does the Garden of Forked Paths, in which case we’re going to need to know a lot more about the Multiverse to make statistical inferences reliably. Or Frequentism is wrong and flawed in principle, and the Garden of Forked Paths is an all-purpose, always-applicable explanation for the vast and numerous failures of Frequentism in practice.

Andrew gives no proof the “Garden of Forked Paths” explanation is correct, and it’s hard to imagine how anyone could prove it. The feeling most people get that it’s “true” seems to stem from the very strong correlation between the “Garden of Forked Paths” and bogus Frequentist conclusions. But this correlation is no mystery. The “Garden of Forked Paths” is always plausibly present, and the vast majority of Frequentist results are bogus, thus the correlation will necessarily be close to 1.

Anon:

Forking paths is the claim that, had the data been different, the analysis might have been different. The claim is a counterfactual and, as such, can never be proven. The reason why we think it is true is that when researchers are observed analyzing very similar problems but with different data, they often do a different analysis with each dataset. This can be seen, for example, in the ESP paper by Daryl Bem (several experiments with similar forms but with different analyses each time), in Brian Wansink’s papers (different experiments with different data-exclusion rules each time), and in Satoshi Kanazawa’s two papers on beauty and sex ratio (pulling out different data summaries and finding something statistically significant each time).

Another way to put it is that the counter-claim, that the analysis would’ve been exactly the same had the data been different, is highly implausible given real-world research practices, so if someone wants to make that claim, I think they should have very strong evidence for it!

“Forking paths is the claim that, had the data been different, the analysis might have been different.”

In other words:

1) data acquisition is the same in each experiment.

2) calculation method is the same in each experiment.

3) researcher varies the parameter comparisons until one that is “significant” is found.

Isn’t that p-hacking, though?

Why not split the data before performing the analysis and perform the same analysis on both splits?

Jim:

I think you are asking two questions here.

Your first question is about forking paths and p-hacking. For that, I’ll point you to our article. In your story above, step 3 is one way to have forking paths, but it’s not the only way. There are many researcher degrees of freedom: decisions of which data to include/exclude, decisions of how to code each variable, decisions of what variables to include in the analysis, decisions of what interactions to study, decisions of what to compare, etc etc etc. As we discuss in our paper, a researcher might make just one choice of all these for a particular dataset, but had the data been different, the researcher could’ve made other choices.

Your second question is about data splitting. Yes, that’s one statistical tool that can be used. Sometimes this can be a good idea; other times it won’t work so well. For example, if I have data from a survey to estimate public opinion, I could split the data and get two estimates, one from the first half and one from the second, but then at the end of the day I’m just gonna combine them anyway.
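To make the survey example concrete: with two equal-sized splits, averaging the two estimates just reproduces the full-sample estimate, so the split changes nothing about the final number (a toy sketch; the 60% support rate and sample size are invented):

```python
import random

random.seed(0)

# Toy survey: 1000 respondents, each answering yes (1) or no (0);
# the 60% support rate is made up for illustration.
responses = [1 if random.random() < 0.6 else 0 for _ in range(1000)]

half_a, half_b = responses[:500], responses[500:]
est_a = sum(half_a) / len(half_a)    # estimate from the first split
est_b = sum(half_b) / len(half_b)    # estimate from the second split
combined = (est_a + est_b) / 2       # equal splits: simple average

full = sum(responses) / len(responses)  # full-sample estimate
print(est_a, est_b, combined, full)     # combined matches full
```

The split still has value as a check, as the commenter below notes: comparing est_a with est_b shows the sampling variability, even though the combined point estimate is unchanged.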

Thanks Andrew

First: OK, yes, #3/#4. In your paper, you say:

“a different test would have been performed given different data”

What you mean here is:

“a different test would have been performed given different *values*.” Right? When you use the word “data,” it’s not clear whether you mean “features”/“variables” or “values.”

I guess I’m still feeling around for a more fundamental principle that captures “researcher degrees of freedom”.

Second: splitting is a check on the result. In every other field of science it’s routine to split samples, even in analytical chemistry, where the result of the analysis is hardly in doubt most of the time. So sure, if you split a sample in a gold assay you combine the results for the grade and tonnage calculation, but you report the splits separately to demonstrate the variance in the sampling and analysis.

Andrew:

But when I read your forking paths paper I can’t help but notice all the other probs w/ the papers you and your coauthors critique.

This time I dug up Peterson et al. So I’m reading:

“…a highly significant interaction effect in all three countries—Argentina: F(1, 98) = 7.83, p = .003, r2 = .082; United States: F(1, 201) = 6.22, p = .007, r2 = .032; Denmark: F(1, 414) = 9.70, p = .001, r2 = .124 (one-tailed ps). “

OK, so you look at r2 and then look at their graphs and you’re like, huh? Look at the way they squeeze the X axis to make the slope steeper! Freakin’ hilarious. Or tragic. So for all practical purposes, as far as I can tell, they have two groups of men. Men of high standing tend not to support redistribution; men of low standing do. Arm strength is irrelevant. The slope of the correlation is not reported (not surprisingly), but it is very shallow. Y doesn’t depend on X.

So that’s just bad science from the get-go. There’s much to criticize beyond researcher degrees of freedom. They blatantly ignore (and may be seeking to hide) very big and loud hints that their conclusions are wildly inaccurate.

Jim, I think the difference between the Forking Paths critique and the p-hacking technique is that Forking Paths is not a claim that you actually *calculated* the p values and threw out those analyses that didn’t give you what you want… This is like tasting every item on the Smorgasbord… instead Forking Paths is saying that researchers can “get a feel” for which analyses are likely to produce the kinds of results they want to see, and then just choose those analyses. This is basically like walking around the whole Smorgasbord and saying to yourself “that looks tasty”… later when you pick the thing off the table that looks the most tasty, it’s not surprising that it is tasty.

If there are 20 items on the Smorgasbord, and you choose 4 of them, you can make 4845 different possible plates. Suppose that on the Smorgasbord are 12 very obviously “specialty” items that only crazy Swedish people eat ;-) so there are choose(20-12,4) = 70 “tasty” plates for a non-specialty foods eater… That you can pick out those 70 possible tasty plates without actually trying all the unusual items is unsurprising, even though the probability of choosing one of those plates “at random” is only 70/4845 = 0.014
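Daniel’s plate counts are straightforward binomial coefficients, easy to verify:

```python
from math import comb

total_plates = comb(20, 4)       # all 4-item plates from 20 smorgasbord items
tasty_plates = comb(20 - 12, 4)  # plates avoiding the 12 "specialty" items
ratio = tasty_plates / total_plates
print(total_plates, tasty_plates, round(ratio, 3))  # 4845 70 0.014
```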

Daniel,

OK, thanks, that makes a bit more sense to me.

“The edition I have is from 2009, so it predates most of the discussion of these topics, and for all I know this section was in the first edition as well.”

I have an early hardcover edition (The Free Press, copyright 1998, 2nd printing, if I’m reading that correctly) and can confirm it’s in there, albeit on p. 19.

I think there may be earlier precedents for discussion of similar issues. One that comes to mind is

https://projecteuclid.org/download/pdf_1/euclid.ss/1009212243

about the Bible Code, in 1999.

Jon:

Oh, yes, the idea of fishing for statistical significance comes up a lot; it’s commonplace enough that Deb Nolan and I had a couple examples of it in our Teaching Statistics book. The thing that is new in the past decade or so is that forking paths, p-hacking, etc., are pervasive enough to cause us to cast doubt on entire subfields of science. It’s not just Bible Code, ESP, etc.

Similar issues have abounded in cosmology for quite some time.

The CMB (Cosmic Microwave Background) has been studied for quite some time to find statistical occurrences that are rare according to the standard model. But there are a plethora of such things one might look at. Given enough time and creativity, it’s almost certain that there will be anomalies found (perhaps the well known ‘cold spot’ in the CMB is just one of these).

So, how are these issues dealt with in cosmology? In the areas I work in, one would deal with this using holdout samples. But it’s hard to have a holdout universe (insert multiverse joke here). Do you look at one part of the sky for anomalies, and then replicate in another area of the sky, or … ?

It’s not dealt with in a particular way, except perhaps that people regard claims of statistical anomalies with suspicion! A holdout set on parts of the sky might work, except that many of the anomalies are spatial correlations, which makes such an approach more complicated.

A lot of cosmologists lean Bayesian; the idea that you’ve only got one universe is perhaps why.

This is very common in engineering too. It is driven in part by purely social dynamics coupled with low-quality training. The problem is that many researchers aim to confirm their own biases so as to publish more. They view statistical methods as tools for validating what they have done, much in the same way a small child views getting points in a video game. Divide-and-conquer methods seem quite natural from such a perspective, and their utility is confirmed by the results, which are judged on publishability, not scientific quality. Operationally, significance means publishable. When journals crack down on such practices, this is treated as clubbish elitism and the response is to start some new journals…

+1

I think there is nothing wrong in exploring the garden of forking paths if one does not look obsessively for “significance,” treating p instead as a descriptive statistic among others. If good theories are not available, as is commonplace in social science, exploring is the main thing an honest researcher can do. As long as one is aware of what he or she is doing, I can see no harm in that. As some of the above comments show, such issues can be found in the “hard sciences” as well. Humans are flawed, their theories are incomplete; let’s embrace humility and accept the uncertainty of exploration, which may also include exploring the garden. That’s my very humble opinion.

The problem with “exploring the garden of forking paths” is that it can only be a description of the data. Everything is correlated with everything else. It is fine to describe that, it is not fine to call it a theory. After I have seen all of the points on the graph, I can connect them with an infinite variety of lines, but now I have no a priori reason to favor any particular line fitting the data. If I start with the line, and collect the data, and the points start to fall on or near the line, then I can say, it looks like I was on to something. Post hoc analyses just don’t get to call themselves “theories” or “explanations.” That, I think, is the point about the garden of forking paths.

Well put.

It’s kind of a shame that people link the ‘garden of forking paths’ and ‘p-hacking’ to frequentism specifically. Frequentism gets the heat since it is the most successful and popular method of analysis. But of course the ‘garden of forking paths’ applies to any analysis approach or set of decisions; just call all of that X. Now if you were to have done Y instead, you may have obtained a different answer, a different level of statistical significance or Bayes factor or posterior probability or whatever your method of assessment is (the p in p-hacking can also mean “prior”; just call it ‘summary statistic hacking’?), and made a different decision.

Justin

Or you could just create intervals for any parameters of interest which describe the range of possible values consistent with the evidence you have (i.e., Bayesian intervals; CIs are not the same thing, despite being numerically similar in the simplest textbook cases) and leave it at that.

Then it wouldn’t matter what else you did, or thought about doing, or would have done in a different universe.

Each time you make a decision (either significance test or Bayes factor test) you are in effect collapsing the plausible range for a parameter down beyond what the evidence allows.

This can be legitimate as an approximation. For example, you might have a distribution for lambda which is very sharply peaked about 0 and then, in effect, replace it with a delta function about 0. Doing so usually doesn’t lead to much error.
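That approximation claim is easy to check numerically: replacing a sharply peaked distribution with a point mass at its peak costs almost nothing, while doing the same to a spread-out distribution is badly wrong (a sketch with made-up distributions; cos stands in for any smooth function of the parameter):

```python
import math
import random

random.seed(2)

def plugin_error(sd, n_draws=100_000):
    # |E[cos(lambda)] - cos(0)| when lambda ~ Normal(0, sd) is replaced
    # by a point mass (a delta function) at 0
    total = sum(math.cos(random.gauss(0.0, sd)) for _ in range(n_draws))
    return abs(total / n_draws - math.cos(0.0))

sharp_error = plugin_error(0.01)  # sharply peaked about 0
spread_error = plugin_error(1.0)  # quite spread out
print(sharp_error, spread_error)  # roughly 5e-5 vs. 0.39
```

Since E[cos(lambda)] = exp(-sd²/2) for a normal distribution, the sharp case is off by about 5 × 10⁻⁵ while the spread-out case is off by about 0.39: the same collapse-to-a-point move, harmless in one setting and terrible in the other.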

Frequentist testing gets this “approximation” horrifically wrong in practice because, in its infinite stupidity, it collapses quite large/spread out distributions down to a point.

But even if you make a good approximation in a single example, you can still run into problems if you make a lot of them. A large number of individually “good” approximations usually leads to a globally “bad” approximation.

That’s all that’s really going on here. The Garden of Forked Paths and all the rest is pure BS. Frequentist adjustments for multiple testing are, in effect, a horrifically bad attempt to make the overall/global approximation still be good after having made a series of (usually bad) smaller approximations.

Really, the key here is that whenever you make what’s called a “significance test” you are in effect collapsing a distribution down further than the evidence allows.

That’s the key to the whole thing. As soon as you realize that, you can figure out when it will get you into trouble and when it won’t. The fog of nonsense surrounding this subject clears completely.

I am working on a blog example like this. Basically I do a Bayesian-type analysis on a two-parameter model, the mean and standard deviation of a normal distribution. I use a likelihood based on the indicator function for a frequentist p-value… and I get a posterior that is essentially an indicator function on the two-dimensional (mu, sigma) space; it looks like a wedge.

Now, if you sample uniformly in this mu,sigma space and look at the marginal for mu, you get a peaked distribution. Not an indicator function on the region that is compatible with the test, but a peaked distribution. This reflects the fact that there is uncertainty in the standard deviation, and if the real standard deviation in the population were large, we’d be unable to reject very different mu values, but if it’s small, then the mu value we get in our sample is pretty close to the right answer.

The frequentist test for mu, a t-test, essentially takes this distribution over possible mu values and converts it from a t-type peaked distribution to an indicator function: “inside” the interval = 1 vs. “outside” = 0, accept at the 5% level or reject at the 5% level.

It refuses to weight the region on the interior of the interval differently from the region towards the edges, even though for any discretization of the possible mu,sigma values, there are many many more boxes which are compatible with the test which are near to the sample mean than there are boxes out far away from the sample mean.

The Bayesian answer just keeps track of the fact that there are many many more possibilities near the sample mean. The density over the mu,sigma space doesn’t even have to vary, as it doesn’t in this case.
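A minimal sketch of the construction Daniel describes, with made-up data and a z-style acceptance region standing in for the t-test: the set of (mu, sigma) pairs the test accepts is a wedge, and counting accepted sigma values at each mu gives a marginal for mu that is peaked at the sample mean rather than flat:

```python
import math
import random

random.seed(3)

# Made-up data: n draws from a hypothetical population
n = 10
data = [random.gauss(5.0, 2.0) for _ in range(n)]
xbar = sum(data) / n

# Indicator "likelihood": keep (mu, sigma) iff a z-style 5% test of mu
# would NOT reject, i.e. |xbar - mu| <= 1.96 * sigma / sqrt(n).
# In the (mu, sigma) plane this acceptance region is a wedge.
def in_wedge(mu, sigma):
    return abs(xbar - mu) <= 1.96 * sigma / math.sqrt(n)

# Uniform grid over (mu, sigma); the marginal for mu counts how many
# sigma values are accepted at each mu.
mus = [xbar + (i - 200) * 0.02 for i in range(401)]  # mu in xbar +/- 4
sigmas = [0.01 + 0.02 * j for j in range(300)]       # sigma in (0, 6)
marginal = [sum(in_wedge(mu, s) for s in sigmas) for mu in mus]

# The count is largest at mu = xbar (index 200) and falls to zero
# at the edges of the grid: a peaked marginal, not a flat indicator.
print(marginal[0], marginal[100], marginal[200])
```

Nothing here depends on the particular data: the wedge is symmetric about the sample mean, and there are simply more accepted (mu, sigma) boxes near xbar than far from it, which is exactly the peaking described above.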

I will say though, that the Forking Paths question addresses more than just the Frequentist interval type nonsense. It also addresses the problem that there are many areas of science where people simply don’t spend any effort to make theories. In fact, they’ve basically outsourced quantitative theorizing to NHST. Their theory is “whatever has p less than 0.05 is my theory”. That’s not a theory.

Sure, lots of people have “theories” like “identifying as conservative makes you more likely to be afraid of strangers” or some such thing… but that’s not a theory either. In essence, at best that’s an observation, and only if you get a big and broad sample. Because in small and poorly designed samples that might just be a bias in your sampling methodology!

In many areas of science, where the theory is about averages in populations of humans, we don’t even have observations. It’s not even clear if there’s something there to study.

I mean, it’d be one thing if you have a good quality Census sample that shows that people who identify as Conservative also answer certain questions indicating that they distrust strangers more often than people who identify as liberal… Now you have an observation. At this point you know that there’s something to even study… At this point you can start theorizing a relatively fundamental aspect of society that might cause such a thing… and see what else it might predict, and then see if you can also observe THAT prediction coming true.

Much of this nonsense we see isn’t even at the point where we know there is something there to study… power pose for example seems like a thing that doesn’t even exist.

“power pose for example seems like a thing that doesn’t even exist.”

Except in a lot of people’s fantasy lives.

Daniel:

I’d say that power pose does exist because it’s a thing that people do. And I can believe that power pose has effects because just about anything can have an effect. Meditation has an effect, right? But we’ve never seen any good evidence that power pose has large and consistent effects. Rather, I suspect the effects are positive in some settings and negative in others.

It’s frustrating that the power pose partisans get so angry when people make this innocuous claim. It’s that thing of the distinction between truth and evidence. There’s no evidence for large and consistent effects, but that doesn’t mean that effects are zero.

Sure, everything has an effect; butterflies flapping their wings alter the rainfall in Indonesia next year. But if the effects are unpredictable, then why power pose rather than, say, doing some stretches, chewing some caraway seeds, imagining yourself skydiving, or playing a little D&D? It’s the consistency and reliability that make a thing a thing, rather than just some stuff that happened.

Daniel:

I’m totally guessing here, but I’d guess that power pose or meditation would have a larger effect than chewing caraway seeds or the flapping of a butterfly. The point is that people do power pose deliberately to affect mood, so I’d expect it to have some effect on mood. One of the problems I see with the published power pose research is that it seems to be framed so mechanically, as if doing the pose is like taking a pill or something. I mean, sure, if this had shown a large and consistent effect for real, that would’ve been interesting. But, given that it didn’t, I think that, for people studying such things, it would make sense to really go with the idea that the intervention is directly intended to change mood, rather than to consider it as some sort of evolutionary-themed body hack.

It’d be interesting to tell people that Caraway seeds had been shown to enhance testosterone and mood and to have a positive effect on social interactions during negotiations, and see what happened… You are after all trying to affect a person’s thoughts and feelings, and one way to affect those thoughts and feelings is to convince them that doing something is going to affect their thoughts and feelings…

Just simply reporting the intervals/range of plausible values based on the evidence, or the full distribution if you like, avoids all of this. This is in fact how science used to function before Frequentist testing came along.

When Rømer first made an estimate of the speed of light, it had a very wide error bar. They just reported the wide range. As methods for determining the speed of light became better, smaller intervals were reported. Eventually the range of plausible values for the speed of light was so small that we could in effect assume a point value for most purposes. (Note: the speed of light was eventually redefined to be an exact value.)

At no point did Rømer or any other physicist have to consider what other things they were working on at the time. If they were also measuring the gas constant, they didn’t have to adjust for multiple testing. They just reported the interval/range of plausible values for the gas constant as well.