## Garden of forking paths – poker analogy

[image of cats playing poker]

Someone who wishes to remain anonymous writes:

Just wanted to point out an analogy I noticed between the “garden of forking paths” concept as it relates to statistical significance testing and poker strategy (a game I’ve played as a hobby).

A big part of constructing a winning poker strategy nowadays is thinking about what one would have done had they been dealt a different hand. That is to say, to determine whether betting with 89hh in a certain situation is a “good play,” you should think about what you would have done with all the other hands you could have in this situation.

In contrast, many beginning players will instead only focus on their current hand and base their play on what they think the opponent will do.

This is a counterintuitive idea and it took the poker community a long time to think in a “garden of forking paths” way, even though Von Neumann used similar ideas to calculate the Nash equilibrium of simplified poker games a long time ago.

So I’m not too surprised that a lot of researchers seem to have difficulty grasping the garden of forking paths concept.

Excellent point. Statistics is hard, like knitting, basketball, and poker.

1. Cameron Brick says:

Yes, good point. And so important you left it out: repetition. Poker players are tempted to ask themselves: do I want to play THIS hand? Instead, they should ask themselves: given the expected value of winning this hand and the probability of my winning, should I play this hand 1000 times?

2. Garnett says:

I saw this and thought of the classic WWII problem of deciding how to modify airplane armament based on the appearance of the surviving aircraft.

3. Unless I’m misunderstanding what is being said here, I really in the strongest possible terms disagree with the idea that *what you would have done if the world were a different place is relevant to what you should do given the state of the world right now*

Suppose I’m driving a jalopy down a mountain road, and I lose my brakes… How is what I would have done if I were driving a well tuned porsche relevant to my current strategy?

The only thing I can think of that might make this meaningful is that “what you would have done with a different hand” is somehow relevant because it’s symmetric to “what would my opponent do here given that he has some hand but *doesn’t have my hand*”

If that’s the point, then yes I see the point, but I think it’s mistaken to hide this insight underneath some language that pretends to make *other things that aren’t going on* relevant to your current choice.

What’s relevant then is “across all possible hands that my opponent has (which is everything but my hand) what might my opponent do” which makes it much clearer what is going on.

• David says:

Daniel, you reiterate the point of the example. The question in poker is whether a bet is a good idea based on both maximizing your odds and not revealing your likely cards in betting patterns. If you play for a long time, this will determine your outcome, since you will get all sorts of hands. In your example, it’s as if the hill might offer a gentle curve where you can slow, or might throw a tree across the road, depending on how heavy it thinks your car is and how well you steer. So figuring out what you would have done in any car is important to play as if you have any car, and take control of the betting.

Figuring odds of a given hand is easy for professional players; playing to manipulate the reads of others, by pretending to have different hands while still knowing their own odds, is the real skill. So knowing how to bet as if you had something else is what takes you from someone playing your hand, to someone playing the room – i.e. a winning player, over enough iterations. Yet it is still hard for players to get off of exactly your thought – that you’re just analyzing your particular hand, not your hand as if it was a subset of potential hands that would have resulted in the same patterns thus far.

• > The question in poker is whether a bet is a good idea based on both maximizing your odds and not revealing your likely cards in betting patterns

So, your play is determined by *your actual cards* and *historical background information about what happened in the past, and your model of the opponent’s beliefs*

I think what you’re trying to say is that, computationally speaking, you probably need to model the other person’s beliefs based on something like simulations of what might have happened if other hands had been dealt. I can see that as a reasonable computational approximation to the less tractable problem of directly modeling the other player’s brain…

But I don’t think it has anything to do with the garden of forking paths.

The garden of forking paths shows us an example of why p values don’t mean what people want them to mean. p values still mean what they always meant though.

Anyway, I have basically zero interest in poker, but I accept that it’s a repeated game with an active adversary who builds a mental model of your strategy, and that leaking information about that strategy to that adversary is problematic. If simulating alternatives is a tractable way to compute an optimal low-information-leakage strategy in poker, fine…

but I fail to see the analogy with the universe. We collect information from the universe, and then we need to make an inference. If we choose a t test, it’s not like the universe will magically alter our lab notebook, or even worse, start giving us different results in next week’s experiment, compared to if we used a nonparametric test…

• Angus says:

The term they use is “range”, i.e. the set of plausible hands you could have in a situation. If you only play the hand you are dealt you will have a very narrow range, so it will be hard to bluff people with weak hands or value bet with strong ones. When you are trying to extract value you want to offer your opponent some value against the weaker side of your range. Where I’ve always thought this starts to fail is that there rarely seems to be any distribution to the range; everything in it seems to be considered equally likely to occur.

So you get people starting to make decisions like “70% of the time in this situation I should call but 30% of the time I should raise so that my range in these situations is more uncertain and I can win these pots when I turn up with a different hand too.”

A simpler way to think about it is just with bluffing. If you want to bluff you need to play realistically as if you have some other hand, not do something totally different from your normal strategy and be caught obviously bluffing.

4. anon says:

Hi Daniel,

Perhaps this example will suffice to illustrate the point. Suppose we play a one round game in which we are dealt AA with 50% probability, and 72 with 50% probability. Our opponent always has QQ. There is \$50 in the pot, and each player has \$50 in their stack. Our opponent always checks first. Then it’s our turn to act, and we can decide to check or bet the remaining \$50 (assume only one bet size).

Now suppose we are dealt 72. Do we bet or check? We are risking \$50 to win a \$50 pot, so if the opponent folds > 50% of the time, it is a profitable bet, otherwise it is not. So one could argue, it is only necessary to consider just our current hand, no reasoning concerning what we would have done with AA is involved.

The only hole in this logic is… how do we really know the true probability that our opponent will fold? Maybe some superhuman machine using bayesian reasoning could do it… but it is not easy.

Alternatively, one can construct a strategy that is ROBUST to different possible fold/call probabilities of our opponent. We can decide to say, bet with Aces 100% of the time and bet with 72 50% of the time. With this strategy, we will achieve the same expected value no matter what our opponent’s strategy is.

An analogy in statistics would be the bayes risk and frequentist risk. See: https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture3.pdf
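The robustness claim in this comment can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: AA always beats QQ and 72 always loses to QQ at showdown, payoffs are counted as chips won relative to our stack before acting, and the function name `strategy_ev` and its parameters are invented here rather than taken from the comment.

```python
def strategy_ev(bet_aa, bet_72, call_prob, p_aa=0.5, pot=50.0, bet=50.0):
    """Expected chips won per deal for a (possibly mixed) betting strategy.

    bet_aa, bet_72: probability of betting when holding AA or 72.
    call_prob: probability that the opponent (always QQ) calls a bet.
    Assumption: AA always wins and 72 always loses at showdown.
    """
    # Betting AA: win pot + bet when called, win the pot when the opponent folds.
    aa_bet = call_prob * (pot + bet) + (1 - call_prob) * pot
    # Checking AA: go to showdown and win the pot.
    aa_check = pot
    # Betting 72 (a bluff): lose the bet when called, win the pot on a fold.
    bluff = call_prob * (-bet) + (1 - call_prob) * pot
    # Checking 72: go to showdown and win nothing.
    give_up = 0.0

    ev_aa = bet_aa * aa_bet + (1 - bet_aa) * aa_check
    ev_72 = bet_72 * bluff + (1 - bet_72) * give_up
    return p_aa * ev_aa + (1 - p_aa) * ev_72


call_probs = [0.0, 0.25, 0.5, 0.75, 1.0]
# Bet AA always, bet 72 half the time: the same value for every call probability.
robust = [strategy_ev(1.0, 0.5, c) for c in call_probs]
# Bet everything: the value now depends heavily on how often the opponent calls.
always_bluff = [strategy_ev(1.0, 1.0, c) for c in call_probs]
```

Under these assumptions the mixed strategy earns 37.5 per deal whatever the opponent does, while the always-bluff strategy ranges from 50 (opponent never calls) down to 25 (opponent always calls).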

• > We can decide to say, bet with Aces 100% of the time and bet with 72 50% of the time.

So, your play is *completely determined* by the *actual hand you have* (and a random number generator)

• anon says:

Well, the point I was trying to show was that what you do with 72 depends on what you do if you had been dealt Aces.

Suppose for instance that you never bet AA. Then you should also never bet 72. Because in repeated instances of the game, you will be heavily exploited by an opponent that calls with a high percentage. What you do with 72 depends on what you would do if you had gotten AA.

The value of the strategy/procedure is analyzed by looking at its expected value under repeated sampling (i.e. being dealt different cards).

• anon says:

I guess what I didn’t make clear enough in my initial example is that the decision to bet 72 50% of the time is derived from the fact that we’re deciding to bet AA 100% of the time. If, for instance, we bet a different % of AA or got dealt fewer AAs, the frequency with which we should bet 72 would change.
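The dependence described here can be written down directly. The sketch below assumes the usual indifference argument: the caller risks the bet to win the pot plus the bet, so a call breaks even when there are `bet / (pot + bet)` bluffs for every value bet. The function name and signature are hypothetical, and showdown outcomes are assumed certain as in the toy game.

```python
def indifference_bluff_frequency(p_aa, bet_aa, pot=50.0, bet=50.0):
    """Probability of betting 72 that leaves the caller indifferent.

    p_aa: probability of being dealt AA (otherwise 72).
    bet_aa: probability of betting when holding AA.
    """
    p_72 = 1.0 - p_aa
    if p_72 <= 0.0:
        raise ValueError("there must be some chance of holding 72")
    # Overall frequency of value bets across all deals.
    value_bets = p_aa * bet_aa
    # Bluffs allowed per value bet so that calling has zero expected gain.
    bluffs_per_value_bet = bet / (pot + bet)
    # Convert the required overall bluff frequency into a per-hand frequency.
    return min(1.0, value_bets * bluffs_per_value_bet / p_72)


original_game = indifference_bluff_frequency(0.5, 1.0)   # AA half the time, always bet
never_bet_aces = indifference_bluff_frequency(0.5, 0.0)  # no value bets, so no bluffs
fewer_aces = indifference_bluff_frequency(0.25, 1.0)     # AA dealt less often
```

With the numbers from the original example this gives 0.5; never betting AA drives the bluffing frequency to 0, and being dealt fewer aces lowers it.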

• I think you’re talking about how you’d develop a strategy, whereas I’m talking about what information is needed to make a decision given a strategy and the current hand.

• anon says:

I agree that “given a strategy” all you need to know is what cards you have. But typically in the context of games, “decision” doesn’t refer to the physical act of, say, moving chips towards the center. It refers to the process of coming to a conclusion about whether making a given move is a “good” move (“developing” a strategy as you said is a good way of putting it). If we were already told a strategy to follow, then there would be no thinking left to do.

So the traditional way to go about analyzing whether performing action {a} with board state (observable information) {x} is “good,” would be to come up with some posterior distribution of the opponent’s likelihood of taking each given action (in the initial example I gave, this would be the probability of calling). So a play is good if E_S[f(a,x,S)] is large, where S is the opponent’s strategy and we integrate over our posterior distribution of S. This corresponds to the bayes risk.

The alternative approach to determining whether performing action {a} with state {x} is seeing if the pair (x,a) is an element of a “good” strategy d(x), where “good” means that E_X[f(d(X),X,s)] is large for many {s}, where in this case we integrate over X (the cards), and {s} (the opponent strategy) is fixed. This corresponds to the frequentist risk. The specific action is not analyzed in isolation, but we instead analyze the strategy d(x) it is part of. The strategy d(x) (possibly mixed) by definition consists of all the actions we would have taken had we been dealt different hands.
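The two ways of judging a play can be put side by side on the AA/72 toy game from earlier in the thread. This is a rough sketch, not the commenter's own code: the payoff table assumes AA always wins and 72 always loses at showdown, the opponent's strategy S is reduced to a single call probability, and the names `bayes_value` and `frequentist_value` are made up for illustration.

```python
POT, BET = 50.0, 50.0


def payoff(action, hand, opponent_calls):
    """f(a, x, s) for a single deal: chips won with this action and hand."""
    if action == "check":
        return POT if hand == "AA" else 0.0
    if not opponent_calls:
        return POT  # the bet takes the pot uncontested
    return POT + BET if hand == "AA" else -BET


def expected_payoff(action, hand, call_prob):
    """Average payoff against an opponent who calls with probability call_prob."""
    return (call_prob * payoff(action, hand, True)
            + (1 - call_prob) * payoff(action, hand, False))


def bayes_value(action, hand, belief):
    """E_S[f(a, x, S)]: average over our belief about the opponent.

    belief maps each candidate call probability to its posterior weight.
    """
    return sum(w * expected_payoff(action, hand, c) for c, w in belief.items())


def frequentist_value(strategy, call_prob, hand_probs):
    """E_X[f(d(X), X, s)]: average over the deal, opponent strategy held fixed.

    strategy maps each hand to the probability of betting it.
    """
    total = 0.0
    for hand, p_hand in hand_probs.items():
        p_bet = strategy[hand]
        total += p_hand * (p_bet * expected_payoff("bet", hand, call_prob)
                           + (1 - p_bet) * expected_payoff("check", hand, call_prob))
    return total


# Judging the single action "bet 72" needs a belief about the opponent.
belief = {0.2: 0.5, 0.6: 0.5}
bluff_value = bayes_value("bet", "72", belief)

# Judging the whole strategy needs the action for every hand, but no belief.
deal = {"AA": 0.5, "72": 0.5}
mixed = {"AA": 1.0, "72": 0.5}
values_by_opponent = [frequentist_value(mixed, c, deal) for c in (0.0, 0.5, 1.0)]
```

The first evaluation changes whenever the belief changes; the second is the same for every opponent in this example, but it cannot be computed without specifying what would have been done with AA.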

5. I don’t know. The cachet that poker has seems overblown to me. It’s a game after all. There seems to be something sexy about playing poker in casinos, which the James Bond movies made famous. Branding.

• D Kane says:

Bond is (mostly) famous for baccarat, which appeared in many movies and (all?) the books. I *think* that poker only showed up once, in Casino Royale (2006).

• I didn’t mean that Bond literally made it famous. The James Bond series, imo, made that highly stylized casino culture famous thru branding. Casino Royale did so directly.

I think playing poker [betting more generally] has weak explanatory power in the contexts where I have seen it used. That is to say, the broad lessons that issue from explanations of strategy can be harnessed through any activity. Look, don’t some rely on Gosset? I could be wrong. It is an area that piques my interest. But I think intuitively I’ll be forking down the wrong path. I analogize it to the experimentation with decision software. So this is where I think forays into ‘cognitive psychology’ may be quite useful.

• For some reason, my earlier response to D Kane did not post.

6. anon says:

As for the connection with the “garden of forking paths” concept, I apologize if I’m misunderstanding, but quoting from: http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf

“It might seem unfair that we are criticizing published papers based on a claim about what they would have done had the data been different. But this is the (somewhat paradoxical) nature of frequentist reasoning: if you accept the concept of the p-value, you have to respect the legitimacy of modeling what would have been done under alternative data.”

So the mapping here is
(research : poker)
research methodology : strategy
p-value : winrate of strategy
alternative data : deck being arranged in a different way (52! possibilities)
what would have been done: what action would have been taken (bet, fold, check)

So if the researchers’ methodology is such that if nature had “dealt” them a different dataset they would have performed a different sequence of tests and transformations, then we should average over all these potential datasets to get a proper p-value.

In the same way, the value of a poker strategy can be analyzed by averaging the winnings over all possible initial arrangements of the deck (which includes being dealt different hands).

• > So if the researchers’ methodology is such that if nature had “dealt” them a different dataset they would have performed a different sequence of tests and transformations, then we should average over all these potential datasets to get a proper p-value.

NO NO NO NO.

A *proper* p value tells you whether the data you have is likely to have come out of a particular random number generator. It does this correctly *regardless of any forking paths baloney*

The fact is just that *having gotten a small p value* does not tell you the kind of information that a typical researcher would like it to.

The point of the garden of forking paths is that *it shows just how wrong the typical use of p-values to do something they ARE NOT CAPABLE OF DOING is*

the garden of forking paths shows you:

p( The researcher knows what they’re talking about | p = 0.02) is basically very small because having gotten 0.02 on some test is *uninformative with respect to the explanation for the phenomenon* there are plenty of ways to make that happen.

but p values were never supposed to be indexes of whether The researcher knows what they’re talking about.

• anon says:

Perhaps I misphrased my comment, although I appreciate your post. I just meant that if you were looking at things like the type I and type II errors of a procedure you should account for data-dependent procedures (in different datasets).

• anon says:

I probably should have phrased this as the “threshold for significance” should be adjusted to account for all the possible datasets, would you agree with that? I definitely didn’t mean to literally “average” p-values, I meant that the distribution of p-values under the null hypothesis would be altered due to data-dependent tests/transformations.

• No, I just think *don’t use p values to decide anything that p values don’t help you decide*

if you literally want to know if a particular random number generator might have generated your dataset, then use p values. This is a useful way to detect data anomalies or filter data for example.

If you want to know anything else… Do some inference using a logical system for filtering out post-data possibilities from pre-data possibilities. (ie. bayesian decision theory)

• anon says:

That’s very interesting, thanks for taking the time to respond. I apologize if I came off as combative at all! Just trying to understand your point of view.

Taking the notation and content directly from the forking paths paper, let phi correspond to the series of choices and decisions that a researcher would make in a data analysis. Let y be the dataset, and let T() be the statistic.

We can view phi as a map, phi: {set of possible datasets} -> {set of possible choices and decisions}. Then, the resulting statistic is T(y; phi(y)).

What follows is not in the paper, so all mistakes are my own of course. This is just my interpretation. Let’s say we have a decision rule D (output could be to publish or “believe” a result, or anything else) that uses the statistic T as an input. Let us say there is also a loss function L that depends upon the true state of nature, theta.

Then we might want to minimize: E_Y[ L(D(T(Y; phi(Y))), theta) ; theta ], for some particular fixed theta. This expectation is taken over Y (parametrized by theta). But to compute this expectation, we need to know the entire mapping phi. Looking at the observed realization, phi(y) for some particular y, does not fully characterize phi unless phi was pre-registered.

So the analogy I was trying to make was between phi in this setting, and a strategy d:{deck arrangements} -> {sequence of actions} in the context of poker. To judge the quality of a strategy d we can’t base it only on one realization d(x).

• Why would you want to maximize the expectation of gain (or minimize loss) over *all alternative datasets* Y when the only Y you will ever have is already the one you have and know perfectly right now? Unless you plan to enter a quantum many-worlds teleporter and be transported to a different world in which the data you collected last week, Y, is something new and different, and you need to know what to do once you get to that world…

What you want to maximize is the gain relative to *calculating future actions as if the true value of Theta is Th across all plausible Th values post-data*

And the class of decision rules which is optimal has been proven by Wald in 1947 to be the Bayesian Decision Rules (or rules equivalent to those). https://projecteuclid.org/euclid.aoms/1177730345

Choose a decision rule D from the optimal class, which includes informing it with an informative prior over theta, and then plug in your data Y and your loss (or gain) L and choose the Th value that maximizes your gain, or minimizes loss or whatever. If you do this honestly (in terms of encoding your real information into the informative prior, and choosing a likelihood that reflects real scientific knowledge about the mechanisms known to affect the science) then you can be sure that you’re doing the best you can to solve this problem.

If you try to hack p values using insights from the “garden of forking paths” so that they somehow work differently, YOU WILL GO WRONG.

the insights of “garden of forking paths” should be taken to show you that Neyman-Pearson decisions are not a good idea, not an insight to give you ideas about how to hack p values until Neyman-Pearson becomes a good idea.
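As a concrete picture of the recipe in this comment (prior, likelihood, loss, then pick the action with the best posterior expected outcome), here is a minimal sketch on a grid of theta values. Everything specific in it is an assumption made for illustration: the normal prior and likelihood, the three data points, the two loss functions, and the helper names `grid_posterior` and `bayes_action`.

```python
from statistics import NormalDist


def grid_posterior(theta_grid, prior_weights, data, noise_sd=1.0):
    """Posterior weights on a grid of theta values for y_i ~ Normal(theta, noise_sd)."""
    unnormalized = []
    for theta, weight in zip(theta_grid, prior_weights):
        likelihood = 1.0
        for y in data:
            likelihood *= NormalDist(theta, noise_sd).pdf(y)
        unnormalized.append(weight * likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def bayes_action(actions, theta_grid, posterior, loss):
    """Return the action with the smallest posterior expected loss."""
    def expected_loss(action):
        return sum(p * loss(action, theta) for theta, p in zip(theta_grid, posterior))
    return min(actions, key=expected_loss)


def squared_loss(action, theta):
    return (action - theta) ** 2


def overestimate_is_costly(action, theta):
    # Hypothetical asymmetric loss: guessing too high costs three times as much.
    return 3 * (action - theta) if action > theta else (theta - action)


theta_grid = [i / 10 for i in range(-30, 31)]
prior = [NormalDist(0, 1).pdf(t) for t in theta_grid]  # illustrative prior on theta
data = [0.8, 1.1, 0.5]                                 # made-up observations
posterior = grid_posterior(theta_grid, prior, data)

symmetric_choice = bayes_action(theta_grid, theta_grid, posterior, squared_loss)
cautious_choice = bayes_action(theta_grid, theta_grid, posterior, overestimate_is_costly)
```

The same posterior leads to different choices under the two losses, which is the sense in which the decision depends on utilities as well as on the data actually observed.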

• Carlos Ungil says:

> A *proper* p value tells you whether the data you have is likely to have come out of a particular random number generator.

The particular random number generator is not the one producing the data, it is the one producing the result of the analysis. The p-value is about the distribution of a statistic, which is a function of the data.

The *proper* p-value would be the one derived from the *actual* function of the data. If for some data the actual value of the statistic would be 666, because if you see that data you change the model, you shouldn’t calculate the p-value assuming that the statistic would have been 42 if you didn’t change the model.

• Carlos Ungil says:

To elaborate a bit: an easy way to look at the issue (conceptually, because in practice the problem is not well defined) is to consider that the statistic of interest is the naive “nominal” p-value. This is a function of the data which gives the p-value calculated when we see that particular dataset. The calculation assumes that the same analysis would be carried out for different datasets. If this is not true, alternative data results in an alternative analysis and the p-values are inconsistent.

The nominal p-value will not be distributed uniformly conditional on the null hypothesis being true. It may be skewed towards low values, with a marked preference for barely significant results (p<0.05) over barely insignificant results. If the distribution of the nominal p-value over all the datasets that can be generated under the null is known (not realistic in practice) a proper p-value (uniformly distributed) can be calculated for the data at hand.

• Anoneuoid says:

> The nominal p-value will not be distributed uniformly conditional on the null hypothesis being true

Yes, it will. This is probably the usual confusion because in stats 101 you are taught to only care about a parameter value like the difference between means and call the rest “assumptions”. Assuming mu1 – mu2 = 0 or whatever is simply one more assumption.

There is no reason to privilege that assumption over the others except to come up with ever more levels of misdirection and confusion like this “nominal p-value” concept.

• Carlos Ungil says:

We may be talking about different things. I mean that it won’t be distributed uniformly over actual repetitions of the experiment if the analysis of the data varies depending on the data observed in ways which are not accounted for when the p-value is calculated.

If my experiments consist in generating some data and finding an analysis that gives a p-value below 0.05, the p-values of my experiments won’t be uniformly distributed. If I make 100 experiments and the null hypothesis is true each and every time, the p-value will be below 0.05 100% of the time.

If the null hypothesis is always true but all the assumptions in the analysis are correct, I will get p<0.05 only about 5% of the time.

• Anoneuoid says:

> If my experiments consist in generating some data and finding an analysis that gives a p-value below 0.05, the p-values of my experiments won’t be uniformly distributed.

Right, because the null hypothesis is false*. It really is as simple as that.

It is the scientist’s job to derive a null model they think is true from assumptions they believe more or less hold. Did they include “trying out all sorts of different analyses” when deriving the null model? In your case, no.

I’m not quite sure what assumption would be violated by trying out all sorts of different analyses (as opposed to data subsets, etc), but probably it’s again somehow the iid assumption. Either way, you don’t really need to figure out the precise mathematical/logical flaw in that case. It’d be easy enough to run monte carlo simulations of the analysis pipeline to get the null distribution of whatever statistic.

> If I make 100 experiments and the null hypothesis is true each and every time, the p-value will be below 0.05 100% of the time.

This is impossible if you use “null hypothesis” to actually refer to the model being tested. If you use it to refer to something else (which is very common, even standard) all bets are off.

> If the null hypothesis is always true but all the assumptions in the analysis are correct, I will get p < 0.05 only about 5% of the time.

This is redundant: “Null hypothesis is true” == “all the assumptions in the analysis are correct”

*Actually I think null model is a much better term now since null hypothesis has been ruined by people using it to incorrectly refer to the value of a single parameter of the model. “Model” conveys something greater than a single value.
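For a pipeline that is fully written down, the Monte Carlo idea mentioned in this comment might look like the sketch below. The pipeline is a deliberately simple invention: the analyst measures five independent outcomes, runs a z-test on each, and reports the smallest p-value. The function names, sample sizes, and seed are all assumptions; the dispute in the thread is about whether a real analyst's choices could ever be specified this completely.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def nominal_p(data_by_outcome):
    """Two-sided z-test p-value (known sd = 1) per outcome; report the smallest."""
    p_values = []
    for sample in data_by_outcome:
        z = fmean(sample) * len(sample) ** 0.5
        p_values.append(2 * (1 - STD_NORMAL.cdf(abs(z))))
    return min(p_values)


def simulate_null_pipeline(n_sims=5000, n_outcomes=5, n_obs=10, seed=0):
    """Run the whole pipeline on datasets generated with no true effects."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_outcomes)]
        results.append(nominal_p(data))
    return results


def calibrated_p(observed_nominal_p, null_nominal_ps):
    """Share of null simulations whose reported p-value is at least as small."""
    hits = sum(p <= observed_nominal_p for p in null_nominal_ps)
    return hits / len(null_nominal_ps)


null_ps = simulate_null_pipeline()
# How often this pipeline reports p <= 0.05 when nothing is going on.
share_below_005 = calibrated_p(0.05, null_ps)
```

With five independent tests the reported p-value falls below 0.05 with probability 1 - 0.95^5, roughly 23%, so the simulated share should land near that figure rather than near 5%.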

• Carlos Ungil says:

> It’d be easy enough to run monte carlo simulations of the analysis pipeline to get the null distribution of whatever statistic.

It’s not easy at all (I agree with Daniel about this being impossible in practice). But you seem to agree that a “proper” p-value calculation has to take into account the actual output that would be produced under different realizations of the data. Otherwise you get a p-value that is not correct (because we didn’t model correctly the full analysis pipeline). This incorrect p-value is what is usually called the “nominal” p-value.

• I disagree with this characterization of “the p value is incorrect”. The problem is a desire for the p value to mean something it doesn’t.

Suppose you collect data x1, decide to do test T, and get p = 0.02, and suppose further that the null model works well to describe the data generating process.

if you collect data x2..xn and continue to do test T, because the null model is adequate, you will get a p value that is 0.02 or less only 2% of the time in replications of this experiment *testing the same model*

Yet, it’s correct that *you are not going to replicate this experiment very many times* and you chose test T because you thought given x1 that it was a relevant test, in other words, it seemed to be “non null” in this data.

If you had gotten dataset x2…xn since the null model is adequate, you would have seen that test T was likely to produce a p value say 0.22 and not chosen to use test T.

Your p_T=0.02 result is small *precisely because you noticed it seemed anomalous and chose a test T that highlighted this fact*

Now, if you insist that N-P testing should “work” and that you should only get wrong results 5% of the time when you choose p=0.05 in your test… REGARDLESS of how you choose your test T, then I suspect we can prove mathematically that *this is not possible without specific control over the method of choosing test T*

is this a “flaw in the p value?” no. the p value is still giving you the property of only 2% of replicated datasets x2…xn will have p less than 0.02 when the same test is used on repeated data collection.

And that’s all p values *were ever supposed to do*

The problem is N-P decision making assumes that this p value is also equal to some kind of “magical decision error probability p*” and people have hit on this as the thing they desperately want p values to provide them.

they don’t, they can’t, they never will, let’s focus on this fact and encourage people to do something else.

In particular, let’s encourage people to use decision rules from the Wald Complete Class Theorem: Bayesian Decision Rules with real-world Utilities.

Then we really can say: given the assumptions I’m willing to make about the system (prior and likelihood) and the goodness of different errors we might make, the value of the unknown quantity that maximizes my expected goodness is theta*

• >statistic of interest is the naive “nominal” p-value

The p value is well defined and deterministic once you define your RNG. You calculate it, and you’re correct.

What’s not correct, is the assumption that you can use the p value for decision making. Neyman-Pearson type 1 and type 2 errors etc.

>The calculation assumes that the same analysis would be carried out for different datasets. If this is not true, alternative data results in an alternative analysis and the p-values are inconsistent.

No, the calculation of the p value doesn’t assume this, the *use of the p value to make decisions* assumes that.

Neyman-Pearson decision making based on p values is inherently broken. The solution is to stop using p values in this way.

Unfortunately, what I’m seeing is that Andrew’s garden of forking paths, which is really about *how broken the decision making based on p values is* is taken as a way to somehow *try to fix p values so they work* instead of *abandoning p values in favor of a better system for decision making*

We already have a provably optimal class of decision making rules: Bayesian Decision Theory, and the decisions they make are always *far far better* because they take into account the utilities, whereas p values don’t.

So ultimately, this is what I’m pushing back about. People think that somehow they can adjust p values by taking into account alternative worlds and the alternative p values that would have been calculated if the world were different… and turn p values into good decision making tools. p values are only ever good for 1 kind of decision making: deciding whether to treat a particular data point or set as if a particular RNG might have produced it, or if you need to come up with a different model.

In typical usage, failing to reject the null hypothesis is the informative outcome: then you can treat this data as if it were noise of the particular type you assumed.

When you reject the null hypothesis, all it means is *you don’t know what happened* but it wasn’t the thing you thought might have been happening.

• Carlos Ungil says:

> The p value is well defined and deterministic once you define your RNG. You calculate it, and you’re correct.

You’re correct only as long as your definition of the p-value is correct. I’m not sure what is “your RNG”. If you think the RNG is a model of the data generation process, you’re only half there. You need to model the “output” generation process.

The output is the result of applying a function to the data. The distribution of this statistic (conditional on the null hypothesis being true) is the result combining this function with the distribution of the data (conditional on H0).

The p-value that you calculate once you define your RNG will only be well defined if you’re using the correct function. The calculation involves calculating the statistic conditional on alternative datasets, so it won’t be correct if it doesn’t reflect what would have been done conditional on alternative datasets.

Let’s say you’re measuring some unknown quantity alpha with a device that is known to give a measurement (let’s call it z) which is unbiased and normally distributed, with variance one.

If you observe z=1.7 you can calculate a p-value < 0.05 for the null hypothesis alpha < 0. But if I know that you would have chosen the null hypothesis alpha > 0 had you observed a negative value of z, then I know that your p-value cannot be accepted at face value. In that case, you’re guaranteed to get p-values < 0.5 only. And you would get p < 0.05 more often than one in twenty.

I’m not suggesting that one can fix p-values, the point is that p-values are broken if they rely on unrealistic assumptions about how the analysis would have been done if the data had been different.

• Carlos Ungil says:

If you observe z=1.7 you can calculate a p-value below 0.05 for the null hypothesis “alpha is below zero”.

But if I know that you would have chosen the null hypothesis “alpha is above zero” had you observed a negative value of z, then I know that your p-value cannot be accepted at face value. In that case, you’re guaranteed to get p-values below 0.5 only. And you would get p below 0.05 more often than one in twenty.
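The measurement example can be simulated directly. In the sketch below the true alpha is assumed to be exactly zero (the boundary shared by both candidate nulls), z is a single standard normal measurement, and the two function names are invented for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def fixed_null_p(z):
    """p-value for the null 'alpha is below zero', chosen before seeing z."""
    return 1 - STD_NORMAL.cdf(z)


def sign_chosen_null_p(z):
    """p-value when the direction of the null is picked after seeing the sign of z."""
    # z > 0: test "alpha is below zero"; z < 0: test "alpha is above zero".
    return 1 - STD_NORMAL.cdf(abs(z))


rng = random.Random(1)
zs = [rng.gauss(0, 1) for _ in range(100_000)]  # measurements with alpha = 0 (assumed)

fixed_ps = [fixed_null_p(z) for z in zs]
chosen_ps = [sign_chosen_null_p(z) for z in zs]

fixed_rate = sum(p < 0.05 for p in fixed_ps) / len(zs)
chosen_rate = sum(p < 0.05 for p in chosen_ps) / len(zs)
largest_chosen_p = max(chosen_ps)
```

With the null fixed in advance, p < 0.05 turns up about one time in twenty; with the null chosen by the sign of z, no p-value exceeds 0.5 and p < 0.05 turns up about one time in ten.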

• > You’re correct only as long as your definition of the p-value is correct. I’m not sure what is “your RNG”. If you think the RNG is a model of the data generation process, you’re only half there. You need to model the “output” generation process.

> The output is the result of applying a function to the data. The distribution of this statistic (conditional on the null hypothesis being true) is the result combining this function with the distribution of the data (conditional on H0).

For the purpose of this exercise let’s call the RNG the one that is relevant for the p value. If you are interested in one data point, then the statistic function is f(x) -> x[i] and the RNG of interest is the one that generated the data, whereas if you’re interested in a t test the statistic is f(x) -> mean(x)/sd(x) and the RNG is the one that represents the distribution of mean(x)/sd(x), which is approximately normal for largish n… etc

>The calculation involves calculating the statistic conditional on alternative datasets

Yes, but *not* alternative hypotheses (RNGs). Because every well defined p value is conditional on a choice of RNG. The problem is people want p values to mean something they don’t: they want an “unconditional” p value for which it doesn’t matter what modeling assumptions they make…

and it *just doesn’t exist*

the garden of forking paths concept *highlights this nonexistence* it doesn’t *provide a method by which you can tweak p values to mean what you want them to mean*

The problem is people want “meta” p values that give the following

Fr(observe data x, choose test T, calculate p value p=P(T(x))) = p

This is simply a mathematically UNDEFINED thing without specifying a very specific random process for “choose test T”. This is the “garden of forking paths” concept. But it’s just a tool for getting people to understand what’s wrong with making decisions based on p values. The frequency with which p values come up below 0.05, not in repeated application of the p value procedure to a specific type of experiment, but in repeated application of the generic “let’s do an experiment on some topic and see what happens”, is just not equal to the output of any specific p value. Making it equal requires predicting the future frequencies *of a brain’s thoughts*

if there even exists such a random process it is necessarily both *personal* (ie. specific to the psychology of a given researcher) and *completely unverifiable*. Its description length would be huge (imagine writing a piece of code that implements the psychology of the researcher, it’d be terabytes upon terabytes of code), nor would having its description even help anyone do anything.

the point is p values don’t tell you what you want them to tell you, people want them to tell them “how often would this researcher be wrong” but all they tell you is “how often would this particular analysis be wrong if it were carried out on the output of this specific random number generator”

• I guess my most recent post highlights a way in which poker is conceptually like the garden of forking paths idea…

In poker you *really are* trying to build a model of the psychology of the player, either opponent, or potentially yourself, in a VERY restricted domain, and so you’d like to for example adopt a psychology that leaks as little information as possible, and you’d like to consider whether your current play is easily detectable as unusual compared to average, and determine how to minimize that so that your opponent can’t figure out very much information about your hand…

Perhaps that’s the original insight, but I still emphasize that in general science… the model of the psychology of the researcher *is of no help to anyone* in understanding how the actual scientific phenomenon under study works.

• Anonymous says:

Daniel, you may want to re-read the comment to which you replied “NO NO NO NO”.

I don’t think that by saying that “So if the researchers’ methodology is such that if nature had “dealt” them a different dataset they would have performed a different sequence of tests and transformations, then we should average over all these potential datasets to get a proper p-value.” anon was suggesting a realistic procedure to fix p-values.

I don’t think we can model what the behaviour of each player at a poker table will be in any possible situation either, by the way.

• Well he said the p value wouldn’t be “proper”

but the p value is totally proper, in the sense that its mathematically defined meaning would still hold, it’s just a p value relevant to a different hypothesis than you would have had if you had gotten some different data.

My point was that p values, as they stand today, are totally *proper* it’s just the *use to which they’re put* in terms of Neyman-Pearson decision making that is improper.

By claiming that the p value “isn’t proper” you implicitly say that Neyman-Pearson is the *raison d’etre* for p values and the failure to meet the decision frequency means that the p values are broken. But in reality, it’s just N-P decision making that’s broken, and p values do exactly what they’re supposed to do already.

• Carlos Ungil says:

[the anonymous above was also me, the following comment has been written before I saw your latest reply]

Ideal gases don’t exist either, but considering their properties can be useful.

A p-value calculation assumes, among many other things, that the same statistic of the data would be computed for every possible value of the data. If this assumption is wrong, the p-value is not correct. As with every assumption that might not be exactly true, the impact depends on how wrong it is. “The null hypothesis is never true”, someone may say. Ok, but if it’s approximately true it doesn’t matter that much.

No p-value ever is correct, because every model is wrong (your RNG will never be correct). If the model assumes normal errors for measurements, the p-value calculation assumes that you would be happy to calculate the statistic of interest from a dataset containing negative heart rates, for example. Probably in that case you would leave out or redo the measurements, so the data-independence assumption is not really correct. But this scenario is so unlikely that you don’t care.

Taking as the statistic of interest the p-value that would be obtained from the analysis of each different dataset allows us to define the meta p-value, as you called it. The point is not that it can be used to fix the p-value, the point is that if there are reasons to think that this meta p-value would be too different from the calculated p-value (because it’s not so unlikely that things would have been done differently) then the p-value is not even approximately correct.

• https://webhome.phy.duke.edu/~rgb/General/dieharder.php

Is a great suite of RNG testers. Every p value calculated there takes on its mathematically defined meaning:

Fr(T(x*) more extreme than T(x) for x the current dataset, and x* coming from future draws from the RNG)

That’s all p values are ever supposed to mean. The fact that you would choose a different RNG to test if you had a different x in your experiment is problematic *for the frequency with which the N-P decision rule will give a good answer* but *not* problematic for the actual definition of the given p value, which is always about *future draws from the same given RNG*
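A rough Python sketch of that definition, taking “more extreme” to mean “greater than or equal to”. The statistic mean(x)/sd(x) comes from earlier in the thread; the RNG being tested (20 independent standard normal draws) and all names are assumptions for illustration.

```python
import random
from statistics import mean, stdev


def t_like(x):
    """The statistic mentioned in the thread: mean(x) / sd(x)."""
    return mean(x) / stdev(x)


def null_rng(rng, n=20):
    """Hypothetical RNG under test: n independent standard normal draws."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def mc_p_value(x, statistic, draw_dataset, n_draws=5000, seed=0):
    """Fr(T(x*) >= T(x)) for x fixed and x* drawn repeatedly from the same RNG."""
    rng = random.Random(seed)
    observed = statistic(x)
    exceed = sum(statistic(draw_dataset(rng)) >= observed for _ in range(n_draws))
    return exceed / n_draws


if __name__ == "__main__":
    data_rng = random.Random(42)
    x = [data_rng.gauss(0.5, 1.0) for _ in range(20)]  # made-up "current dataset"
    print(mc_p_value(x, t_like, null_rng))
```

Nothing in `mc_p_value` knows why this RNG or this statistic was chosen; the number it returns is only about future draws from `null_rng`.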

People can’t seem to get past the fact that p values don’t mean what they want them to mean: frequency with which the N-P decision rule will fail

• Carlos Ungil says:

> That’s all p values are ever supposed to mean.

P-values are based on the distribution of the statistic of interest over different realizations of the data. You say that it is completely irrelevant that if we actually had a different realization of the data we wouldn’t even think about calculating this statistic, because the hypothetical realizations of the data are just a calculation device existing in the context of the data at hand and this is all p-values are supposed to mean.

One could gain some insight by considering that in some cases the sampling properties of p-values might also extend to the real world (instead of being restricted to the context of the realized data), in the sense that the same statistic would be calculated for different realizations of the data. A p-value with the correct sampling properties in the broader context (potential replications of the data not restricted to the analysis being carried out for a particular value of the data set) is stronger. If you don’t want to call it a “proper” p-value you might still want to find a name for this concept which is not completely useless.

An analogy: if you’re looking at investment strategies you may want to look at the track record (the performance of the strategy in the past).

1) you may be given the performance of a fund that has been applying this strategy since 2000.
2) you may be given the performance of a backtest (a simulation) of this strategy since 2000.

Both conform to our definition of track record (the performance of the strategy in the past), but would you consider them equally relevant?

• Carlos Ungil says:

I forgot: the fixation on Neyman-Pearson decision theory seems a bit misplaced, p-values are Fisher’s measure of strength of evidence.

• Fisher also invented fiducial inference and did a lot of other wacky stuff. In practice, what people subject to the forking paths problem do is N-P type decision making: act as if my favorite hypothesis is true if my null hypothesis produces a small p value.

That’s the flawed logic that Forking Paths points out, and that’s the flawed logic that I’m emphasizing in this context. It’s the logic that leads people to psychologically want to force p values to mean 0.05 only in case there’s really a “5% chance that my result is just random noise” or some other meaningless thing.

In other contexts, p values are just fine. They tell you whether a certain stochastic model of what happened is adequate to explain what happened, or not.

It turns out that in looking for a definition of a random sequence Per Martin-Löf hit on the idea that you can filter random sequences from other sequences by a statistical test using p values. He proved the existence of a (non-constructive) most powerful computable test, and then defined the random sequences as those that pass this test in a certain way.

So now, mathematically, we have certain p values defining what it means to be random. That’s a legit use of p values.

But “a measure of the strength of evidence”? well, only in case you’re measuring the strength of the evidence that a given, specific, specified RNG produced your data (or your summary statistic, or whatever).

7. anon says:

I would definitely agree that p-values are overused and wasn’t trying to advocate for or against them.

Would you agree with the following though? Let y* be a particular dataset and T(y*) a calculated statistic. To keep things simple, we can take “more extreme” to mean greater than the observed statistic. Then the p-value is:

P(T(Y) >= T(y*) | H_0). (where the source of randomness is Y, the dataset)

Letting 1{} stand for the indicator function, this is the same as

E_Y[ 1{T(Y) >= T(y*)} | H_0 ]

However, the true calculation should incorporate phi, so it becomes:

E_Y[ 1{T(Y,phi(Y)) >= T(y*,phi(y*))} | H_0 ]

If phi is assumed to be a constant function of y, then this calculation will be incorrect if it actually isn’t a constant function.
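To make the two expectations concrete, here is a toy Python sketch. Everything specific in it is an assumption for illustration: the dataset has two outcome columns, phi picks the column with the larger sample mean, and T is a z statistic with known unit variance.

```python
import math
import random
from statistics import fmean


def phi(y):
    """Hypothetical analysis choice: index of the outcome with the larger mean."""
    return max(range(len(y)), key=lambda k: fmean(y[k]))


def T(y, choice):
    """z statistic (known unit variance) for the chosen outcome column."""
    col = y[choice]
    return math.sqrt(len(col)) * fmean(col)


def draw_null(rng, n, n_outcomes):
    """One dataset Y under H_0: every outcome is pure standard normal noise."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n_outcomes)]


def p_values(y_star, n_sims=20_000, seed=0):
    """Monte Carlo estimates of both expectations for the observed dataset y_star."""
    rng = random.Random(seed)
    n = len(y_star[0])
    fixed = phi(y_star)
    t_obs = T(y_star, fixed)
    nominal = forking = 0
    for _ in range(n_sims):
        y = draw_null(rng, n, len(y_star))
        nominal += T(y, fixed) >= t_obs    # phi held constant at phi(y*)
        forking += T(y, phi(y)) >= t_obs   # phi re-evaluated on each Y
    return nominal / n_sims, forking / n_sims
```

In this toy case the second estimate can be checked by hand: with two independent outcomes it should approach 1 - Phi(t)^2, where t = T(y*, phi(y*)), rather than the nominal 1 - Phi(t).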

• Yes, this is more or less what I was saying above in:

http://statmodeling.stat.columbia.edu/2018/05/21/garden-forking-paths-poker-analogy/#comment-741537

The thing is, here you’re taking *as a definition of the p value* that it gives p less than 0.05 in *repeated uses of p values for generic Neyman-Pearson decision making purposes when the null value of theta is true*. That is, if I do an experiment today on some psychology issue, and tomorrow on the speed of light, and the day after on income inequality, and the day after on nutritional issues in developing countries, and the day after on stents to treat heart disease…. and each time I collect some data, and decide what test to use after the data, and then calculate the test, if each time I do this *fixing the real theta as the one chosen in my test to represent the null value* and I decide on my choice of test T according to my own proclivities, that I will successfully be able to adjust the p value so that only 5% of the time will it give me less than 0.05

and that’s an INSANE definition of p value… yet it’s the one people intuitively deeply psychologically NEED p values to have, because if they discovered what p values do really mean, they’d have to commit Harakiri for having polluted the world of science so badly.

Now, I’m not saying that you advocate doing this… I think I now see why you see an analogy between this *flawed thought process* and a *not so flawed thought process in poker* where you really *do* care about leaking information about your own psychology through your choices.

But basically, the use of Neyman-Pearson decision making in the presence of post-data test choices *inevitably intertwines the deep seated psychology of the researcher with the output of the scientific claims* in a mockery of “objective” decision making. And no amount of fiddling the p values will fix this. “Garden of Forking Paths” is not a prescription for how to make p values do what researchers really want them to. It’s a cautionary tale about why they can *never ever do that thing*

8. anon says:

Thanks for the reply, I think I’m finally starting to see your objection (maybe?). I’d like to respond first to something you said in a different comment:

“Perhaps that’s the original insight, but I still emphasize that in general science… the model of the psychology of the researcher *is of no help to anyone* in understanding how the actual scientific phenomenon under study works.”

Could you elaborate on this? I’m not sure about the word “understanding,” but would you allow that a model of the psychology of the researcher is of help in decision making?

Repeating the quote I had taken from the forking paths paper earlier:

“It might seem unfair that we are criticizing published papers based on a claim about what they would have done had the data been different. But this is the (somewhat paradoxical) nature of frequentist reasoning: if you accept the concept of the p-value, you have to respect the legitimacy of modeling what would have been done under alternative data.” (pg 2)

Isn’t the “modeling what would have been done under alternative data” exactly a “model of the psychology of the researcher”? Or do you view this as an argument by contradiction that refutes the legitimacy of frequentist reasoning in general?

• >Or do you view this as an argument by contradiction that refutes the legitimacy of frequentist reasoning in general?

Certainly the legitimacy of Neyman Pearson decision making, or perhaps the more general “what is usually done with NHST” decision making, in which researchers find some way in which their data can be tested to produce a p value less than 0.05 and then claim that whatever their favorite alternative is must be true.

Modeling the frequency of something that has a meaningful frequency is of course a legit thing to do. A good way to do it is to make the frequency itself a parameter, and then do Bayesian reasoning on what frequency values are probable after seeing the data. In order to do this meaningfully you have to understand the difference between Bayesian Probability, and Frequency.

If the scientific process you are studying is “what researchers decide to do while researching when they have very little real knowledge of how stats really work” then we’ve probably come to the end times and we should just give up and start looking for the 4 horsemen ;-)

In any other case, knowing a model of the psychology of someone who doesn’t really understand how statistics can help them or what p values really mean isn’t helpful for understanding say methods of treating addiction, or the growth properties of hormone treated cows, or pollution levels in rivers or whatever actual scientific topic was at the heart of the improper forking-paths analysis in the first place.

• anon says:

“In any other case, knowing a model of the psychology of someone who doesn’t really understand how statistics can help them or what p values really mean isn’t helpful for understanding say methods of treating addiction, or the growth properties of hormone treated cows, or pollution levels in rivers or whatever actual scientific topic was at the heart of the improper forking-paths analysis in the first place.”

I don’t know about the word “understand,” but wouldn’t having a model of their psychology allow one to apply the proper level of skepticism (degree of belief being viewed as a “decision” here) to their claims/research findings?

As for the poker example, the thought process I outlined (thinking about the entire strategy) is orthogonal to the notion of “psychological leakage,” although the two can be used together.

There’s a real-life example relating to this. Researchers at CMU recently designed a software program to play poker at a very high level, called Libratus. The goal was to program this software in such a way that it would have a very high chance of winning no matter what opponent it faced. This is possible due to the existence of a Nash equilibrium.

Once the software is written, it is fixed. The machine’s decision at each stage is entirely based on an algorithm that was chosen to satisfy good long-term properties under repeated sampling (i.e. a high win rate over many hands). Let’s say theta is the strategy/psychology of the opponent. This seems to me like more of a frequentist approach since the software doesn’t care about what theta is. An optimal Bayesian approach would try to model opponent behavior based on their history, and ideally employ some sort of speech recognition and visual system to get physical clues… and so on, to get a posterior distribution of theta. This would be much harder to do than what was actually done.

So if we watch the software program make a play in some situation, we can’t criticize it for playing “badly” unless we know its entire strategy (i.e. what it would do in all possible scenarios). As long as it doesn’t have a software bug, it’s working “well.”

Now let’s say the editor of a journal J decides to take a sabbatical for a year and write a software program to make accept/reject decisions of scientific papers for him. The journal is influential and impacts the beliefs and actions of the general readership. As such, the editor wants to limit the number of accepted papers that would fail to replicate.

Suppose in addition that the editor has access to a historical dataset of 1000 submitted papers that were randomly selected for replication, a binary outcome variable y that shows whether they successfully replicated or not, and explanatory variables x (such as the reported p-value, the sample-size n, etc…). The editor could fit a regression model or some other statistical model to the historical data. There’s no reason to think that explanatory variables that capture the “psychology” or education/history of the researcher would not have predictive ability. Perhaps we can imagine that the researchers took a questionnaire whose results were correlated with their research process, for example.

Finally, once the statistical model has been trained, the software will make accept/reject decisions mechanically. This is a frequentist approach because the decisions are made according to a procedure that we think has good properties under repeated sampling.

At the time this journal decision software is being designed, many datasets that will be studied have not been generated yet. So if a toy version of the software were to use p-values as a primary input variable, we would ideally want to have a “model of the psychology of the researchers” producing these p-values, i.e. phi(y).

• Now that we really are studying the psychology of people trying to submit papers to journals I’ll ride off into the sunset with the 4 horsemen of the apocalypse ;-)

9. Not so fast now.

10. Paul says:

Andrew, I sometimes find your forking paths concept a bit confusing. You seem to distinguish between explicit optimization of multiple tests for significance and merely choosing tests that somehow depend on the data. But significance will not be inflated unless these data dependent choices are at least implicitly optimizing for significance. It seems like a distinction without a substantial statistical difference.

• Andrew says:

Paul:

The issue is that people would say things like: Hey, I’m not p-hacking, I only did one analysis on my dataset. In our paper, Eric and I point out that, even if you only do one analysis on your data, there are still forking paths if you could’ve done other analyses, had the data been different.

Suppose there are 1000 different analyses you might have done (here, I’m including data coding and exclusion rules as part of the analysis choices, and in that case the combinatorics make it easy enough to have thousands of possibilities, even in a simple problem), but with any particular dataset you only do one of these. Then, sure, you’re correct that you could consider the other 999 analyses to have been performed implicitly. Mathematically, there’s no difference between an implicit or an explicit analysis. But I think it feels different to a researcher. So from the point of view of research practice, it is a distinction with a difference.

• Paul says:

This is precisely where I believe your concept is mistaken, or at least insufficiently developed. You don’t get inflated effect sizes and p-values merely from the existence of alternate ways to do an analysis. If you throw one dart at a board, your chance of hitting the bullseye is not increased by the number of darts you had available for throwing. Your chance is only increased if you actually throw all those darts.

A “data-dependent selection mechanism” does not, in itself, change the situation. To get inflated results, the data-dependence has to be based on some kind of effect-optimizing search over the space of forking paths. That effect-optimizing search is not necessarily done through formal p-hacking or fishing methods, but this is just a fig leaf. It is very easy to optimize by purely exploratory methods, such as looking at multiple subgroup summaries and choosing to focus on the one that maximizes the effect of interest in your study.
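One way to probe this claim is a small simulation. The sketch below is only a toy under stated assumptions: five null subgroups of 20 standard normal observations, a one-sided z test with known unit variance, and three made-up selection rules. It illustrates the distinction being drawn; it does not settle it.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def p_one_sided(group):
    """One-sided z-test p-value for mean > 0, assuming known unit variance."""
    z = math.sqrt(len(group)) * fmean(group)
    return 1.0 - STD_NORMAL.cdf(z)


def spread(group):
    """Sum of squared deviations from the group's own mean."""
    m = fmean(group)
    return sum((v - m) ** 2 for v in group)


def pick_first(groups):
    """Choice fixed in advance: always analyze subgroup 0."""
    return 0


def pick_most_variable(groups):
    """Data-dependent choice that does not look at the effect."""
    return max(range(len(groups)), key=lambda k: spread(groups[k]))


def pick_largest_effect(groups):
    """Data-dependent choice that optimizes the effect of interest."""
    return max(range(len(groups)), key=lambda k: fmean(groups[k]))


def false_positive_rate(pick, n_groups=5, n=20, n_sims=5000, alpha=0.05, seed=0):
    """Share of null experiments where the selected subgroup gives p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        groups = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n_groups)]
        hits += p_one_sided(groups[pick(groups)]) < alpha
    return hits / n_sims


if __name__ == "__main__":
    for rule in (pick_first, pick_most_variable, pick_largest_effect):
        print(rule.__name__, false_positive_rate(rule))
```

In this setup the variance-based pick should stay near 0.05, because for normal data the sample mean is independent of the spread around it, while the effect-based pick should approach 1 - 0.95^5, roughly 0.23.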

I think scientists are professionally obligated to know when they are doing this kind of thing and report it. And statisticians are professionally obligated to educate them on that obligation. Your 2013 paper was a significant advance but more clarity is needed to change behavior.

• Andrew says:

Paul:

I think that when people make their decisions on data coding, exclusion, subsetting, analysis, etc., they are guided by the outcome but in an informal way. For example, someone in the control group didn’t follow the instructions (as judged by some post-experiment survey) but did well on the task anyway. That looks fishy—he didn’t follow the instructions, so exclude this data point! But if someone in the treatment group didn’t follow the instructions but did well on the task anyway, the researcher might think that the person just didn’t fill out the post-experiment survey accurately, and thus not exclude the data point. Lots and lots of such decisions get done, not always with any formal data analysis of the different options.

Suppose a researcher makes 10 such decisions: that’s 2^10 = 1024 possibilities. The researcher might feel that he or she has just done one analysis because all the choices were made implicitly, while looking at the data. Or maybe it feels like 10 analyses, if the researcher is particularly self-aware when making these decisions. But it won’t feel like 1024.

In any case, I’m sure you’re right about more clarity being needed. I hope this discussion helps.

• Paul says:

Yes, I see your point much better now.

I’m a bit overwhelmed at the prospect of managing such an issue, when post-hoc rationalization comes so naturally to people! But it’s heartening to see the discussions that are happening here and elsewhere.

• Anoneuoid says:

> If you throw one dart at a board, your chance of hitting the bullseye is not increased by the number of darts you had available for throwing. Your chance is only increased if you actually throw all those darts.

Isn’t Andrew’s “forking paths” more akin to having partial freedom to move the location of the bullseye depending on where the dart ended up? You can’t move it all over the board but can make some adjustments here and there.

11. Me too b/c I’m a novice and I learn a great deal from this blog. But I had an inkling back in the 90’s that monitoring the trajectory of ‘statistics’ was necessary. I wish now that I had maintained this field from then. Then again sometimes it is a better learning experience when you are an outsider to it too. It’s the poker analogy with which I have issues more generally because I think it delimits how we apply it to other queries. Just a hunch really.

12. Anoneuoid says:

Starting new thread due to nesting…

Carlos Ungil wrote:

> It’d be easy enough to run monte carlo simulations of the analysis pipeline to get the null distribution of whatever statistic.

Here is code that gives the null distribution of p-values if choosing to remove outliers and/or do non-parametric vs parametric tests depending on the data: https://pastebin.com/ct4LZMsB

And the distributions of p-values (you can figure out what corresponds to < .05 or whatever from this):
https://image.ibb.co/fDBq1T/p_scenario.png

That is the type of thing I was thinking of. What requirements would you add to make it impractical?
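For readers who cannot open the links, here is a rough Python sketch of the same kind of simulation. It is not the pastebin code. It assumes a one-sample design with 30 standard normal observations under the null, a normal approximation to the t statistic, an exact sign test as the non-parametric option, a 2-SD outlier rule, and the most opportunistic researcher (one who reports the smallest of the four resulting p-values). All names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

STD_NORMAL = NormalDist()


def parametric_p(x):
    """Two-sided p for mean(x) == 0, using a normal approximation to the t statistic."""
    z = fmean(x) / (stdev(x) / math.sqrt(len(x)))
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def sign_test_p(x):
    """Two-sided exact sign test for median(x) == 0."""
    n = len(x)
    k = sum(v > 0 for v in x)
    tail = sum(math.comb(n, j) for j in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def drop_outliers(x, cutoff=2.0):
    """Remove points more than `cutoff` sample SDs from the sample mean."""
    m, s = fmean(x), stdev(x)
    return [v for v in x if abs(v - m) <= cutoff * s]


def forking_pipeline(x):
    """Assumed researcher: try each outlier/test combination, report the smallest p."""
    candidates = (x, drop_outliers(x))
    return min(test(c) for c in candidates for test in (parametric_p, sign_test_p))


def null_p_values(pipeline, n=30, n_sims=3000, seed=0):
    """Sorted p-values reported by `pipeline` on datasets drawn from the null."""
    rng = random.Random(seed)
    return sorted(pipeline([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(n_sims))


def share_below(p_values, alpha=0.05):
    """Fraction of reported p-values strictly below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


if __name__ == "__main__":
    fixed = null_p_values(parametric_p)
    forked = null_p_values(forking_pipeline)
    print("fixed pipeline, share of p < 0.05:  ", share_below(fixed))
    print("forking pipeline, share of p < 0.05:", share_below(forked))
    # Nominal cutoff that corresponds to a real 5% rate under the forking pipeline.
    print("nominal cutoff for a true 5% rate:  ", forked[int(0.05 * len(forked))])
```

The last line reads off the 5% quantile of the simulated null distribution, which is one way to “figure out what corresponds to < .05” for a pipeline that has been fully written down.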

> But you seem to agree that a “proper” p-value calculation has to take into account the actual output that would be produced under different realizations of the data. Otherwise you get a p-value that is not correct (because we didn’t model correctly the full analysis pipeline). This incorrect p-value is what is usually called the “nominal” p-value.

I may not understand this, it sounds like “didn’t model correctly the full analysis pipeline” means “the researcher chose to test a null hypothesis they ensured was false”. That is exactly what I think is standard in many areas of research today, so do not see where you disagree. What do you mean by “different realizations of the data”?

• Anonymous says:

Your calculation assumes that what you call the “analysis pipeline” is completely determined before the experiment is conducted. This may be (approximately) true when the details are pre-registered and everything runs according to the protocol (and you still have to assume that everything would have run according to the protocol no matter what the realized data would have looked like). Corrections for multiple comparisons are routinely done, for example.

When the analysis is determined along the process, depending on the data at hand, doing a Monte Carlo simulation of alternative scenarios is not possible. Choosing priors according to the data is also wrong as a matter of principle, by the way.

• Anoneuoid says:

We can look at it like:

1) You need to model the process you think generated the data
2) You make up that process as the data comes in
3) You modify the model to reflect the process

If you are making up the model as you go along and want to test it, then that also needs to be part of the model. In that case, ok things are getting so confused that maybe it is impossible. It is probably easier to just check a model you come up with after exploration/abduction on new data… I’d like to see someone attempt it though.