Roahn Wynart asks:

Scenario: I collect a lot of data for a complex psychology experiment. I put all the raw data into a computer. I program the computer to do 100 statistical tests. I assign each statistical test to a key on my keyboard. However, I do NOT execute the statistical test. Each key will trigger the evaluation of a different statistical test. I push, say, the “B” key and I get a positive result at 98% confidence. I then stop and publish. I never push any other key.

Is there something wrong with that procedure?

My reply:

1. Yes, there’s something wrong with this procedure, and the clear “something wrong” is the use of a p-value to decide whether to publish something. Even if your computer only has one key, so that your p-value is unequivocally kosher, it’s a mistake in my opinion to use statistical significance to decide what to publish. The problem is that if your signal-to-noise ratio is low, then any statistically significant estimate will be a big overestimate of the true effect, and it has a substantial chance of being in the wrong direction. This is discussed by Carlin and me in our recent paper in Perspectives on Psychological Science.

2. Is this a legitimate p-value? That’s a tough one. The easy answer is, if you choose which key to press after seeing the data then, no, in general this is not a legitimate p-value, for reasons discussed by Loken and me in our recent paper in American Scientist (the garden of forking paths). If you choose the key completely at random, then, sure, I guess it’s an ok p-value, although this is a bit controversial in frequentist statistics as it depends on what is being conditioned on. Even then, though, I wouldn’t recommend the procedure because of point 1 above.
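Point 1 can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not the calculation from the Carlin paper: the true effect (0.1), the standard error (1.0), the 1.96 cutoff, and the function name `simulate_type_s_m` are all made up for illustration. It draws many noisy estimates of a small true effect, keeps only the "statistically significant" ones, and reports how often those have the wrong sign and how much they exaggerate the true effect.

```python
import random
import statistics


def simulate_type_s_m(true_effect=0.1, se=1.0, n_sims=100_000, seed=0):
    """Simulate estimates ~ Normal(true_effect, se) and inspect the significant ones.

    All numeric defaults are illustrative assumptions, not values from the post.
    Returns (power, wrong_sign_rate, exaggeration_ratio).
    """
    rng = random.Random(seed)
    z_crit = 1.96  # conventional two-sided 5% cutoff, assumed here
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) > z_crit * se:
            significant.append(estimate)

    # Fraction of all simulated studies that reach significance.
    power = len(significant) / n_sims
    # Among significant estimates, fraction with the wrong sign (Type S).
    wrong_sign_rate = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    # Among significant estimates, average magnitude relative to the truth (Type M).
    exaggeration_ratio = statistics.mean(abs(e) for e in significant) / abs(true_effect)
    return power, wrong_sign_rate, exaggeration_ratio


if __name__ == "__main__":
    power, wrong_sign, exaggeration = simulate_type_s_m()
    print(f"power={power:.3f}  wrong-sign rate={wrong_sign:.3f}  exaggeration={exaggeration:.1f}x")
```

With a signal this weak relative to the noise, the estimates that clear the significance bar are necessarily far from the true value, and a large share of them point the wrong way. Raising `true_effect` relative to `se` makes both problems shrink.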

One day psychologists and psycholinguists will wake up to the Type S and M error problem. Until then, I am going to hurry up and publish all my low-powered experiments quickly. Maybe I can relax. Maybe they never will realize what this means :). I believe Cohen (maybe also Meehl? don’t know) tried and failed to educate psychology on power, so why would Gelman and Carlin succeed?

There’s more wrong with this fantasized procedure than just the p-value emphasis. You haven’t explored the data at all. You need to explore it to learn what it might be able to tell you: is there good coverage, is there bias, do different ways to look at it give consistent results, is there good signal to noise, might there be errors in data recording, should the parameters be transformed before further study, are the data consistent with other claimed results, and so on.

Note that “exploring the data” doesn’t mean roaming around until you find a “good” p value. For an eye-opening account of what it means to explore data carefully, try Cleveland’s book “Visualizing Data”.

I think there’s a third problem, which you kind of allude to but which should probably be emphasized: he needs to specify what he would have done if the first key had not given a good result. Would he have kept pushing keys till he got something?
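The difference between "press one key and stop no matter what" and "keep pressing until something turns up" can be sketched numerically. The simulation below is a rough illustration, not anyone's actual analysis: it assumes every null hypothesis is true, treats the 100 tests as independent (tests run on the same data set would normally be correlated), and uses a 0.02 threshold to match the "98% confidence" in the question. The function name `false_positive_rates` is hypothetical.

```python
import random


def false_positive_rates(n_keys=100, alpha=0.02, n_sims=20_000, seed=1):
    """Compare two stopping rules when no real effect exists.

    Assumptions (illustrative only): each test's p-value is uniform on [0, 1]
    under the null, and the tests are independent of one another.
    Returns (rate_one_random_key, rate_keep_pressing).
    """
    rng = random.Random(seed)
    one_key_hits = 0
    any_key_hits = 0
    for _ in range(n_sims):
        p_values = [rng.random() for _ in range(n_keys)]
        # Rule A: pick a single key at random, press it once, and stop.
        if p_values[rng.randrange(n_keys)] < alpha:
            one_key_hits += 1
        # Rule B: keep pressing keys until some test comes out "significant".
        if min(p_values) < alpha:
            any_key_hits += 1
    return one_key_hits / n_sims, any_key_hits / n_sims


if __name__ == "__main__":
    one_key, keep_pressing = false_positive_rates()
    print(f"one random key: {one_key:.3f}   keep pressing: {keep_pressing:.3f}")
```

Under these assumptions the single-key rule produces false positives at about the nominal rate, while the keep-pressing rule finds a "significant" result in the large majority of null data sets. The published p-value looks the same in both cases, which is why the unstated plan matters.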

Also, is there a typo in his name? Should it be “Wynar”?

“If you choose the key completely at random, then, sure, I guess it’s an ok p-value, although this is a bit controversial in frequentist statistics as it depends on what is being conditioned on.”

What are different things that might be conditioned on here?

Z:

There’s much discussion in the frequentist literature of what to condition on in a hypothesis test. For example, suppose you do a simple experiment in which you first choose the sample size N at random. You can define p-values conditional on N or unconditional on N, and these give different answers. This particular problem might seem too silly to be interesting but the same issues arise in more complicated settings.
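A toy version of the random-N example, offered as a sketch under assumptions that are not in the comment above: N is 10 or 1000 with equal probability, the data are normal with known standard deviation 1, the null is a zero mean, and the unconditional p-value uses the raw sample mean as its test statistic. The function names are hypothetical.

```python
import math


def two_sided_normal_p(z):
    """Two-sided tail probability of a standard normal beyond |z|."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_conditional_on_n(xbar, n, sigma=1.0):
    """P-value treating the realized sample size n as fixed."""
    return two_sided_normal_p(xbar * math.sqrt(n) / sigma)


def p_unconditional(xbar, sizes=(10, 1000), sigma=1.0):
    """P-value averaging over the equally likely sample sizes (an assumption).

    This is P(|sample mean| >= |xbar|) under the null, where the sample size
    itself is part of the random experiment.
    """
    return sum(p_conditional_on_n(xbar, n, sigma) for n in sizes) / len(sizes)


if __name__ == "__main__":
    xbar, n = 0.5, 10  # hypothetical observed mean with the small sample size
    print("conditional on N:  ", p_conditional_on_n(xbar, n))
    print("unconditional on N:", p_unconditional(xbar))
```

For the same observed mean the two definitions give different numbers, because the unconditional version credits the experiment with the precision it would have had if the large sample size had been drawn.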

For more interesting examples of what to condition on and why you might do that, see our paper on conditional randomization inference for spillovers / peer effects / interference in networks (https://arxiv.org/abs/1506.02084) or a related prior paper by Peter Aronow (http://smr.sagepub.com/content/41/1/3.short).

“if you choose which key to press after seeing the data then, no, in general this is not a legitimate p-value”

This is not a statement of fact; rather, it illustrates the inherent absurdity of Frequentism when pushed to extremes. What happens when the researcher presses the key at the same moment they look at the data? Does it go into a quantum superposition of legitimate and illegitimate p-values?

I don’t think the legitimacy of a p value rests on decisions made by the analyst. The p value is legitimate just exactly when it’s based on analyzing the output of random number generators. Period. If you do an analysis and you use a random number generator to sample something or to divide something into N groups or whatnot, then you compute a p value based on actual verifiable assumptions about that RNG process, you will get a legitimate p value. It’s legitimate because in essence it tells you what fraction of the RNG seeds would produce results as extreme or more extreme than your result. And, it’s legitimate because proper RNG algorithms have been run through large batteries of mathematical tests which ensure that the frequency properties of the sequences are what they are supposed to be. The legitimacy rests entirely on that large battery of frequency tests that lets us call a number sequence generator a proper RNG.
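The "fraction of RNG seeds" reading corresponds to what is usually called a randomization (permutation) test. Here is a minimal sketch, assuming a two-group comparison of means; the data values and the function name `randomization_p_value` are invented for illustration and are not from any real experiment.

```python
import random


def randomization_p_value(treated, control, n_draws=10_000, seed=0):
    """Monte Carlo randomization test for a difference in group means.

    Re-runs the random assignment many times and reports the share of
    assignments whose mean difference is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    n_treated = len(treated)
    pooled = list(treated) + list(control)

    def mean_diff(values):
        first, second = values[:n_treated], values[n_treated:]
        return sum(first) / len(first) - sum(second) / len(second)

    observed = abs(mean_diff(pooled))
    tolerance = 1e-12  # guards against floating-point noise when differences tie
    extreme = 0
    for _ in range(n_draws):
        rng.shuffle(pooled)  # one hypothetical re-assignment of units to groups
        if abs(mean_diff(pooled)) >= observed - tolerance:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_draws + 1)


if __name__ == "__main__":
    # Hypothetical measurements, purely for illustration.
    treated = [5.1, 4.8, 6.0, 5.5, 5.9, 6.2]
    control = [4.2, 4.0, 4.9, 4.4, 5.0, 4.1]
    print(randomization_p_value(treated, control))
```

The only probabilistic assumption used is the one that can actually be verified: that units were assigned to groups by a known random mechanism.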

Any other use of a p value amounts to saying that some other process you are calling “my experiment” is a kind of random number generator. If you want to make that claim, you must first get it to pass the battery of frequency tests. Oh, wait, you can’t repeat your experiment 100 billion times in order to generate a sample sufficient to feed into Dieharder? Ah, well, I guess you’ll just have to either do actual randomization using an RNG that does pass the tests, or be Bayesian and give up on p values.

Further thoughts along this line show how you could make valid inferences about a process using just p values. For example, you run an experiment in which an RNG is used to assign two groups. You then conduct a variety of tests. You discover that for test Tn you get a small p value, perhaps 0.01.

Now you can legitimately infer that “either something is unusual about these two groups such that the assumptions of my test are wrong, or the assumptions of my test are correct, and I have one of the unusual samples that lead to p = 0.01 even when the assumptions are correct”.

How can you distinguish? Very simple, run the randomized experiment say 20 or 100 times, and see if in multiple repetitions the p value you get from doing the same experiment and conducting the same exact Tn test results in a non-uniformly distributed p value. When you consistently get small p values, then you can conclude “the assumptions of my test are probably wrong”. (For all the good that does you!)
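The replication idea can be sketched as a simulation. This is an illustration under assumed conditions, not a recipe from the comment: outcomes are normal with known standard deviation 1, each replication compares two groups of 50 with a z-test, and the maximum gap between the empirical distribution of the p values and the uniform distribution (a Kolmogorov-Smirnov-style distance) stands in for "non-uniformly distributed". All names and numbers are hypothetical.

```python
import math
import random


def replicate_p_values(true_diff, n_per_group=50, n_reps=100, seed=0):
    """Repeat the same two-group experiment and collect the z-test p values.

    Assumes normal outcomes with known standard deviation 1 (illustrative only).
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_reps):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
        diff = sum(group_b) / n_per_group - sum(group_a) / n_per_group
        z = diff / math.sqrt(2.0 / n_per_group)
        p_values.append(math.erfc(abs(z) / math.sqrt(2)))
    return p_values


def distance_from_uniform(p_values):
    """Largest gap between the empirical CDF of the p values and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


if __name__ == "__main__":
    null_ps = replicate_p_values(true_diff=0.0)
    effect_ps = replicate_p_values(true_diff=0.5)
    print("test assumptions hold:  ", distance_from_uniform(null_ps))
    print("test assumptions broken:", distance_from_uniform(effect_ps))
```

When the test's assumptions hold, the replicated p values stay close to uniform; when they fail, the p values pile up near zero and the distance becomes large. A single p value cannot show this pattern, which is the point of repeating the experiment.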

Wait, you say, I can’t possibly afford to run a $25M drug safety and effectiveness test 20 to 100 times in a row!!!

Oh, well, then you’d better do science (i.e., causal modeling using realistic mechanistic prediction models) and analyze the science using Bayesian analysis (i.e., an analysis of the plausibility of various facts given model assumptions) instead of relying on terribly inefficient frequency properties of random assignment in long-run repetitions to detect “there is or there isn’t a difference of the type Tn”.

The whole call for “replication” is really just people starting to realize that a single p value tells you “either my assumptions are wrong, or I have randomly got one of the weird samples” and the only way to distinguish which using p values is to replicate over and over again until you’re convinced of which one it is. Maybe in another 20 or 30 years people will then finally realize also that “my assumptions are wrong” is not all that informative either, so that even if a difference replicates it still isn’t really that informative about the science. I’m not holding my breath though.

I think this also makes some language usage clearer. For example, when the original writer and Gelman write something like “is this a legitimate p value?” what they’re really asking is “given the p value is low, can I legitimately infer that the assumptions of the test are wrong?” or the like.

The answer is: if a) you used an RNG to do assignment of the groups and b) you replicate the whole experiment exactly multiple times and continue to get small p values, *then* you will be able to conclude “yes the assumptions of my test are wrong” and it won’t matter how many buttons you have on your keyboard, or whether you looked at the data first or after or had soup for breakfast or wear red on Wednesdays or secretly want the results to be one way or another.

Hey Laplace, what happened to your blog (again!)? I enjoyed reading it.

Thank you for pointing me towards the “The problems with p-values…” article! In that article, at the end of paragraph 3, there is a spelling error that affects the meaning. It says “choices of regression predictors and iteractions”. Should “iteractions” be “interactions” or “iterations”?

Is every analyst doing the same? This is why we see bad decisions all around!

I’d be more concerned with how you collected the complex data.

This also reminds me of the people (all theorists) who seek to “fix” the conservativeness of the Fisher Exact Test (or similar randomization tests) by augmenting the experimental result with a random draw to yield (ex ante) exactly 95 percent coverage. It’s the stupidity of such a procedure that caused me to give up p value significance thresholds years ago.