The problem with p-hacking is not the “hacking,” it’s the “p” (or, Fisher is just fine on this one)

Clifford Anderson-Bergman writes:

On CrossValidated, a discussion came up I thought you may be interested in. The quick summary is that a poster asked whether Fisher’s advice to go get more data when results are statistically insignificant isn’t essentially an endorsement of p-hacking.

After a bit of a discussion that spanned an answer and another question/answer pair, a rather interesting concept came up. That is, quite often in real-world data analysis, an analyst who uses statistical significance is forced to make a choice between preserving type I error rates and minimizing type S/M errors. Unfortunately, many researchers will dogmatically think it is better to preserve type I error rates, despite the fact that type I errors are typically impossible (the null is essentially never exactly true) and type S/M errors are very likely!

Anyways, I thought you might find the conflict of these two approaches interesting. If you’re curious about the discussion, here are the two posts:
https://stats.stackexchange.com/q/417472/76981
https://stats.stackexchange.com/q/417716/76981

I clicked the first link above, which started with this question:

Allegedly, a researcher once approached Fisher with ‘non-significant’ results, asking him what he should do, and Fisher said, ‘go get more data’.
From a Neyman-Pearson perspective, this is blatant p-hacking, but is there a use case where Fisher’s go-get-more-data approach makes sense?

My quick answer, and it’s not a joke, is that it’s not “p-hacking” if you never try to make a claim of statistical significance. See this post from 2018 about why I think sequential data collection is OK, not at all a problem in the way that many people think.

To put it another way: If your only goal as a scientist is to reject the null hypothesis of zero effect and zero systematic error, then, sure, you can just gather more data until you reject that hypothesis. But so what? There was never a good reason to take that hypothesis seriously in the first place.
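
To see what that looks like concretely, here is a minimal simulation sketch (purely illustrative, not from the original post): a two-sided z-test with sigma known, a true effect of exactly zero, and an analyst who adds 10 observations at a time and stops the first time p < 0.05. The rejection rate it prints is far above the nominal 5%, and it only grows if you allow more looks.

# Sketch: with a true effect of exactly zero, repeatedly adding data and
# re-testing rejects the point null far more often than the nominal 5%.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sims, n_max, batch = 5000, 500, 10

ever_rejected = 0
for _ in range(n_sims):
    x = rng.normal(0, 1, n_max)            # the null is true: the mean really is 0
    for n in range(batch, n_max + 1, batch):
        z = x[:n].mean() * np.sqrt(n)      # z statistic, sigma = 1 treated as known
        if 2 * norm.sf(abs(z)) < 0.05:     # two-sided p < 0.05: stop and "reject"
            ever_rejected += 1
            break

print(f"nominal level: 0.05, rejection rate with peeking: {ever_rejected / n_sims:.2f}")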

The problem with p-hacking is not the “hacking,” it’s the “p.” Or, more precisely, the problem is null hypothesis significance testing, the practice of finding data which reject straw-man hypothesis B, and taking this as evidence in support of preferred model A.

The point I just made is also expressed at the second link above. I’m glad to see these ideas being spread in this way.

37 thoughts on “The problem with p-hacking is not the “hacking,” it’s the “p” (or, Fisher is just fine on this one)”

  1. My understanding is that Fisher advised getting more data when you DO reject, hence a 5% critical value is not as much a concern. The goal is to cast a wider net to make fewer Type II errors and cut down on the Type I errors by repeating the study. But the advice to get more data is always good. I tell students that if someone who knows they’ve had a stats course asks their advice on an analysis, and they don’t know what to say, the safe advice is always “get more data”.

    • ‘the safe advice is always “get more data” ‘

      I think a lot of people who have gone this route would disagree. If you need more data either your model is wrong or your method has unsupportable assumptions. Time to review the study design and/or seek a different angle on the problem.

      • Jim:

        Part of the problem is with the notion of “safe advice,” which is related to the common idea in statistics of “conservative procedures.” These are attitudes that miss the idea that there are tradeoffs and opportunity costs. Which is ironic, given that acknowledging tradeoffs is a central point of traditional political conservatism. Aleks and I discussed these issues a few years ago in our article, Bayes: radical, liberal, or conservative?

        • ‘the notion of “safe advice”’

          Yes, it would be surprising if there were some kind of blank check advice you could give for any “X” type of science project!

          I thought your paper was interesting in that it encapsulates so much that’s challenging in science and statistics. So few “data” and so many questions!

          You show four “data” but they aren’t actually individual measurements, they’re aggregates of four experiments. The experiments are summarized as number of fatalities out of five animals in each experiment per dosage, but there’s no data on the (a-m) other conditions of each experiment, or (n-z) the health effects on the survivors. It might – having seen enough to know that it also might not – be sensible to presume that a-m are constant in each experiment, but surely some of n-z vary among individuals and may have seemingly obvious explanations that would add to the understanding of the overall outcome.

          All of these considerations are hiding in the data table, before any statistics have even been contemplated.

          From the statistical standpoint:

          Sampling: I plotted the data logged and unlogged. The log plot looks beautifully linear, the standard chart beautifully exponential, but of course it’s only four “samples” and in this case – contrary to my previous advice – more “samples” are clearly desirable, as a single additional experiment could change the entire outlook. Just the same there’s no sayin’ that this particular representation of the experiments – fatalities vs dose – is the most enlightening, and I could just as easily imagine any number of plots of dosage vs (health effect) vs time, which might yield enough info to eliminate the suggestion of further sampling.

          It’s cool to suppose a variety of prior distributions, but shouldn’t those be predicated on some aspect of reality, rather than just treated as a bunch of equally likely possibilities? My thinking is that this level of statistical summary would be appropriate only after a-m and n-z have been analyzed, at which point there may be some clear justification for the prior – or not. If not, trying on priors to see how they fit seems reasonable, but in the end, as always, they have to be tested against new data in new experiments, and ultimately wouldn’t it be desirable to generate some natural explanation for the prior?

          But after all that, at the end of the day, presuming a-m are constant, it’s pretty clear that the substance in question is fatal at high doses, so the exact probability of that might be an interesting thing to calculate but in reality no one is going to want to get close, so the fine details might not be important anyway.
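
          For what it’s worth, here is a rough sketch of that “trying on priors” exercise, using hypothetical stand-in numbers in the same format as the table (deaths out of five animals at each of four doses), not the actual values from the paper:

          # Sketch with HYPOTHETICAL stand-in data (not the values from the paper):
          # a logistic dose-response model evaluated on a grid under two different
          # priors on the slope, to see how much the prior matters with four points.
          import numpy as np
          from scipy.special import expit
          from scipy.stats import binom, norm

          log_dose = np.array([-0.9, -0.3, 0.0, 0.7])   # hypothetical doses
          deaths   = np.array([0, 1, 3, 5])             # hypothetical deaths out of 5
          trials   = np.array([5, 5, 5, 5])

          a_grid = np.linspace(-4, 8, 300)              # intercept grid (flat prior)
          b_grid = np.linspace(-5, 40, 300)             # slope grid
          A, B = np.meshgrid(a_grid, b_grid, indexing="ij")

          p = expit(A[..., None] + B[..., None] * log_dose)
          loglik = binom.logpmf(deaths, trials, p).sum(axis=-1)

          for label, logprior in [("flat prior on slope", np.zeros_like(B)),
                                  ("normal(0, 5) prior on slope", norm.logpdf(B, 0, 5))]:
              post = np.exp(loglik + logprior - (loglik + logprior).max())
              post /= post.sum()
              b_mean = (post * B).sum()
              b_sd = np.sqrt((post * (B - b_mean) ** 2).sum())
              print(f"{label}: posterior slope mean {b_mean:.1f}, sd {b_sd:.1f}")

          With only four aggregated points the two priors give noticeably different slope summaries, which is more or less the sensitivity being worried about above.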

          :)

  2. An important consideration seems to be missing here (and largely missing from the discussions on Cross Validated): the effect of gathering extra data varies with the nature of the inferences that are to be drawn.

    If the scientific inference is to be a decision based on a significant/not significant outcome of a Neyman–Pearsonian hypothesis test then (god help us!) adding more data does disrupt the calibration of the inferential procedure. On the other hand, if the scientific inference is to be made on the basis of a thoughtful consideration of evidence then more data is more evidence, which is a good thing.

    The p-value from a significance test (not a hypothesis test, and not the thing called “NHST”) is an index of the evidence. It is the fact that the evidence is anchored to a null hypothesis within the model that you should be concerned with, not the truth status of that hypothesis or the fact that it can be called a straw man.

    Discussion of ideas like these is hampered by the predominant attitude that ignores the essential distinction between statistical inferences, which are formed on the basis of methods working within statistical models, and scientific inferences which are formed by human minds operating in the real world.

    I’ve written a beginner’s guide to this that many will find illuminating: https://link.springer.com/chapter/10.1007/164_2019_286

  3. If you collect more data and ADD it to the previous data it is not very serious p-hacking. You just have to do an adjustment for the sequential design.

    The problem is not the p-value at all. It is not using all the data, which is identical to throwing away data you don’t like. Bayesians seem to continually accuse frequentists of not using previous knowledge. But really – when was the last time a Bayesian analyst spent half their research time summarising all previous studies into a likelihood function and then used this as their prior? They never do!
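
    A minimal sketch of what an adjustment for the sequential design can amount to in the simplest case, two equally sized looks at the data. The stricter per-look threshold of roughly 0.029 is the Pocock-style value; here it is simply asserted and then checked by simulation under the null:

    # Sketch: a two-look sequential z-test when the null is true. Testing at 0.05
    # per look inflates the overall type I error; a stricter per-look threshold
    # (roughly the Pocock value of ~0.029) brings it back to about 5%.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n_sims, n1, n2 = 20000, 50, 100             # look after 50 and after 100 observations

    x = rng.normal(0, 1, (n_sims, n2))          # the null is true
    z1 = x[:, :n1].mean(axis=1) * np.sqrt(n1)   # z statistic at the first look
    z2 = x.mean(axis=1) * np.sqrt(n2)           # z statistic at the final look
    p1 = 2 * norm.sf(np.abs(z1))
    p2 = 2 * norm.sf(np.abs(z2))

    for threshold in (0.05, 0.029):
        overall = np.mean((p1 < threshold) | (p2 < threshold))
        print(f"per-look threshold {threshold}: overall type I error ~ {overall:.3f}")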

    • “when was the last time a Bayesian analyst spent half their research time summarizing all previous studies”…yeah, I agree, that never happens. But I don’t think it should, either, or at least not usually.

      It’s true that getting the prior distribution _exactly_ right should enable you to add new data without any loss of information, but I don’t think I’ve ever encountered (or heard of) a situation in which “exactly right” is a reasonable goal. Previous studies are done with different methodologies, different populations, etc.

      If the prior distribution dominates your data then your dataset isn’t contributing much. If your data dominate the prior, for any reasonable prior, then you don’t need the prior. In most of the middle ground you should use an informative prior distribution but the details won’t matter that much, certainly not enough to justify spending “half your time” looking at previous studies.

      For example, based on what I know right now, if I were analyzing some new data on, say, the effectiveness of wearing N95 masks as protection against COVID transmission in the general population, I’d have a prior with the bulk of the probability in the range 0%-30%, with tails on both ends (mildly harmful, very protective). Since my current state of knowledge is based mostly on reading a few abstracts that have been mentioned over the past year in comments on this blog, I would certainly spend a couple of hours checking the rest of the literature to see if I should modify that, and I would do so. But if I had enough data to be worth analyzing, it wouldn’t matter whether my revised prior had just a bit more probability in the worse-than-zero end, or a bit more above 30%, or was slightly narrower.
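
      To illustrate that middle ground with made-up numbers (a beta-binomial sketch, two deliberately different informative priors):

      # Sketch: with moderate data, two quite different informative priors give
      # nearly the same posterior; with little data they do not. Numbers made up.
      from scipy.stats import beta

      priors = {"skeptical Beta(2, 8)": (2, 8), "agnostic Beta(5, 5)": (5, 5)}

      for n, k in [(10, 2), (500, 100)]:        # k "successes" out of n trials
          print(f"n = {n}, observed proportion = {k / n:.2f}")
          for label, (a0, b0) in priors.items():
              post = beta(a0 + k, b0 + n - k)   # conjugate beta-binomial update
              lo, hi = post.ppf([0.05, 0.95])
              print(f"  {label}: mean {post.mean():.3f}, 90% interval ({lo:.3f}, {hi:.3f})")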

  4. Seems to me that if you decide to collect some more data and then evaluate p again, then even if you like the result you should collect the same amount again and evaluate p again. That would give you at least a small guard against adopting a favorable p value that you got after the first data addition.

    But really, no matter how one likes to interpret the “meaning” of a p-value, there is no getting around its poor statistics. From a given data set one can only ever know an estimate of the p-value. Given, say, a sample p-value of 0.04, what is the chance that it is, for example, actually greater than 0.051? Pretty large. IOW, if someone plans to base a decision on the value of p, they should be examining the chance of the decision criterion being wrong.

    • > Given, say, a sample p-value of 0.04, what is the chance that it is, for example, actually greater than 0.051?

      The chance that _what_ is actually greater than 0.051? There is no “true p-value” with some definite value above or below 0.051.

      Maybe you mean that there is an underlying distribution of p-values and you ask about the probability that, if we sample again, we get a p-value over 0.051?

      Some additional assumptions are required to say that the probability is pretty large. If you know that the null hypothesis is true then the distribution is uniform on [0, 1] and the probability is 94.9%. (But if you knew that already what’s the point?)

      If you didn’t know anything about the distribution you wouldn’t have a reason to think that the first sample is “low” (below the median value) and the second sample is likely to be higher.

      • pardon my lack of statistical knowledge, but I’m puzzled by your explanation of p-filtering.

        If you draw 100 random samples of five individuals each from a normally distributed population, from what I know there’s a five percent chance of p<0.05 in a purely random sample, so only five of your samples should have p<0.05, yet your graph shows dozens. How do you get dozens of samples w/ p<0.05 from 100 samples of a normally distributed population?

        • > only five of your samples should have p<0.05

          That would be assuming a couple of things that do not apply in this example:

          – that the p-value is calculated using the sample estimate of the standard deviation, rather than treating the actual value sigma=1 as known, and

          – more importantly, that the null hypothesis is true; here the true value of the parameter is mu=1 while the null hypothesis is mu=0.
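
          A quick sketch of both points (a two-sided z-test of mu = 0 with sigma = 1 treated as known, matching the setup described above):

          # Sketch: 100 samples of size 5. When the true mean is 0, about 5 of them
          # have p < 0.05; when the true mean is 1, dozens do.
          import numpy as np
          from scipy.stats import norm

          rng = np.random.default_rng(2)

          for true_mu in (0.0, 1.0):
              x = rng.normal(true_mu, 1, (100, 5))
              z = x.mean(axis=1) * np.sqrt(5)       # sigma = 1 treated as known
              p = 2 * norm.sf(np.abs(z))
              print(f"true mu = {true_mu}: {np.sum(p < 0.05)} of 100 samples with p < 0.05")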

  5. I agree with Michael Lew. Fisher’s suggestion amounts to a meta-analysis which under a Neyman-Pearson approach corresponds to controlling a per-comparison error rate. Family-wise error rate control and multiplicity adjustments can be useful, but they are not required. This meta-analytic approach is congruent with Bayesian practices. It’s simply a matter of whether one prefers an objective definition and interpretation of probability or an unfalsifiable one. Are we measuring the experiment or the experimenter?

      • I do, but it is not a matter of opinion. One interpretation of the posterior is that it measures one’s belief about a population-level parameter, say, theta. Another interpretation is that the unknown fixed true theta was randomly selected from a known collection or prevalence of theta’s, and the observed data is used to subset this collection forming the posterior. The unknown fixed true theta is now imagined to have instead been randomly selected from the posterior. A third interpretation is that all values of theta are true simultaneously; the truth exists in a superposition depending on the data observed.

        Ascribing any of these interpretations to the posterior allows one to make philosophical probability statements about hypotheses given the data. The Bayesian interpretation of probability as a measure of belief is unfalsifiable — it is not a verifiable or factual statement about the actual parameter, the hypothesis, nor the experiment. It is a statement about the experimenter. Who can claim to know the experimenter’s belief better than the experimenter? The experimenter can never be wrong no matter what he believes.

        Only if there exists a real-life mechanism by which we can sample values of theta can a probability distribution for theta be verified. In such settings probability statements about theta would have a purely frequentist interpretation. If the prior distribution is chosen in such a way that the posterior is dominated by the likelihood or is proportional to the likelihood, Bayesian belief is more objectively viewed as confidence based on frequency probability of the experiment. It is more appropriate to view the posterior as an approximate p-value function.

        • Geoff:

          That’s kinda funny—you’re expressing an opinion (that Bayesian models are unfalsifiable) and saying “it is not a matter of opinion.” I guess that’s your opinion that it’s not a matter of opinion.

          Meanwhile, I’ve been checking and falsifying Bayesian models for over 30 years!

          You can use the word “belief” all you want . . . whatever. The Bayesian’s prior distribution is a mathematical model, just as the classical statistician’s logistic regression is a model. The Bayesian “believes” in the prior in the same way that R. A. Fisher or whoever “believed” in logistic regression. I think the term “assumption” is more helpful than “belief.”

          Beyond this, I recommend you read my two articles linked in the above comment, along with chapter 6 of Bayesian Data Analysis.

        • Geoff, one thing going on here is you are basically pointing out a feature of how probability works, and then acting like it is a bug. “The experimenter can never be wrong no matter what he believes”. This is trivially true in the sense that, with a probability model, so long as you did not ascribe probability zero to some event A happening, when and if A happens you are never categorically “wrong”. Think of all the hand-wringing about whether Nate Silver was “wrong” when he said that HRC had a 74% chance of winning or whatever in 2016. Although the assumptions and predictions should be checked and vetted, much of that discourse is in the category of “not even wrong”.

          But the thing is that the Bayesian mechanics give us a way of updating based on data. So your posterior will converge to the “unknown fixed true” parameter provided you turn the machinery correctly. This does not require that we “verify” a probability distribution for theta, whatever that means. You can see this for yourself by estimating say the mean of a group or the difference between two groups with either a normal or an exponential or a gamma likelihood, provided the data are positive. The data do not have to “look like” any particular distribution, nor do your prior models on the parameters. Seriously, go do this – it is deeply instructive to intuition about what this whole business is all about.
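
          As a rough sketch of that exercise (fixing the gamma shape and plugging in the sample standard deviation for the normal model are simplifications made here for brevity):

          # Sketch: positive, skewed data analyzed with both a normal and a gamma
          # likelihood for the mean, on a grid with a flat prior. Both posteriors
          # land on essentially the same answer even though the data "look like"
          # neither model. (Fixed gamma shape and plug-in sigma are simplifications.)
          import numpy as np
          from scipy.stats import norm, gamma

          rng = np.random.default_rng(3)
          x = rng.exponential(scale=3.0, size=200)   # true mean 3, very skewed data

          mu_grid = np.linspace(0.5, 8, 2000)
          sigma = x.std()                            # plug-in sd for the normal model
          shape = 2.0                                # fixed gamma shape (assumption)

          loglik_normal = norm.logpdf(x[:, None], mu_grid, sigma).sum(axis=0)
          loglik_gamma = gamma.logpdf(x[:, None], shape, scale=mu_grid / shape).sum(axis=0)

          for label, loglik in [("normal likelihood", loglik_normal),
                                ("gamma likelihood", loglik_gamma)]:
              post = np.exp(loglik - loglik.max())   # flat prior on the mean
              post /= post.sum()
              print(f"{label}: posterior mean of mu ~ {(post * mu_grid).sum():.2f} "
                    f"(sample mean {x.mean():.2f})")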

          And as Andrew likes to point out, there is nothing stopping us from stepping back from our probability models – which are just mathematical machinery and assumptions – and adopting a hypothetico-deductive approach toward the whole business. We can and do in practice discard, modify, extend and/or embed our models in larger classes of models.

          Anyway, I think you are punching at shadows here.

        • It turns out, even more so, that we can’t verify frequentist models by collecting data either. At least, if you want to verify, say, that a normal distribution fits, you will need to collect vastly more data than we actually collect. For example, someone collects, say, 20 data points in each category and does a t-test. How do we know that the process they are collecting from is not, say, a mixture of 99% normal distribution and 1% Cauchy, leading to there being no average to t-test?

          So we collect 200 data points… But it could be 0.1% Cauchy…. Better collect 1000 data points, but it could be .001% Cauchy… Let’s just collect a trillion data points… Still…
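
          A rough sketch of why that contamination is usually invisible at realistic sample sizes, using a Shapiro-Wilk test as a stand-in for “checking that a normal distribution fits”:

          # Sketch: samples of size 20 from a 99% normal / 1% Cauchy mixture usually
          # pass a normality check (the rejection rate is only modestly higher than
          # for pure normal samples), even though the mixture has no finite mean.
          import numpy as np
          from scipy.stats import shapiro

          rng = np.random.default_rng(4)
          n_sims, n = 2000, 20

          def rejection_rate(sampler):
              rejections = 0
              for _ in range(n_sims):
                  stat, p = shapiro(sampler())
                  if p < 0.05:
                      rejections += 1
              return rejections / n_sims

          pure = lambda: rng.normal(0, 1, n)
          mixed = lambda: np.where(rng.random(n) < 0.01,
                                   rng.standard_cauchy(n), rng.normal(0, 1, n))

          print(f"Shapiro-Wilk rejection rate, pure normal:   {rejection_rate(pure):.1%}")
          print(f"Shapiro-Wilk rejection rate, 1% Cauchy mix: {rejection_rate(mixed):.1%}")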

        • Chris, Daniel:

          I think what’s going on is that some people grew up (in an intellectual sense) in an anti-Bayesian environment, where they were fed a consistent message that Bayesian inference is a big joke. Then they go out into the larger world and reveal their views, not even realizing how clueless they are. Arguments such as yours can be helpful, but I’m guessing that for people like Geoff to make progress they first have to step back and consider that they might be working within a decades-out-of-date understanding of what Bayesian statistics is. This discussion might help.

        • I’m sorry but interpretation matters. To act enlightened while speculating about my statistical upbringing and suggesting I need progress is offensive. Additionally, to suggest that a Bayesian interpretation is different but equally valid to any other interpretation is misleading. My point is, when we make a statement like, “there is 11.9% posterior probability that theta is less than or equal to 6,” what does this mean?

          It is not an opinion to state that belief/opinion is unfalsifiable. Any elementary textbook on logic should make this clear. When you choose to interpret a Bayesian prior distribution as a model assumption and not as subjective belief, you must use either the second or the third interpretation I provided above when making inference on the unknown fixed true theta under investigation. (If you have another I would be happy to hear it.) Both interpretations are untenable.

          The second interpretation leads to a contradiction because of a change in sampling frame – the unknown fixed true theta is claimed to have been sampled from both the prior and the posterior. Additionally, the assumed prior distribution can only be verified if there is a real-life mechanism by which we can sample values of theta. The third interpretation leads to a contradiction because of a reversal of cause and effect – the population-level parameter depends on the evidence observed, but the evidence observed depended on the parameter.

          Either contradiction nullifies the Bayesian inference machine as a valid form of inference. Bayesian math is elegant, but interpretation matters. Just because someone can push through a set of operations does not mean that what comes out the other side is “right.” Math without interpretation is meaningless. Only if the math behind Bayesian inference is viewed as an approximate frequentist meta-analytic testing procedure is it tenable. The prior distribution is a user-defined weight function for smoothing and reshaping the likelihood, producing approximate p-values. The prior is not a legitimate probability distribution for theta. If we instead interpret probability as measuring belief then we can claim the prior is a “probability” distribution, but this is not a verifiable statement about the actual parameter, the hypothesis, nor the experiment. It is at best a statement about the experimenter.

          To Daniel’s point about verifying a frequentist model, we can only bring evidence against or in support of a model. We can never prove that it is right or wrong. The collected data serve as this verifying evidence, and yes the more data points sampled the better. In practice there is no legitimate prior distribution from which we can sample theta’s to verify the assumed prior is correct. Your argument does not serve you well since a Bayesian model must also specify a probability model for the data generative process.

          To Chris’s point about convergence, you are relying on the long-run and asymptotic performance of the likelihood. There is nothing about the Bayesian paradigm that facilitates this.

        • turtles all the way down dude

          in the meantime we use bayesian (and nonbayesian) methods to solve real problems

          nothing special about bayes, it’s just one way to go

          everything can be tested

          no need to be offended, i’m just trying to help, and of course you need progress, we all need progress

        • > “there is 11.9% posterior probability that theta is less than or equal to 6,” what does this mean?

          The best logical interpretation of this is:

          “If you assume that credibility statement ABC is true, and you observe data DEF, then you must also believe that credibility statement ‘there is 11.9% credibility that theta is less than or equal to 6’ is true”.

          This is a verifiable claim; it’s a claim about a calculation, of the same form as “if 1+1 = 2 and addition works like such and such, then you must also believe that 2+2 = 4”.

        • Note how it has the same logical structure as the Frequentist statement “if you assume the world is a Normal Random Number Generator and you observe data D then you would only see t(D) greater than 3 at most 3% of the time that you observe a data set of the same size as D”

          There is no assumption free inference.

        • It is interesting that your assertions of “untenability” have not been previously understood by any of the luminaries of Bayesian inference: from Laplace down through Savage, Lindley, De Finetti, now folks like Andrew Gelman, Aki Vehtari, Persi Diaconis etc. It really is remarkable!

          Rather than re-hash how all of these methods for *doing real modeling work* rely on assumptions, which, taken too literally, are provably false in some way, I’d like to understand more why you assert:

          “To Chris’s point about convergence, you are relying on the long-run and asymptotic performance of the likelihood. There is nothing about the Bayesian paradigm that facilitates this.”

          This is factually incorrect. There are large-sample theorems and asymptotics establishing when and under what conditions the Bayesian posterior has this behavior; see, e.g., BDA3 (I think it’s Chapter 4) to get started. Mainly, your prior must not assign the “true fixed unknown parameter value” zero probability, as happens when, e.g., you impose a positive-only model on a parameter that could plausibly be negative. Another pathway you might find interesting is to study the Representation Theorem, if you want to see more formally how to relate frequency properties in data to prior distributions on parameters, what exchangeability means, etc.
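
          A small sketch of that measure-zero caveat (known sigma, grid posterior, illustrative numbers): a prior that merely down-weights the truth is eventually overwhelmed by data, while a prior that excludes it entirely never recovers.

          # Sketch: the true mean is -0.5. A normal(0, 10) prior lets the posterior
          # concentrate on the truth as n grows; a positive-only prior (which gives
          # the truth zero probability) just piles up at the boundary forever.
          import numpy as np
          from scipy.stats import norm

          rng = np.random.default_rng(5)
          true_mu, sigma = -0.5, 1.0
          mu_grid = np.linspace(-3, 3, 4001)

          priors = {
              "normal(0, 10) prior": norm.logpdf(mu_grid, 0, 10),
              "positive-only prior": np.where(mu_grid > 0, 0.0, -np.inf),
          }

          for n in (50, 500, 5000):
              xbar = rng.normal(true_mu, sigma, n).mean()
              loglik = -n * (mu_grid - xbar) ** 2 / (2 * sigma ** 2)   # known-sigma likelihood
              pieces = []
              for label, logprior in priors.items():
                  logpost = loglik + logprior
                  post = np.exp(logpost - logpost.max())
                  post /= post.sum()
                  pieces.append(f"{label}: posterior mean {np.sum(post * mu_grid):+.3f}")
              print(f"n = {n:4d} | " + " | ".join(pieces))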

        • Do you really think it’s meaningless if, say, a doctor tells you she’s ~90+% confident you have some disease based on test results, base rates, past experience, some reasonable simplifying assumptions, etc? Is it meaningless to say that the Cohen’s d effect size of a psychological study is probably not greater than 5, so it’s a good idea to penalize larger values in our estimation?
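
          For what it’s worth, that kind of “~90% confident” number is just Bayes’ rule applied to plausible inputs; all three numbers below are made up for illustration:

          # Sketch: Bayes' rule for a diagnostic test, with hypothetical numbers.
          prevalence = 0.30      # base rate in patients like this one (hypothetical)
          sensitivity = 0.95     # P(test positive | disease) (hypothetical)
          specificity = 0.97     # P(test negative | no disease) (hypothetical)

          p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
          p_disease_given_positive = sensitivity * prevalence / p_positive
          print(f"P(disease | positive test) = {p_disease_given_positive:.2f}")   # about 0.93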

          It’s worth noting that even extreme subjectivists like de Finetti thought that conclusions should be evaluated (this is a point missing from Gelman’s excellent paper). We know a lot more than him about *how* to evaluate results, but the idea that evaluation is important isn’t new. It’s also not incoherent: de Finetti thought Bayesian probabilistic models were *tools* for making predictions and some tools are better than others. This is a good summary of his view:

          > “The idea of improving probability evaluation is far from being alien to the subjective viewpoint, in spite of the widespread misunderstanding whereby subjectivism is some sort of an “anything goes” approach, according to which in no way can probability judgments be judged right or wrong, once coherence is satisfied. Such a misunderstanding has obviously been suggested by de Finetti’s assertion that “probability does not exist” printed in capital letters in the preface to the English edition of Teoria delle probabilità. De Finetti’s claim has been taken to mean that probability evaluations, as the expression of personal feelings, can be made without taking into account empirical evidence. On the contrary, the assertion was meant to reject the metaphysical tenet that probability is an objective feature of phenomena. However, while objecting to metaphysics, de Finetti firmly insisted that empirical evidence is an essential ingredient of probability assessment that must not be neglected. He makes it clear that the evaluation of probability “depends on two components: (1) the objective component, consisting of the evidence of known data and facts; and (2) the subjective component, consisting of the opinion concerning unknown facts based on known evidence” (de Finetti 1974: 7). The effectiveness of subjective probability as a tool for prediction can and must be assessed, and the criterion for it is the success of forecasts. As de Finetti puts it: “though maintaining the subjectivist idea that no fact can prove or disprove belief I find no difficulty in admitting that any form of comparison between probability evaluations (of myself, of other people) and actual events may be an element influencing my further judgment, of the same status as any other kind of information.” (de Finetti 1962: 360). In other words, de Finetti took very seriously the problem of objectivity, or the “goodness” of probability evaluations, and actively worked on the topic, partly in collaboration with Jimmie Savage, pioneering a thriving field of research.”
          https://journals.openedition.org/ejpap/1509

        • All that matters is performing impressive feats and making surprising predictions. It has been the same since Archimedes at least.

          There is no need for writing so many paragraphs about this unless you want people to listen to you or give you funding without doing any of that.

        • The other reading of Geoff’s comment about the likelihood is that he is unaware that there is a Bayesian derivation/motivation of the likelihood function as part of the joint distribution. Ironically, certain strands of Bayesians are the most purist (wrongly so IMO) about the “Likelihood principle”, that data shall only enter the analysis via the likelihood. IIRC, one of Lindley’s more acerbic remarks was a rejoinder to Brad Efron writing about bootstrapping where he said something like “No wonder he rejects the Likelihood Principle because he doesn’t have a likelihood about which to have a principle”.
          In fact, I have always found it easiest and most useful to understand Maximum Likelihood as approximate Bayes, rather than Bayes as Likelihood + Priors. Sure, there are Frequentist ways to motivate Max Likelihood, but they are not any more assumption-free than Bayesian ones…
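
          A tiny sketch of the “maximum likelihood as approximate Bayes” view, in about the simplest possible case (a binomial proportion with a flat Beta(1, 1) prior, hypothetical counts):

          # Sketch: with a flat Beta(1, 1) prior, the posterior mode equals the MLE,
          # and for decent n the posterior spread matches the classical standard error.
          import numpy as np
          from scipy.stats import beta

          k, n = 37, 120                      # hypothetical data: 37 successes in 120 trials
          mle = k / n
          classical_se = np.sqrt(mle * (1 - mle) / n)

          posterior = beta(k + 1, n - k + 1)  # flat prior + binomial likelihood
          posterior_mode = k / n              # mode of Beta(k + 1, n - k + 1)

          print(f"MLE {mle:.3f} vs posterior mode {posterior_mode:.3f}")
          print(f"classical SE {classical_se:.4f} vs posterior sd {posterior.std():.4f}")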

  6. This whole discussion is a bit weird to me. Isn’t p-hacking more typically about changing your analysis methodology to achieve that fabled p<0.05?

    On the face of it, *any* statistical advice that advises against obtaining more data is deeply unwise. Perhaps the introduction of "p-hacking" into this scenario was itself rather pointless – consider for example that if Fisher advised the researcher instead to throw out the existing data and do a new experiment, hoping to get p<0.05 that time… now that would be illegitimate, right?

    As long as the experimenter is collecting new data and analysing it together with the old data, they should be somewhat safe from a probabilistic POV.
