The problems with p-values are not just with p-values.

From 2016 but still worth saying:

Ultimately the problem is not with p-values but with null-hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B. Whenever this sort of reasoning is being done, the problems discussed above will arise. Confidence intervals, credible intervals, Bayes factors, cross-validation: you name the method, it can and will be twisted, even if inadvertently, to create the appearance of strong evidence where none exists.

I put much of the blame on statistical education, for two reasons:

First, in our courses and textbooks (my own included), we tend to take the “dataset” and even the statistical model as given, reducing statistics to a mathematical or computational problem of inference and encouraging students and practitioners to think of their data as given. . . .

Second, it seems to me that statistics is often sold as a sort of alchemy that transmutes randomness into certainty, an “uncertainty laundering” that begins with data and concludes with success as measured by statistical significance. Again, I do not exempt my own books from this criticism: we present neatly packaged analyses with clear conclusions. This is what is expected—demanded—of subject-matter journals. . . .

If researchers have been trained with the expectation that they will get statistical significance if they work hard and play by the rules, if granting agencies demand power analyses in which researchers must claim 80% certainty that they will attain statistical significance, and if that threshold is required for publication, it is no surprise that researchers will routinely satisfy this criterion, and publish, and publish, and publish, even in the absence of any real effects, or in the context of effects that are so variable as to be undetectable in the studies that are being conducted.

In summary:

I agree with most of the ASA’s statement on p-values but I feel that the problems are deeper, and that the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

23 thoughts on “The problems with p-values are not just with p-values.”

  1. People need to come up with some theory about what is going on and derive a quantitative model from it.

    Just look at the data in Fig 2 or Fig S2 in the previous colonoscopy post. Everyone is worried about ITT vs per-protocol when it isn’t even relevant if you treat the problem scientifically.

    Obviously cancers are forming and being detected at whatever rates. First, which of these (if either) is the rate limiting step to an eventual diagnosis?

    Either way, the combined rate looks to be approximately constant in the absence of screening, but those in the “no screening” group may have begun deviating from “non-participants” about year 5 (Fig S2).

    This is interesting because surveillance protocol changed from start-2010 vs 2011-2014 (page S19), which may reflect some important behavioral difference between usual care and those who declined screening.

    Then we see a strikingly different curve in the screened group. The cancer diagnoses increase very fast for two years then level off to a rate much less than for unscreened patients. We can devise a number of possible explanations for this.

    Maybe the screening detected cancers early that would have otherwise been diagnosed at the usual rate.

    Maybe the screening is finding cancers that would have never been diagnosed otherwise.

    Maybe choosing to get screened is related to problematic symptoms.

    Maybe screening can damage the colon and cause cancer.

    Maybe some combination of the above. Obviously the first thing to look at is rate of screening, what was the distribution of days since randomization that screenings were performed? When you add this to the usual rate of diagnoses does it match the shape of the curve?

    The point of science is to try figuring out what is going on. Then once you think you have a good explanation, you test it on new data. It is not difficult at all to come up with multiple quantitative post-hoc explanations that fit just as well.

    Just seeing if group A is higher/lower than group B is a waste of resources.

    • If only everyone was as smart as you, no trials would be necessary and no analysis. I’m sorry but I think screening for colon cancer is a big deal and the evidence from the study just discussed tells us quite a bit. No, it doesn’t answer all questions nor are any of the analyses perfect. But, as you often do, you denigrate almost everything anyone says in favor of your unique insights. As I said, if only everyone was as smart as you…

      • We recently emerged from a thousand years where the most intelligent and educated members of society spent their time arguing about theological questions we find totally irrelevant today. Now people have already come up with new pointless things to argue about like this per-protocol vs ITT issue.

        The problem I’m pointing out has nothing to do with intelligence. People just need to work out what they think may explain the observations, then when they find a few that fit compare them to new/different data.

        Like apparently without screening this population gets diagnosed with colon cancer at a near constant rate of ~0.1% per year. Why is it around that value?

  2. “we tend to take the “dataset” […] as given”

    I don’t understand this argument. So the theory you made up is the one and true path to understanding reality because data is untrustworthy? In my work, I was expected to make sense of the data that was measured from the system of interest. It was never impossible that the data was unrepresentative, but I don’t remember that ever happening. (The data was not survey answers from undergraduates.)

    The engineering paper posted awhile back by Gelman senior touches on the subject of the fidelity hierarchy. Measured data always takes precedence over theory and math model output. Specifically, if your theory-based math model output disagrees substantially with your measured results, you do not go out and replace the flowmeters because the data must be wrong, you troubleshoot the math model.

    I do realize that when you have good reasons to believe the data is untrustworthy, things are not so simple, but I think these cases are relatively rare and not a part of any epistemology related to fidelity. Yet I do occasionally see scientists write that since the data does not agree with their model the data must be rejected.

    I think a scientist should always begin by assuming the data is accurate but be willing to do the heavy lifting of figuring out whether that assumption is valid somewhere along the way.

    • I think it does depend on how much you trust your theory/mathematical model, and that it’s often quite useful to go back and check that the data was collected properly and/or corresponds to the model analog in the way you think it does.

      The “FTL neutrinos” paper was an extreme example of a case where, if the experimental data disagrees with the model, the data is probably wrong.

      Even without strong, trustworthy models, it’s hard to know what to measure or what those measurements mean without some (possibly implicit) theory.

      Especially in social sciences, it’s unclear what the empirical analog of the model construct is: which measure of capital stock or inflation or unemployment or labor market tightness should we use? Should we use self-reported race or some measure of perceived race? Are Facebook friends really what we mean by social networks? What is Wells Fargo’s share of the deposit market, and what are the relevant boundaries of the market anyway? Are people interpreting our survey questions in the way we think they are?

      I don’t know of any way to think about those problems except through the lens of some model, and weird results often point to problems with how we’ve measured things.

      • I can’t resist this. I had a couple of degrees of separation from the OPERA team that initially claimed that a few neutrinos had broken the universe’s speed limit, and I blogged on it back when it was happening. At the end of my post I wrote:

        “Note that these neutrinos made their trip through the rugged terrain of the Alps and Apennines at an “impossible” speed. I think they should be checked for doping, and if they turn up positive they should be disqualified.”

    • This is not the way scientists have behaved in the history of science, and I don’t really see a reason for why data should be elevated over theory on initial appraisal. See Einstein dismissing Dayton C Miller’s Aether Drift as a thermal artifact (which was correct) or Mendeleev’s periodic table of elements or any other examples that Lakatos gives.

      Facts / data collectively control theories over the long-haul but taken singly scientists have always allowed theories to control the facts. Larry Laudan discusses this extensively in his book “Progress and Its Problems” where he illustrates a number of different factors that scientists look at when appraising theories and facts (i.e. how entrenched is an existing theory). Heck if you subscribe to the Kuhnian idea that all facts are theory laden, it becomes impossible to elevate one over the other; I don’t quite believe this view as a universal, but it is true in many areas, and points towards not having rigid stances of always initially taking facts over theory or vice-versa (which is something I do agree with).

    • Jonathan, Allan:

      Theory is always needed, if for no other reason than to connect observed data to underlying constructs of interest. Much of the bad research we’ve criticized over the years has had the form of, “We have used a low p-value [or Bayes factor] to demonstrate a relation between X and Y,” where X and Y are extremely noisy measures of whatever is purportedly being studied. An extreme example was the sentence, “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,” in a paper in which there was no actual measure of power. Another example was the paper that referred to “long-term” effects but “long term” was only 3 days.

  3. Matt:

    No, I don’t think “the theory I made up is the one and true path to understanding reality.” Not at all! First, I don’t think there’s any one path to understanding reality. Second, of course I don’t think a theory that somebody “made up” is a “true path”!

    When I wrote, “we tend to take the ‘dataset’ and even the statistical model as given, reducing statistics to a mathematical or computational problem of inference and encouraging students and practitioners to think of their data as given,” my point was not that data are untrustworthy. My point was that measurement and data quality are important. What I’m pushing against is the idea that a researcher can just take some pile of numbers without asking what exactly is being measured, and then take those numbers and compute a p-value or Bayes factor or whatever and use that to draw a scientific conclusion.

    Here’s the rest of my paragraph from the above-linked article:

    Even when we discuss the design of surveys and experiments, we typically focus on the choice of sample size, not on the importance of valid and reliable measurements. The result is often an attitude that any measurement will do, and a blind quest for statistical significance.

  4. I think Andrew’s point about data is just the tip of the iceberg. It’s not just that measurement can be biased or present error in some general way. Measurement issues are often intertwined with the variables you want to analyze, and patterns that appear in the analysis can be used diagnostically to get at them. To put it differently, there may not be a clean separation between inference and measurement questions.

  5. I think part of the reason this problem is so stubborn is that it’s relatively easy to spot logical problems but much harder to address scientific judgment and the roles that different tools end up taking on in the wild (e.g., p-values as publication criteria). This is more or less the same tension that created the problem in the first place – it’s straightforward (if tedious) to explain the hypothesis testing framework and to work out the math in a lot of examples and to test whether students follow along, but it’s much harder to address scientific judgment without endless caveats that leave students unsure that they’ve learned anything real.

    Backing up a bit, a reasonably honest p-value statement would be something like, “as specified, this model does (or does not) find statistically significant evidence that the effect is positive (or greater than 1.2 or whatever) in the available data.” This kind of parsimonious statement isn’t super useful because it leaves wide open contingencies for model mis-specification and data problems. The only advantage of the clunker of a sentence is that it’s all there really is. The trouble starts when people strip away the reminders about the model specification and available data and run with the gross overstatement, “the researchers found statistically significant evidence that the effect is positive.”

    I think that’s the core rub: Careful scientific communication is often intentionally stingy in drawing conclusions or extrapolating findings beyond the data in hand, so there’s a vacuum for a green light to move from safe, sterile, analysis outputs to meaningful inferences. If p-values didn’t fill that void what would, and what other pathologies might spring from the alternative? I don’t think there’s a math or logic answer to this. It’s more of a process thing.

    I can see why Andrew presses for open access to data, method pre-registration, journal support for open scrutiny (and sometimes retraction) of published papers, and thicker skins all-around: There is no green light, and there never will be. Your results hold up if they hold up, and researchers should manage their reputations accordingly.

    • > it’s relatively easy to spot logical problems … it’s straightforward
      > (if tedious) to explain the hypothesis testing framework

      But, the hypothesis testing framework is illogical. Andrew said as much: “the problem is not with p-values but with null-hypothesis significance testing”.

      Andrew also wrote, “I put much of the blame on statistical education”. Indeed. I once talked to a professor who was a Bayesian. I asked what he taught in his classes. He said he taught frequentist statistics because that was what he was expected to teach. At least he was a bit embarrassed to admit this.

    • Josh —
      What are you talking about? Based on your post, I’m not sure that you have any idea what a p-value is or represents. If you actually understood what a p-value represents, then you would not have written your second paragraph as you did.

      • Richard –
        Drive-by insults on the internet are poor form. Perhaps you could take the time to be more specific and constructive.

        In the spirit of the original post, I’m focusing on the way p-values, and the hypothesis testing framework more broadly, play a big role in uncertainty laundering (and on how that function fills what many people seem to perceive as a void). In the end, I agree very much in spirit with the original post’s conclusion that we’d be better off grappling with the reality of uncertainty than looking for a tidy math solution to what amounts to a cultural problem.

        My “honest statement” focused on the part of the hypothesis testing framework that’s most relevant to my point and to the original post. Are you annoyed that I didn’t mention a quantified confidence level, or that I didn’t speak to the stilted logic of hypothesis testing, or that I don’t bother separating the p-value from the hypothesis test it implicitly addresses, or something else?

        • Josh —
          I apologize for appearing to insult you. Your second paragraph suggests that low p-values provide “statistically significant evidence” in favor of some specific hypothesis. My issue with this is that low p-values do not provide support for alternatives to some hypothesis considered as null. Specifically, in most settings a p-value measures the probability that the empirical result observed in the data would occur if the null hypothesis and all model assumptions are actually “true.” Please have a look at Sander Greenland’s voluminous and terrific body of work in this area.

        • > Your second paragraph suggests that low p-values provide “statistically significant evidence” in favor of some specific hypothesis.

          I would say that “evidence that the effect is positive” is as non-specific as an alternative hypothesis to the null hypothesis “zero or negative effect” can be.

        • Richard-
          No harm done. But I would like to dig into this a bit more. As background, I work in industry now, where the state of statistical knowledge is at a level where statements like “R-squared is greater than 0.75, so the regression is statistically significant [and therefore can be trusted]” run rampant. I’ve long labored to clean up the messaging and the thinking in my little corner of the world, but I have to meet people where they are (e.g., armed with whatever statistics their universities taught them) and with whatever bandwidth they can spare now (e.g., mid-career with school-age children). The main upshot of this background for my present purpose is that I try very hard to craft statements for people that are both *useful* and *correct*.

          I agree that your statement, “p-value measures the probability that the empirical result observed in the data would occur [under the null hypothesis]” is very literally correct, but I argue that it is too awkward to be useful to anyone seeking to make an inference about the world beyond the existing data. (I’ve dropped the caveat about model correctness because we already agree on that.)

          Basically, I am translating “low probability that the empirical result observed in the data would occur [under the null hypothesis]” to “statistically significant evidence that the null hypothesis is false”. I assume that’s not objectionable (I mean, what else could “statistically significant” possibly mean in plain English?).

          If that’s not objectionable, try this one: I want to say that “low probability that the empirical result observed in the data would occur under the null hypothesis” is also (tautologically) equivalent to “statistically significant evidence in favor of the alternative hypothesis”. I would be very curious to know if you (or others) object to this colloquialism or to the previous one.

          I admit that last statement feels weird to write, I think because it kinda rhymes with the logical error people make when they conflate lack of statistically significant evidence of X with statistically significant evidence of not(X). But that seems like a totally separate topic.

        • I don’t like the second quoted ‘equivalent’ statement. While it is true, it is often mistaken for supporting somebody’s favorite alternative hypothesis. If you are testing a null hypothesis that 2 means are equal, and you reject it you are saying your evidence is in favor of them not being equal, but naturally it says nothing about why they are not equal. Usually, there is a lurking hypothesis about why that is the case, but simply rejecting the null of equality does nothing to test the particular hypothesis about why they are unequal. Since that mistake is so commonly made (I’m not saying that you make that mistake, only that many people do), I’d prefer to not use the wording in your second statement.

        • I don’t think there is any way to say this without it being misunderstood. Instead you should turn it into a Bayesian analysis. Many frequentist statements agree with Bayesian statements using priors that are often reasonable. Of course, if yours doesn’t, then you are out of luck. In other words, even though the derivation is wrong, the result might be correct. But, you won’t know until you do the correct derivation.
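
One textbook case of the agreement mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: a normally distributed estimate with known standard error, a flat prior on the effect, and made-up numbers (`estimate` and `se` are hypothetical, not from any study discussed here). Under those assumptions the one-sided p-value for the null "effect is zero" equals the posterior probability that the effect is zero or negative.

```python
from statistics import NormalDist

# Hypothetical numbers, chosen only for illustration.
estimate = 1.2  # observed effect estimate
se = 0.5        # standard error, assumed known

# Frequentist side: one-sided p-value, P(estimate >= observed | true effect = 0).
z = estimate / se
p_one_sided = 1.0 - NormalDist().cdf(z)

# Bayesian side: with a flat prior, the posterior for the effect is
# normal(estimate, se), so P(effect <= 0 | data) is its CDF at zero.
posterior = NormalDist(mu=estimate, sigma=se)
post_prob_nonpositive = posterior.cdf(0.0)

print(f"one-sided p-value:           {p_one_sided:.6f}")
print(f"posterior P(effect <= 0):    {post_prob_nonpositive:.6f}")
```

The two numbers coincide only because of the flat prior and the normal model. With an informative prior the posterior probability would generally differ from the p-value, which is the "out of luck" case.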

  6. In my opinion, there is a role for p-values, but it is to answer the question: “Could this result have occurred by chance?” The only two allowed answers are:
    – Yes!
    – Possibly not?

    I’m reminded of a student I once supervised for his B.Sc thesis. He was trying to model the dropout rate of first year students. I’d encouraged him to do a data exploration first, write it up and send it to me (Surprisingly he did! So many students wait until the very end to write everything up!) I was suddenly reading things like “Men are more likely to drop out than women” but when I looked at his graphs of the data I was seeing extremely tiny differences (less than 1% on a quite small dataset).

    So I told him I suspected these differences were due to chance, and he could test that. Second version he sent me read “There is no significant difference in the dropout rate between men and women (p = 0.8).” (Or something like that, I’m pulling the numbers out of thin air and translating from Dutch to EN here…)

    Which was a good early warning for him that, with the data he had, his conclusion might read “there is nothing to conclude here!” Which came as no surprise to anyone except the guy at HR who thought he might be able to somehow save on teaching costs by getting a student to predict how many teaching hours were needed for the 2nd year. (Hint: about the same as last year!)

    • Unfortunately there’s no such thing as “by chance”. There are an infinity of possible random number generators that could be considered as “by chance”. For example, suppose you think “by chance” means “normally distributed with mean 0”. Then you get the data 1.0, 2.0, 1.2, 1.4, 2.3, 2.1 …. well you’d get negative numbers 50% of the time from normal with mean 0, so the probability of getting all positive numbers is 1/2^6 ~ .016 so this is unlikely to be “by chance” according to your model.

      But what if your model is “by chance” means exponential with mean 1.5? All of those numbers are very reasonable numbers to get from an exponential with mean 1.5 random number generator…

      The problem is, there’s no such thing as “by chance” there’s only “by some specific random number generator H”. A hypothesis test is really a test of compatibility with having come from a given specific random number generator (or family of generators with a range of parameter values).

      People often say “this data is not compatible with the null hypothesis of zero mean so the mean must be nonzero”… But the real hypothesis isn’t “of zero mean” it’s always something like “from the distribution with normal(0,1) density” or similar.
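
The comparison above can be checked with a short sketch. The data and the two generators come from the comment. The choice of test statistics (the event "all six values are positive" and the sum of the observations) is my own assumption for illustration, as is the helper name `erlang_sf`.

```python
import math

data = [1.0, 2.0, 1.2, 1.4, 2.3, 2.1]
n = len(data)

# Generator A: normal with mean 0. The distribution is symmetric about 0,
# so each draw is positive with probability 1/2, whatever the standard deviation.
p_all_positive_normal = 0.5 ** n

# Generator B: exponential with mean 1.5. Every draw is positive.
p_all_positive_expon = 1.0


def erlang_sf(total, k, mean):
    """P(sum of k iid exponential draws with the given mean >= total)."""
    x = total / mean
    return math.exp(-x) * sum(x ** i / math.factorial(i) for i in range(k))


# Second statistic under generator B: how unusual is a sum this large?
p_sum_expon = erlang_sf(sum(data), n, 1.5)

print(f"P(all positive | normal, mean 0):         {p_all_positive_normal:.4f}")
print(f"P(all positive | exponential, mean 1.5):  {p_all_positive_expon:.4f}")
print(f"P(sum >= observed | exponential, mean 1.5): {p_sum_expon:.4f}")
```

The same six numbers look surprising under one generator and unremarkable under the other, so "by chance" has no meaning until a specific generator (and a test statistic) has been named.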

    • p-values do not answer the question “Could this result have occurred by chance?” Can a N(0,1) produce a value of a trillion? Sure. The support of the distribution is the whole real line. So, what you really want to know is the probability that it produced that value. But, to determine that, you need to consider what the alternatives are. If you know Bayesian statistics, then you know how to proceed.

      Frequentist statistics has mistakenly taught people that they don’t need to consider alternatives when doing statistics.

  7. Josh —
    Please have a look at Sander Greenland’s important papers. He specifically deals with your basic question of how to present statistical results. Sander, among many other contributors to this blog and myself, as well, does not support classical null hypothesis significance testing. His excellent body of published work (including his postings in this blog) should give you a pretty good idea of what to do.
