The Notorious N.H.S.T. presents: Mo P-values Mo Problems

A recent discussion between commenters Question and Fernando captured one of the recurrent themes here from the past year.

Question: The problem is simple: the researchers are disproving always false null hypotheses and taking this disproof as near proof that their theory is correct.

Fernando: Whereas it is probably true that researchers misuse NHT, the problem with tabloid science is broader and deeper. It is systemic.

Question: I do not see how anything can be deeper than replacing careful description, prediction, falsification, and independent replication with dynamite plots, p-values, affirming the consequent, and peer review. From my own experience I am confident in saying that confusion caused by NHST is at the root of this problem.

Fernando: Incentives? Impact factors? Publish or die? “Interesting” and “new” above quality and reliability, or actually answering a research question, and a silly and unbecoming obsession with being quoted in NYT, etc. . . . Given the incentives something silly is bound to happen. At issue is cause or effect.

At this point I was going to respond in the comments, but I decided to make this a separate post (at the cost of pre-empting yet another scheduled item on the queue), for two reasons:

1. I’m pretty sure that a lot fewer people read the comments than read the posts; and

2. I thought of this great title (see above) and I wanted to use it.

First let’s get Bayes out of the way

Just to start with, none of this is a Bayes vs. non-Bayes battle. I hate those battles, partly because we sometimes end up with the sort of the-enemy-of-my-enemy-is-my-friend reasoning that leads smart, skeptical people who should know better to defend all sorts of bad practices with p-values, just because they (the smart skeptics) are wary of overarching Bayesian arguments. I think Bayesian methods are great, don't get me wrong, but the discussion here has little to do with Bayes. Null hypothesis significance testing can be done in a non-Bayesian way (of course, just see all sorts of theoretical-statistics textbooks) but some Bayesians like to do it too, using Bayes factors and all the rest of that crap to decide whether to accept models of the theta=0 variety. Do it using p-values or Bayes factors, either way it's significance testing with the goal of rejecting models.
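
To see how the two dress up the same move, here is a minimal sketch in Python (simulated data; the prior scale tau is an arbitrary choice for illustration, not a recommendation):

  import numpy as np
  from scipy import stats

  # Simulated example: y_bar estimates theta with known standard error.
  rng = np.random.default_rng(0)
  theta_true, sigma, n = 0.1, 1.0, 50                # small true effect, noisy data
  y_bar = rng.normal(theta_true, sigma, n).mean()
  se = sigma / np.sqrt(n)

  # Classical significance test of theta = 0
  p_value = 2 * stats.norm.sf(abs(y_bar / se))

  # Bayes factor for H0: theta = 0 versus H1: theta ~ N(0, tau^2),
  # using the marginal likelihood of y_bar under each hypothesis.
  tau = 0.5                                          # arbitrary prior scale
  bf_01 = stats.norm.pdf(y_bar, 0, se) / stats.norm.pdf(y_bar, 0, np.sqrt(se**2 + tau**2))

  print(f"p-value against theta=0: {p_value:.3f};  BF for theta=0: {bf_01:.2f}")

Either number can then be thresholded to accept or reject the theta=0 model, which is exactly the move at issue, whatever machinery produced it.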

The Notorious N.H.S.T. as an enabler

I agree with the now-conventional wisdom expressed by the original commenter, that null hypothesis significance testing is generally inappropriate. But I also agree with Fernando's comment that the pressures of publication would lead to the aggressive dissemination of noise in any case. What I think is that the notorious N.H.S.T. is part of the problem: it's a mechanism by which noise can be spread. This relates to my recent discussion with Steven Pinker (not published on the blog yet; it's on the queue, you'll see it in a month or so).

To say it another way, the reason why I go on and on about multiple comparisons is not that I think it’s so important to get correct p-values, but rather that these p-values are being used as the statistical justification for otherwise laughable claims.
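
For readers who want to see the multiple-comparisons point in action, here is a minimal simulation (pure noise, 20 comparisons per study; all settings arbitrary) of how often at least one comparison comes out statistically significant when every null hypothesis is true:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)
  n_sims, n_comparisons, n_per_group = 2_000, 20, 30   # arbitrary settings

  any_significant = 0
  for _ in range(n_sims):
      # Every comparison is pure noise: both groups come from the same distribution.
      a = rng.normal(size=(n_comparisons, n_per_group))
      b = rng.normal(size=(n_comparisons, n_per_group))
      p = stats.ttest_ind(a, b, axis=1).pvalue
      any_significant += (p < 0.05).any()

  # Roughly 1 - 0.95**20, i.e. about 0.64: a publishable "finding" in most pure-noise studies.
  print(f"P(at least one p < 0.05 | all nulls true) = {any_significant / n_sims:.2f}")

And that is before any of the researcher degrees of freedom that go into choosing which comparisons to run in the first place.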

I agree with Fernando that, if it wasn’t N.H.S.T., some other tool would be used to give the stamp of approval on data-based speculations. But null hypothesis testing is what’s being used now, so I think it’s important to continue to point out the confusion between research hypotheses and statistical hypotheses, and the fallacy of, as the commenter put it, “disproving always false null hypotheses and taking this disproof as near proof that their theory is correct.”

P.S. “The aggressive dissemination of noise” . . . I like that.

45 thoughts on “The Notorious N.H.S.T. presents: Mo P-values Mo Problems”

  1. Affirming the consequent! Of course it’s that. I hadn’t thought of it in such stark terms.

    Is it fair to say that, in your view, a better statistical goal than hypothesis testing would be parameter estimation? What advice would you give to a practicing statistician on how to help clients think about their scientific questions in what is going to be a new way for many of them, since so many people have been trained to ask questions that produce (ostensibly) binary answers?

    • Erin: Make sure your parameter estimate is big enough to tell a story and earn your fee

      Tongue in cheek, of course, but suppose we only reported estimates and posterior probability intervals. Will journal editors looking for newsworthy effects have an implicit cutoff? And if they do, will researchers go looking for that marvelous interval shifted a good fuzzy distance from zero?

      Which goes back to the question of why we have the silly standard we have.

      • PS In fact the “fuzzy” cutoff may induce an arms race.

        Suppose we go from a standard that says “a p-value less than 0.05 is necessary for publication” to “a big effect is necessary for publication”.

        In the former you stop hacking once you meet the cutoff (assuming p-values lower than 0.05 have no further effect on the probability of publication).

        In the latter you may continue as insurance, to be better than your putative competitor, or because the derivative of the probability of publication with respect to effect size is always greater than 0. In which case the only constraint might be time, passing the laugh test, etc. . . .

        If we are going to solve these problems we need to do a lot more research on how researchers research: what I and others refer to as “research practice”.
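
        A minimal sketch of the first regime, as a hypothetical simulation rather than anyone's actual procedure: keep adding observations, peek at the p-value each time, and stop the moment the cutoff is met, even though the true effect is exactly zero. Under the big-effect regime the loop below would simply have no stopping rule.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(2)
          n_sims, n_start, n_max = 2_000, 10, 200         # arbitrary settings

          reached_cutoff = 0
          for _ in range(n_sims):
              x = list(rng.normal(size=n_start))          # true effect is exactly zero
              while len(x) < n_max:
                  if stats.ttest_1samp(x, 0).pvalue < 0.05:
                      reached_cutoff += 1                 # stop hacking: cutoff met
                      break
                  x.append(rng.normal())                  # otherwise collect one more observation

          # Far above the nominal 5%; without any stopping rule it would only grow.
          print(f"Null studies reaching p < 0.05 by peeking: {reached_cutoff / n_sims:.2f}")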

        • The sociology of science totally fascinates me. On the other hand I feel like we have a tail-eating problem because I imagine this very same epistemological issue exists in sociology. Perhaps an ethnography of science is more tractable, but we’re still stuck then about how to move from the particular to the general…

        • (Hah! Food for your ego, maybe: I thought, hm, that was an interesting comment, I should follow his link — turns out I already follow you on Twitter.)

      • Fernando: Sometimes you really want to know whether your data is more likely from “signal+noise” than from “noise-alone”.

        I am sympathetic to concern about testing null hypotheses (i.e. signal-rate=0) that one doesn't believe are true to begin with. And determining a credible interval on the signal rate is absolutely necessary. But at the end of the day, I still want to have some quantitative estimate of how likely it is that at least one of those blips from my detector is a real signal.
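
        One way to put a number on that, sketched here with made-up values (the background rate, the observed count, and the prior on the signal rate are all assumptions for illustration): compare the likelihood of the observed count under noise alone with its marginal likelihood under signal plus noise.

          import numpy as np
          from scipy import stats

          background_rate = 3.2        # expected background counts in the bin (assumed known)
          observed = 9                 # counts actually seen in the bin with the blip

          # Likelihood under "noise alone": Poisson with the background rate only.
          like_noise = stats.poisson.pmf(observed, background_rate)

          # Marginal likelihood under "signal + noise", averaging the unknown signal
          # rate over a broad exponential prior (the prior scale is arbitrary).
          s_grid = np.linspace(0, 50, 2001)
          ds = s_grid[1] - s_grid[0]
          prior = stats.expon.pdf(s_grid, scale=5.0)
          like_signal = np.sum(stats.poisson.pmf(observed, background_rate + s_grid) * prior) * ds

          print(f"Likelihood ratio, signal+noise vs noise alone: {like_signal / like_noise:.1f}")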

        • West:

          I am not defending NHST necessarily, though I do find it useful as one aspect of statistical inference (not the only one).

          Rather I am questioning the idea that getting rid of it will get rid of the problems in science. Andrew says it is an enabler. Sure. But to me it is like wagging the dog by the tail. Get rid of NHST and the symptom will appear elsewhere.

        • Fernando: I was for better or worse generically defending the use of hypothesis testing, though I prefer using Bayes factors over the strawman NHST. But I admit to regularly using the latter and struggling with its drawbacks, like the multiple-comparisons problem.

          Let me try to encapsulate the core of your argument. “The community is sufficiently clever that it will devise data analysis procedures that will allow its members to satisfy the problematic structural incentives for researchers. The prohibition of one method of testing will make little impact, despite its ubiquity, because it doesn't address the root of the problem.” If I understand you right, that optimistic cynicism seems reasonable to me.

        • Question: You are of course right that strawman NHST can only challenge the validity of the null. I naively thought the discussion was concerning NHST in general (whatever your probabilistic proclivities) rather than about the strawman specifically. Apologies for not making that explicit.

    • The stock answer of statisticians seems to be to look at the confidence/credible interval – all the parameter values the data have not reduced the possibility [probability, confidence, likelihood, empty gas cans, etc.] of.

      But my experience is that most people are not able to fully process this, so I have been wondering about the use of the counternull (as providing better human information processing, as opposed to optimal information).

      http://en.wikipedia.org/wiki/Counternull
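
      For a symmetric (e.g. normal) sampling distribution the counternull is simply twice the observed effect: a non-zero effect the data support exactly as well as the null of zero. A small illustration with made-up numbers:

        from scipy import stats

        effect, se = 0.30, 0.20                 # made-up estimate and standard error
        counternull = 2 * effect                # symmetric case: counternull = 2 * estimate

        # The data are exactly as consistent with theta = counternull as with theta = 0.
        p_vs_zero = 2 * stats.norm.sf(abs(effect - 0) / se)
        p_vs_counternull = 2 * stats.norm.sf(abs(effect - counternull) / se)
        print(p_vs_zero, p_vs_counternull)      # both 0.134: "not significant", yet far from "no effect"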

        • I was hoping to get that feedback.

          My guess is that folks are taking p > .05 as meaning just noise, to be disregarded.

          Perhaps working with the second loneliest number of hypotheses (2) – the second one being an interesting effect that would lead to data like this just as often as no effect would – might be the best _baby_ step for them.

          (I am thinking of a pre-registered, focused, prior-consistent question with high-quality RCT data and no surprises – the best-case situation.)

    • Erin:

      We used to say that it would be better to report conf intervals instead of p-values. But recently I’ve been thinking that CI’s are not much of a solution. The real problem, I think, is the reporting of inferences that don’t use prior information. I discuss these points here.
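
      As a small illustration of what prior information does here (all the numbers, including the prior scale, are made up for the example): a noisy but nominally significant estimate, first with the usual flat-prior interval and then shrunk by a normal prior saying that effects in this area tend to be small.

        import numpy as np

        estimate, se = 0.55, 0.25               # classical estimate: p ~ 0.03, CI excludes zero
        prior_mean, prior_sd = 0.0, 0.15        # assumption: effects here are usually small

        # Flat-prior ("classical") 95% interval
        ci_flat = (estimate - 1.96 * se, estimate + 1.96 * se)

        # Normal-normal posterior: precision-weighted combination of prior and data
        post_var = 1 / (1 / prior_sd**2 + 1 / se**2)
        post_mean = post_var * (prior_mean / prior_sd**2 + estimate / se**2)
        post_sd = np.sqrt(post_var)
        ci_post = (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)

        print("flat-prior interval:", np.round(ci_flat, 2))   # roughly (0.06, 1.04)
        print("posterior interval: ", np.round(ci_post, 2))   # roughly (-0.11, 0.40): much less dramatic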

      • But there are different ways to include prior information other than directly in the estimating model, right? How about the best IV papers that lay out convincing rhetorical and quantitative arguments for the validity of the instrument, make specific predictions about the endogeneity they are overcoming and the directional bias of the reduced-form estimates, and then show a statistically significant 2SLS estimate?

        We certainly wouldn’t, absent the first steps, believe the result was true in the world just from the p-values. But taken as a whole these kinds of statistical arguments can be very compelling because they effectively harness outside information and relate it to the statistical model. They just don’t do it directly through model specification.

      • It never has felt like much of a solution to me, honestly, though because I backed into stats via social science I have never been sure that that wasn’t just cluelessness on my part. But I think a lot of people presented with 95% CIs will just look at the endpoints and go “is zero included? No? Great!” and not think any more about what the parameter means. Whereas I suspect the useful end of the CI is actually at the far-from-zero side, to give you some idea about whether your finding ought to pass the laugh test.

        Thanks for the link to your paper. I’ll add that to the “after the semester” reading pile.

    • Questioner surely didn't mean “affirming the consequent” in a list of supposedly good science practices; maybe he meant denying the consequent, as in modus tollens.

      • Mayo,

        I meant that affirming the consequent is the error being committed. It is impossible for disproving a strawman to have any use at all other than to lead to this error: “If the drug does something, the treatment group will be different from the control group; the treatment group is different from the control group; therefore the drug does something.” I can see no possible use for disproving a strawman other than to commit this error.

  2. The flipside of the problem is the aggressive dissemination of silence: silence over the posterior distribution (or parameter values supported by the data, whatever one wants to call it). Perhaps we can have a John Cage themed post next?

  3. Isn't one problem that many users will actually interpret a positive estimate of theta, and significance against theta = 0, as disproof of "theta < small"? And since the latter could have been true, the logical trap is far harder to see.

    I've definitely encountered people looking at significance tests against theta=0 who, if confronted, regard the objection (i.e. that this null is almost surely false) as silly pedantry; for while they may have been trained to run two-sided tests, and that may be how the formalism looks, they _know_ they are "really" interested in a one-sided result, and thus rejection of theta = 0 (together with the estimate having the 'right' sign) has logical substance to them.
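
    That informal reading has a simple formal counterpart (made-up numbers below): a two-sided rejection of theta = 0 with a positive estimate amounts to a one-sided rejection of theta <= 0 at half the alpha, and the latter is not an always-false null.

      from scipy import stats

      estimate, se = 0.50, 0.22                 # made-up positive estimate
      z = estimate / se

      p_two_sided_vs_zero = 2 * stats.norm.sf(abs(z))   # the test that actually gets run
      p_one_sided_vs_leq0 = stats.norm.sf(z)            # the question people say they care about
      print(p_two_sided_vs_zero, p_one_sided_vs_leq0)   # about 0.023 and 0.012: the one-sided p is half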

    • bxg,

      Knowing only the “right sign” cannot be more useful scientifically than knowing the sign plus magnitude. Even under ideal conditions strawman NHST is extraneous.

      • I'm not sure this is correct, or at least it depends on a particular definition of scientific "utility". In extremis, people might be willing to advise an action that _might_ help somewhat (to an unknown degree), so long as they have satisfactory evidence that it won't be harmful; I can imagine that could even be legally significant. In science, knowing the direction of an effect might itself be regarded as a valid advance in knowledge (realizing that electrons repel each other is a scientifically contentful statement, even before having a more refined electrodynamics that helps you work out how much).

        But that is not my point! My point is that the objection "you are rejecting a surely-false hypothesis, therefore it can't be that interesting as a matter of simple logic" sometimes (and I have personal experience of this) falls on entirely deaf ears, because the practitioner thinks he is really doing something else (no matter what the formalism says). So while there might be a real and consequential objection to be made, it will be better heard using a more nuanced argument than "H_0 is always false". Theta <= 0 is not always false. "Theta is very small" (another fallback informal reading) is not always false. IMO it would be more useful to state one's concerns about NHST in a way that encompasses these informal readings as well. And/or be upfront: these informal interpretations are just wrong, and so…

        • “In science, knowing the direction of an effect might itself be regarded as a valid advance in knowledge (realizing that electrons repel each other is a scientifically contentful statement, even before having a more refined electrodynamics that helps you work out how much).”

          I agree that the direction is a contentful statement; however, you will need an estimate of the magnitude to perform NHST. In the case where the null hypothesis is a strawman and you can calculate a p-value, determining whether or not it is “significant” seems to add nothing.

          “My point is that the objection “you are rejecting a surely-false hypothesis, therefore it can't be that interesting as a matter of simple logic” sometimes (and I have personal experience of this) falls on entirely deaf ears, because the practitioner thinks he is really doing something else (no matter what the formalism says).”

          I think all the misunderstandings are a distraction. They exist because people refuse to believe that what they are doing is pointless. Even under ideal circumstances, when we know the direction of the effect with certainty, this information alone cannot lead to a quantitative theory. An estimate of the effect (for example “very small”) is needed, and this must be estimated to perform NHST anyway; the steps after that seem to serve no purpose (when the null is a strawman).

  4. So-called NHST, in which a statistically significant result is taken as evidence for a substantive alternative (one that might allegedly “explain” the result), exists only as an abuse of tests and not as any technique of mathematical statistics. It was invented (presumably by psychology researchers) and was alluded to only as the worst kind of howler by the founders of tests from any school. So ban “NHST” as enabling illicit inferences warranted with 0 severity!*

    (*The errors that would need to have been probed to warrant the substantive alternative have not been ruled out or even probed by dint of rejecting the null)

    • Mayo,

      The “hybrid” appears to have been invented by EF Lindquist in 1940. At least this is the first appearance of it. Please distinguish between NHST used to disprove a strawman and NHST used to test the predictions of a theory. Any method that disproves a strawman is useless.

    • Mayo:

      Indeed, NHST is an abuse of statistical ideas. Unfortunately, it's an abuse that happens all the time; in fact, it may be a case where the abuse is more common than the theoretically appropriate approaches. And the people who are doing this abuse often seem to think they're following the theory. This suggests to me that the theory has problems, in that it doesn't do what people seem to want it to do.

      And even theorists get tangled in these issues, for example writing that classical 95% intervals contain the true value 95% of the time, even when they don’t (because of systematic error that is not in the model). One problem is that statistical methods (classical and Bayesian both, in different ways) are sold as offering mathematical guarantees. But these guarantees only work conditional on assumptions which are generally false.
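
      Here is a minimal simulation of the coverage point; the size of the bias term is an arbitrary stand-in for unmodeled systematic error.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)
        theta, sigma, bias, n, n_sims = 1.0, 1.0, 0.4, 25, 10_000   # arbitrary settings

        covered = 0
        for _ in range(n_sims):
            # Data carry an unmodeled systematic shift of size `bias` on top of sampling noise.
            y = rng.normal(theta + bias, sigma, n)
            half_width = stats.t.ppf(0.975, n - 1) * y.std(ddof=1) / np.sqrt(n)
            covered += (y.mean() - half_width <= theta <= y.mean() + half_width)

        # Nominal 95%; actual coverage is closer to 50% with this much unmodeled bias.
        print(f"Actual coverage of the nominal 95% interval: {covered / n_sims:.2f}")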

      • Andrew,

        I made a comment below pointing out that you have misattributed a post by Fernando to me. You appear to have overlooked it. It's bugging me.

  5. First off, great title.

    More on topic, Question's statements, despite containing some truths, seem themselves to be strawmen, set up just to be rejected. But of course we won't use these to infer that Bayes is better – see, we just warned against that! No one would make that logical error, despite the current fashion, etc. (The Bayes thing is boring, yes, but it is still there! Just like these same NHST comments.)

    Slightly more constructively – I agree with the emphasis on substantive/research questions, and prefer to deal with mechanistic mathematical models far more than I like using most statistical methods, but what, say, is Question’s specific proposed method for falsifying models?

    • hjk,

      In order to falsify a model, your model must make a testable prediction and your analysis must test that prediction. The exact same NHST procedure, using non-strawman nulls, seems acceptable to me (although I am not convinced it is optimal).

  6. “I agree with the now-conventional wisdom expressed by the original commenter, that null hypothesis significance testing is generally inappropriate.”

    This may be conventional wisdom on this blog, but it certainly is not within the broader research community in academia/industry.

  7. So am I supposed to remove NHST from my toolbox because I can’t help but hurt myself? Or should I just be very careful when using it and remember that it might not be the right thing to use for all problems? Just looking for some prescriptive guidance.

    As a physics graduate student doing astrostatistics, I don't find a test with the null hypothesis H_{b}, that the data are only background noise (i.e. the signal rate is zero), ridiculous at all. Particularly if it's tested with simulated data first and the ROC curve looks good. To be clear, my results also come with posterior parameter estimates for both H_{b} and the signal+noise model H_{s+b}. While I believe there are sources out in the universe or I wouldn't be looking, it is not an unrealistic possibility that the signal rate is effectively zero.
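
    For readers outside astrostatistics, here is a toy version of that check (the rates are invented, not from any real instrument): simulate counts under the background-only model and under signal plus background, then see what operating points a simple count threshold gives.

      import numpy as np

      rng = np.random.default_rng(4)
      mu_b, mu_s, n_sims = 4.0, 6.0, 50_000         # invented background and signal rates

      counts_b = rng.poisson(mu_b, n_sims)          # simulated background-only data (H_b)
      counts_sb = rng.poisson(mu_b + mu_s, n_sims)  # simulated signal + background data (H_{s+b})

      # Sweep a detection threshold on the observed count to trace out the ROC curve.
      for t in (6, 8, 10, 12):
          fpr = (counts_b >= t).mean()              # false alarms under H_b
          tpr = (counts_sb >= t).mean()             # detections under H_{s+b}
          print(f"threshold {t:2d}: FPR = {fpr:.3f}, TPR = {tpr:.3f}")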

    • West,

      I can’t tell from your description if that is a strawman null hypothesis or not. If it is NOT a strawman (it is actually a plausible occurrence), my complaints are not directed at it.

        • West,

          My position is that if the null hypothesis is a strawman, it is impossible for disproving it to make any contribution to science that the estimate of effect magnitude needed to perform NHST would not already make.

          The example on that page already has a plausible null hypothesis based on the distribution of previous data from “safe suitcases”; I am not criticizing that use. If instead it were “Disprove that the Geiger count reading is zero; if it is greater than zero, check the suitcase,” that would have no usefulness. If you have ever played with a Geiger counter or survey meter you know there is always background “noise” present, so disproving that it is exactly zero is pointless. I do not see how the latter hypothesis is any different from “treatment and control groups are exactly the same on average”.

  8. Pingback: ScienceSeeker Editors’ Selections: March 30 – April 5, 2014 | ScienceSeeker Blog

  9. Pingback: Jessica Tracy and Alec Beall (authors of the fertile-women-wear-pink study) comment on our Garden of Forking Paths paper, and I comment on their comments « Statistical Modeling, Causal Inference, and Social Science

  10. Pingback: Undergrads Don’t Like Learning About it Anyways | mahoneypsych4850
