What hypothesis testing is all about. (Hint: It’s not what you think.)

From 2015:

The conventional view:

Hyp testing is all about rejection. The idea is that if you reject the null hyp at the 5% level, you have a win, you have learned that a certain null model is false and science has progressed, either in the glamorous “scientific revolution” sense that you’ve rejected a central pillar of science-as-we-know-it and are forcing a radical re-evaluation of how we think about the world (those are the accomplishments of Kepler, Curie, Einstein, and . . . Daryl Bem), or in the more usual “normal science” sense in which a statistically significant finding is a small brick in the grand cathedral of science (or a stall in the scientific bazaar, whatever, I don’t give a damn what you call it), a three-yards-and-a-cloud-of-dust, all-in-a-day’s-work kind of thing, a “necessary murder” as Auden notoriously put it (and for which he was slammed by Orwell, a lesser poet but a greater political scientist), a small bit of solid knowledge in our otherwise uncertain world.

But (to continue the conventional view) often our tests don’t reject. When a test does not reject, don’t count this as “accepting” the null hyp; rather, you just don’t have the power to reject. You need a bigger study, or more precise measurements, or whatever.

My view:

My view is (nearly) the opposite of the conventional view. The conventional view is that you can learn from a rejection but not from a non-rejection. I say the opposite: you can’t learn much from a rejection, but a non-rejection tells you something.

A rejection is, like, ok, fine, maybe you’ve found something, maybe not, maybe you’ll have to join Bem, Kanazawa, and the Psychological Science crew in the “yeah, right” corner—and, if you’re lucky, you’ll understand the “power = .06” point and not get so excited about the noise you’ve been staring at. Maybe not, maybe you’ve found something real—but, if so, you’re not learning it from the p-value or from the hypothesis tests.

A non-rejection, though: this tells you something. It tells you that your study is noisy, that you don’t have enough information in your study to identify what you care about—even if the study is done perfectly, even if measurements are unbiased and your sample is representative of your population, etc. That can be some useful knowledge: it means you’re off the hook trying to explain some pattern that might just be noise.

It doesn’t mean your theory is wrong—maybe subliminal smiley faces really do “punch a hole in democratic theory” by having a big influence on political attitudes; maybe people really do react differently to himmicanes than to hurricanes; maybe people really do prefer the smell of people with similar political ideologies. Indeed, any of these theories could have been true even before the studies were conducted on these topics—and there’s nothing wrong with doing some research to understand a hypothesis better. My point here is that the large standard errors tell us that these theories are not well tested by these studies; the measurements (speaking very generally of an entire study as a measuring instrument) are too crude for their intended purposes. That’s fine, it can motivate future research.

Anyway, my point is that standard errors, statistical significance, confidence intervals, and hypothesis tests are far from useless. In many settings they can give us a clue that our measurements are too noisy to learn much from. That’s a good thing to know. A key part of science is to learn what we don’t know.
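To see the point in miniature, here is a small simulation sketch (the true effect and standard error are hypothetical numbers, picked so that power comes out to roughly 6%): when a study is this noisy, rejections are rare, and the estimates that do clear the significance bar overstate the true effect many times over.

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect, se = 2.0, 8.1   # hypothetical: a small true effect, a large standard error
estimates = rng.normal(true_effect, se, size=1_000_000)

significant = np.abs(estimates) > 1.96 * se           # two-sided test at the 5% level
power = significant.mean()                            # comes out near 0.06
exaggeration = np.abs(estimates[significant]).mean() / true_effect

print(f"power = {power:.3f}")
print(f"significant estimates overstate the true effect by about {exaggeration:.0f}x")
```

The non-rejections are the honest signal here: they say the measurement is too crude to pin down an effect of this size.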

Hey, kids: Embrace variation and accept uncertainty.

P.S. I just remembered an example that demonstrates this point; it’s in chapter 2 of ARM and is briefly summarized on page 70 of this paper.

In that example (looking at possible election fraud), a rejection of the null hypothesis would not imply fraud, not at all. But we do learn from the non-rejection of the null hyp; we learn that there’s no evidence for fraud in the particular data pattern being questioned.

35 thoughts on “What hypothesis testing is all about. (Hint: It’s not what you think.)”

  1. There are those three possible verdicts in Scottish courts: Guilty, Not Guilty, and Not Proven. They seem pretty applicable here too.

    • Tom:

      In science, it’s different because the null hypotheses are just about always false. A hypothesis test summarizes how consistent a particular data pattern is with respect to the null hypothesis: non-rejection tells you that the null hypothesis adequately explains this aspect of the data. But the null hypothesis is still false. Which is one reason why rejection of the null hypothesis is not as noteworthy as people often seem to think.

      • Andrew:

        “non-rejection tells you that the null hypothesis adequately explains this aspect of the data”: I think non-rejection does not tell us that the null hypothesis is more compatible with the data than other possible true effects, e.g., the values covered by a confidence interval. We should probably make sure that “adequately” is not understood as “sufficiently”.

        Also, with regard to your post from 2015, it is possible that a null hypothesis is not rejected but the effect is measured with high precision and errors are small. This may happen if the effect size is close to zero. I think the p-value does not tell us much about how noisy our experiment is unless we look at the error bars. This is one reason why the reject / non-reject dichotomy should be abandoned, no?
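        As a small sketch of that distinction (the numbers below are made up): two results can both be non-significant at the 5% level, yet one is a precise estimate of a near-zero effect and the other is too noisy to say anything; the p-values are identical and only the interval widths tell them apart.

        ```python
        from scipy import stats

        def summarize(estimate, se, label):
            # two-sided p-value and 95% interval for a normal-theory estimate
            p = 2 * stats.norm.sf(abs(estimate / se))
            lo, hi = estimate - 1.96 * se, estimate + 1.96 * se
            print(f"{label}: p = {p:.2f}, 95% CI = ({lo:+.2f}, {hi:+.2f})")

        # made-up numbers: same ratio of estimate to standard error, very different precision
        summarize(0.02, 0.05, "precise study, effect near zero")
        summarize(2.00, 5.00, "noisy study, learns almost nothing")
        ```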

      • I’ve said it before: there is nothing special about “the” null hypothesis. You can calculate p-values for any hypothesis you want (and have the same issues about what they may mean).

        In the end, we really want to know how well-tested something is. That’s where the “not-proven” aspect would come in.

    • Tom: I agree that Scots law’s third verdict of “Not Proven” is a really useful concept when describing the difference between hypothesis and significance tests – I often use it when teaching. It also helped me understand what was going on in this work.

      But even armed with the idea of a third “say nothing” option beyond finding guilt or innocence, one is still stuck with the problem that essentially all statistical tests do dichotomize, somehow. With significance tests the verdicts are basically “Guilty” versus “Either not guilty or not proven, and we’re not saying which”. With hypothesis tests you get “Guilty” versus “Not Guilty” with no third option.

      Trying to get more nuanced decisions is hard – you have to say in hypothesis tests when you’d accept the null (not guilty) versus not say anything (not proven), and the tools for doing that are slippery, at best. The effort trying to use and explain that nuance is often better spent thinking in terms of estimation instead – as I’m sure Andrew would agree.

      PS A very old joke from Scottish lawyers defines “Not Proven” as “Not Guilty… and don’t do it again”

      PPS Italian courts have a bunch more verdicts

  2. >But the null hypothesis is still false

    Perhaps that is true in political science. In product development for physical products, the “null” is the current, often highly optimized, product. It is usually the best, so that the one-sided null is typically true. Additionally an “improvement” has to survive multiple replications over different tests and study populations. The failure rate for modifications is quite high (over 90% in some areas).

    • Reminds me of this from a few years ago though: https://www.ncbi.nlm.nih.gov/pubmed/26307858

      You start relying on superiority trials and leave out placebo, then it can turn out the original studies were flawed or something has changed and both treatments are now “worse” than placebo according to your metric. Then there are, of course, the usual issues with choosing the measurement you are using to determine what is “best”.

      • One quickly learns not to leave out the current controls, particularly when dealing with convenience samples, rapidly changing products, and replication, replication, replication. Doing so can lead to some spectacular failures during product launches.

    • > so that the one-sided null is typically true

      Bill: isn’t the null hypothesis always an exact hypothesis (at least in Fisher’s NHST framework)?

      The p-values of your one-sided tests (Pr(T(y^rep) > T(y) | H0 is true)) may always exceed .01 or .05. Andrew is saying that the null is false anyway (Pr(H0 is true) = 0). I don’t see a contradiction.

      • In the practical problems to which I refer, the “null” product is the current working product/process and my objective is to improve its performance, so it’s a dividing or separating null. If something noticeably fails, one does an autopsy to see why (if it’s not obvious). When I was interested in a point null, it was an equivalence testing problem.

        The difference is in claiming all nulls are false. Some are. Maybe they all are in Political Science, but not in all other areas.

    • “the “null” is the current, often highly optimized, product”
      The null is a probability model, not a physical product. The model is an idealisation, therefore can’t be true (nothing in the world is truly independent of anything else etc.).

      • There are hypotheses that can be true, for example “Madison wrote Paper 63” or “this skull fragment is Hitler’s”. However, it seems that you consider null hypotheses to be not statements about the world but statements confined within a model. Under this too restrictive interpretation you’re right, I guess: models are not real so nothing within a model can be real.

        • “Under this too restrictive interpretation you’re right,”
          How is this too restrictive? We’re talking here about statistical hypothesis tests, and these test statistical models, not general research hypotheses.

        • It seems other people think that “product A is better than product B” is a perfectly valid null hypothesis (for some definition of product and better). Can you imagine some extension of your interpretation allowing for the hypothesis to refer to a real property of the world? It is in that sense that your interpretation is too restrictive.

        • Well, I’m fine with idealisation, so obviously a certain H0 can be idealistically interpreted as formalising that “product A is not worse than product B”. What I don’t like is that people forget that idealisation takes place, and that in fact the probability model formalises something that is far more restrictive than this. Let’s say you measure the quality of the products by certain measurements. Before statistics even starts, one can discuss the appropriateness of this measurement, but let’s take it for granted for the moment. Now let’s say it’s taken on 100 products A and 100 products B, the average values are 24.5 and 22.3, and a one-sided two-sample t-test (H0: expected value for A >= expected value for B) gives you p=0.747. The H0 says that the measurements for A and B were iid normally distributed, which in fact they weren’t (for starters, I hadn’t told you that measurements never have more than one digit after the decimal point, which is impossible under a normal distribution; also, the production process may make us think that all these products are pretty much independent, but chances are one can find some tiny dependencies when looking hard for them).
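          A rough sketch of that calculation (the raw measurements aren’t given, so the spread below is an assumption and the p-value won’t reproduce 0.747 exactly; only the sample sizes and the two means are taken from the example):

          ```python
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(1)
          # 100 measurements per product, centred on the example means; the spread is assumed
          a = 24.5 + 12.0 * rng.standard_normal(100)
          b = 22.3 + 12.0 * rng.standard_normal(100)

          # one-sided test of H0: E[A] >= E[B], i.e. the alternative is "A is worse than B"
          t_stat, p_value = stats.ttest_ind(a, b, alternative="less")
          print(f"t = {t_stat:.2f}, one-sided p = {p_value:.3f}")  # a large p, compatible with H0
          ```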

          The statistical H0 is not true, but in terms of the statistic of interest, the mean quality measurement, the data seem perfectly compatible with the H0 *formalising the real research hypothesis of interest* and I’m fine with that unless the data give me reason to think that the model assumption was violated not only somehow but *in a way affecting the conclusion*, say most B measurements are higher than those for product A but a few outliers dragged the B mean down.

          All this doesn’t allow a positive statement like “we know for sure that A is better than B now”, but if now somebody believes that indeed product A is better than product B (without believing that the statistical H0 was perfectly true), I will not stop this person.

          Obviously a Bayesian approach could give the person a probability that the H0 is true, but it has the idealisation problem, too: things are modelled as exchangeable but in reality one may not precisely believe this; measurements are discrete but we may use a continuous model; outliers will affect something based on normality; etc. It’s a model, and models are not there for being literally true or “believed”.

        • This gets back to a discussion on another recent thread: That the model is a representation of information. In particular, the (statistical) null hypothesis as stated in a model is often a representation (within the model) of a null hypothesis in the real world situation that is being studied (by using a model).

      • Hello Christian,

        As a mathematics professor, your null may be a probability model. As a product developer, the “null” is the product I’m selling right now; the probability framework is something that gets added later in the process (and the notion of repeated sampling is really kind of weird). Typically, it comes from the trial design and randomization stage. Product development and marketing tends to start with the physical world and reach into the mathematical toolboxes as needed.

        The objective is usually to make a real-world decision. It is closer in spirit to Bechhofer’s Ranking and Selection ideas.

        • Bill:

          Statements such as, “The new product is better [under some clear definition] than the old” can indeed be true or false. But I don’t think the statistical tool called “null hypothesis significance testing” is a good way to attack these problems. If that’s the only tool available, fine, but we can do a lot better than hypothesis tests and p-values, if the goal is to make real-world decisions.

        • Andrew,

          I agree, there are better tools that may apply. There are practical considerations that need to be factored in though. When a product “improvement” is in development, it is constantly changing (as are the data sources.) That limits the utility of intervals and such. The developers are highly invested in the success of their product so their priors are not disinterested (and are usually wrong.) Utility functions would be great if they were known, but they change as the product changes. Also, they are tightly guarded by the development teams when they do exist. The internal competition is far more intense than the external competition, as having your program axed can mean you get axed, too.

          The best success I had was with the Ranking and Selection approaches. However, that only appealed to a small segment.

        • Bill:

          If there is as much noise in the data as you described above, how does “rank” control for that noise better than statistical modeling?

        • Curious,

          The measures aren’t that noisy, if one sticks with person-based counts (e.g., stratified Cox or logistic models). They are essentially discrete choice models. Ratings (1-7, how much did you like it) are quite unreliable.

          Ranking and Selection procedures drop the whole hypothesis testing approach, as well as p-values. Instead you set up an indifference zone to size the trial and pick the winner(s). There are two variations: either pick the winner (indifference zone approaches) or pick a smaller subset (subset selection). It works well when there are multiple variations being tested. (The only time one is in an NP two-hypothesis situation is on final release trials; otherwise there are always multiple alternatives.) As Andrew noted, the pure H0 and p-value approach doesn’t really cover it.
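          A minimal Monte Carlo sketch of the pick-the-winner idea (the number of variants, sample sizes, and indifference-zone width below are illustrative assumptions, not the procedures actually used): pick the variant with the best sample mean and check how often the truly best one wins when it leads the runner-up by at least delta.

          ```python
          import numpy as np

          def prob_correct_selection(k=4, n=50, delta=0.5, sigma=1.0, sims=20_000, seed=0):
              # chance that the truly best of k variants also has the best sample mean,
              # under the least favorable configuration: one variant leads by exactly delta
              rng = np.random.default_rng(seed)
              means = np.zeros(k)
              means[-1] = delta
              sample_means = rng.normal(means, sigma / np.sqrt(n), size=(sims, k))
              return (sample_means.argmax(axis=1) == k - 1).mean()

          for n in (10, 50, 200):
              print(f"n = {n:3d} per variant -> P(correct selection) ~ {prob_correct_selection(n=n):.2f}")
          ```

          Instead of testing a null, one chooses n large enough that this probability meets a target (say 0.95) whenever the lead is at least delta; differences smaller than delta are treated as matters of indifference.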

        • Bill:

          Ok. I think I misunderstood what you meant by Ranking and Selection. I agree with your comments about discrete choice as compared to ratings.

        • Bill,

          Another thing I wanted to mention.

          My only experience with non-parametric approaches is in studying them and not in applying them, so my position is possibly biased by this fact. The reason I never used them was that I could not get past the idea that something could or should be distribution free and also be robust as a predictive method of future behavior. As a way of making a decision today that had to be made, sure. But as a way to make the best decision that will be most predictive over time, it seemed to lack any rationale that would allow for that. It also seemed a sure fire method to ignore or miss potential confounds that result in fooling ourselves about which feature or set of features is actually most likely to result in the greatest amount of future revenue.

        • Curious,

          The end point of the product development process is repetitive buy/don’t buy choices by consumers, so a non-parametric approach is a natural fit. I could forecast market share and volume changes quite accurately using simple non-parametric models and commonly available market structure information (see Ehrenberg and Chatfield on repeat buying for more info).

        • “Statements such as, “The new product is better [under some clear definition] than the old” can indeed be true or false. But I don’t think the statistical tool called “null hypothesis significance testing” is a good way to attack these problems. If that’s the only tool available, fine, but we can do a lot better than hypothesis tests and p-values, if the goal is to make real-world decisions.”

          This begs the question of what other statistical methods (whether Bayesian or frequentist) exist other than hypothesis tests and p-values to attack the problem “X is better [under some definition] than Y”.

        • Christian,

          I agree that the statistical/probability model is an idealization, and one should check its fit. Experience has taught me to avoid parametric assumptions for the soft (psychology) parts of studies, and I simplify by testing within person and treating persons as strata (usually fixed). The CPG products I was involved with are used repetitively, so it was reasonable to switch test products across days. The order was randomized across persons (BIB or PBIBs), which supplies a simple randomization basis for a model. The ratings simply served as a way to order within person for discrete choice models, which is what we actually are interested in (will the consumer buy it again after using it or not).

    • On the other hand, the probability that your experiment would give the actual, exact result is basically nil, at least for a continuous variable. So you can be almost certain that the “true” result is not the value that you measured.

      Well, that’s the trouble with point measurements. Better to stop thinking in terms of point values … and point hypotheses then go away too.
