## “Why continue to teach and use hypothesis testing?”

Greg Werbin points us to an online discussion of the following question:

Why continue to teach and use hypothesis testing (with all its difficult concepts and which are among the most statistical sins) for problems where there is an interval estimator (confidence, bootstrap, credibility or whatever)? What is the best explanation (if any) to be given to students? only tradition?

I won’t attempt to answer this question but I will comment on the replies. Notably to me, none of the replies said anything about controlling Type 1 error rates or anything else. Rather, the main defenses of hypothesis testing were not defenses of hypothesis testing at all, but defenses of decision analysis.

This is interesting because in Bayesian inference, decision analysis comes automatically (I’d say “pretty much for free” but that’s not quite right because it can take effort to define a reasonable utility function. You could say that this is effort worth taking, and I’d pretty much agree with that, but it is effort.) so it doesn’t need any special name. To do Bayesian decision analysis you don’t need any null and alternative hypotheses, you just lay out the costs and benefits and go from there.

But, for people with classical training, “hypothesis testing” is a thing. And I agree that, if all you have is interval estimation, you need to take some other step to get to decision analysis.

Some of the discussants to this post did discuss Bayesian inference so it’s not like I’m saying that my above thoughts represent some deep new idea. My point here is that I’ve typically taken hypothesis testing at face value (as a way of evaluating the evidence against a null hypothesis), but I suppose that for many people, hypothesis testing is the default statistical tool for decision analysis. Scary thoughts.

1. Noah Motion says:

There’s no link to the discussion in question, but one fairly obvious reason to teach it is that there is an enormous body of literature in which hypothesis testing features very prominently. Given that the logic of hypothesis testing is not so simple, if people are going to read whatever subset of that literature, they should know what the reported statistics mean, even if they don’t use hypothesis testing in their own research.

Of course, there are also people (e.g., Mayo, Spanos) who argue for hypothesis testing and invoke error rates to justify it.

• Martha says:

Noah’s point about the “enormous body of literature in which hypothesis testing features very prominently” cannot be ignored. But in teaching for reading that body of literature, we need to emphasize the cautions needed in interpreting hypothesis tests. (For an attempt at this when I recently had only one hour as a guest lecturer, see http://www.ma.utexas.edu/users/mks/Cautions2014/cautions2014.html. However, the course did also cover some Bayesian statistics, so students were at least made aware of an alternative.)

• Martha says:

Most (if not all) of the examples in the first reply in the link are examples where in fact the frequentist hypothesis test paradigm is appropriate: when the same test is repeatedly applied in essentially different instances of the same situation, in particular in “quality assurance” situations, where one can meaningfully interpret Type I and Type II errors (analogously to false positive and false negative results in medical testing).

But the frequentist paradigm does not make as much sense in situations which are more “one off” rather than frequentist in nature. And these occur a lot — in the various psychological studies where frequentist hypothesis testing is so often used, as well as in a lot of biological applications where one is trying to figure out what has happened, or in other places where the “possible states of nature” perspective fits better than the “repeated sampling” perspective. When the “states of nature” paradigm fits better, then Bayesian methods fit better — because, for example, one is really interested in the probability distribution (given the data) of a parameter.

• Keith O'Rourke says:

> But the frequentist paradigm does not make as much sense in situations which are more “one off”

CS Peirce used the example of playing a card game with the devil, loss resulting in damnation (very one off).
He argued that for an individual the rationale was missing but not for a community – if a community has to play with the devil, frequencies are very relevant. As a for instance, Don Berry argues for type one and two error evaluation when using Bayesian approaches in regulatory review (e.g. FDA submissions).

In this post, it is the community (that will be taught), and how well they then do, that is at issue.

This has to be an empirical question. It has been empirically demonstrated that humans can actually do better with sub-optimal approaches given human information processing limitations. So we will need a community taught Bayes versus one taught frequentist comparison of later _success_ rates in their chosen fields.

Giving some thought to what they need to be taught to do Bayes _well_ in applications would be worthwhile. A recent survey of statisticians working in pharma (I think) found that most were unable to implement worthwhile Bayesian analyses or were unwilling to try (even some who had taken Bayes course work).

> one is really interested in the probability distribution (given the data) of a parameter.
If and only if it can be credibly obtained (e.g. calibrated).

2. Simon Gates says:

“…but I suppose that for many people, hypothesis testing is the default statistical tool for decision analysis.”

It’s worse than that – for many people, hypothesis testing is statistics.

3. Anonymous says:

The usual suspects are claiming it just hasn’t been taught correctly:
http://rejectedpostsofdmayo.com/2015/01/03/why-are-hypothesis-tests-often-poorly-explained-as-an-idiots-guide/

Back when people ate dirt, physicists started conducting experiments on “static cling” and lodestones. In about the same amount of time it took classical hypothesis testing to go from a major advance/paragon of “scientific” statistics to a proven practical disaster, those physicists mastered electricity, magnetism, optics, and electromagnetic waves, thereby creating the modern world.

Now imagine if some Electrical Engineer today made a smart phone that didn’t work, and all the physicists chimed in with: “the equations of electrodynamics are correct, we just haven’t figured out how to teach them yet. Give us another century and we’ll teach those equations right and our phones will work”. The Physicists would get fired.

It’s just the saddest, most ridiculous excuse ever that Frequentists, who completely dominated the teaching of statistics for the better part of a century, whose heroes literally wrote the book on the subject, weren’t given enough opportunity to teach it right. They had every opportunity imaginable and couldn’t do it.

• Anonymous says:

Why couldn’t they teach it correctly? Because the universal equating of “frequencies” with “probabilities” is a mistake. When you make that mistake you start to think all kinds of things make sense which simply aren’t true. So no matter how hard they try pedagogically, people keep getting wrong answers.

Now I’m sure Frequentists will chime in and say “we just need to make XYZ ad-hoc adjustment and mistakes ABC will be corrected”. Just like you can correct the theory of Epicycles with ad-hoc fixes, or a stock trader can make their losing trading strategy appear to be a winner with ad-hoc fixes.

That doesn’t make the resulting theory right and it certainly doesn’t make it useful. To make the disaster of Frequentism complete, there is overwhelming theoretical and practical evidence that if they perfected their methods with enough ad-hoc corrections they’d just get the Bayesian result.

• Noah Motion says:

I think there’s a lot of credibility to the claim that it’s often not taught correctly. The stats classes I took in a stats department were very good, and in each of them, the material was presented rigorously. Some of the stats classes I’ve taken outside of a stats department were excellent (e.g., John Kruschke’s first-year psych stats class), others were decidedly less so (I’ll not name names here). Similarly, the stats and methods textbooks I’ve seen vary substantially, with some being quite careful to present hypothesis testing (and other topics) correctly and others making a mess of it. In the latter category is, unfortunately, the undergrad research methods textbook I’m stuck with right now…

Another important issue, I think, is the fact that the people who can’t grasp the equations in physics stop doing physics, whereas large numbers of people in various social science fields will continue to pursue whatever subfield they’re in more or less regardless of their ability to really dig into statistics. Some of the people that make it through studying physics end up teaching it later on, while an unfortunately large number of people with relatively little interest and relatively little skill in stats end up teaching stats to up-and-coming social scientists.

Finally, how hypothesis testing is taught is, of course, largely distinct from whatever good or bad properties hypothesis testing has as a scientific tool.

• Anonymous says:

I’m not denying you and I were often taught crap. What I’m saying is Frequentists either need to admit they’ve got far deeper problems than poor pedagogy, or they need to be fired immediately for being the world’s most insanely bad teachers. There are thousands of topics successfully taught to non-physicists every year without issue that are much harder than hypothesis testing. Maxwell’s equations of electrodynamics for one.

Frequentists shouldn’t be allowed to paper over their failures with “it was taught bad”. They invented it, they taught it, they rammed it down everyone’s throat. They should face up to their failures.

• Anonymous says:

A hard subject like Electrodynamics is easy to teach well because Physicists have a deep understanding of it. “Deep understanding” = “good pedagogy”.

An easy subject like hypothesis testing is hard to teach well because Frequentists do not have a deep understanding of the probabilities. They thought they did, but they don’t. “Failure to understand” = “poor pedagogy”.

4. Elin says:

Well, I hate to get all sociological on you, but people teach hypothesis testing because:

because that is what is in the text books;

because that’s what students are expected to know when they go to other classes or read journal articles or get a job with some data analysis involved;

because that’s how the faculty member was trained;

because it is what everyone else does;

because, despite the backflip double twists that you have to do to explain the logic of testing (e.g. using estimates of the population variance based on the sample), the idea of a null hypothesis is conceptually understandable to students.

5. I think a lot of frequentist statistics can be very useful, but like so much complicated information, the key is intelligent interpretation. The famous saying is, liars, damn liars, and statisticians, but really this comment on the importance of smart interpretation applies to anything complicated, like purely theoretical models too – oy, as we really see in macroeconomics today.

That said, here I normally like confidence intervals much much better than the t-stats or p-values of a hypothesis test. They’re much less likely to mislead you into an unintelligent interpretation to reality, and you quickly understand the situation much better and more precisely. And the problem of mistaking economic significance for statistical significance pretty much can’t happen.

Sadly (and puzzlingly) confidence intervals are reported only a small fraction as often as p-values/t-stats in economics and finance research papers. I usually just estimate them in my head.
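
One way that kind of mental arithmetic could go is sketched below in Python. This is an assumption about the calculation, not something the comment spells out: it supposes the paper reports a point estimate together with a t-statistic against a null of zero, and that a normal approximation (multiplier 1.96) is adequate. The function name `approx_ci` is made up for illustration.

```python
def approx_ci(estimate, t_stat, z=1.96):
    """Approximate 95% interval from an estimate and its t-statistic.

    Assumes the t-statistic tests a null of zero, so that
    standard error = |estimate / t_stat|, and uses a normal multiplier z.
    """
    if t_stat == 0:
        raise ValueError("t-statistic of zero: standard error cannot be recovered")
    se = abs(estimate / t_stat)
    return estimate - z * se, estimate + z * se


# Hypothetical reported result: coefficient 2.0 with t = 4.0, so SE = 0.5.
low, high = approx_ci(estimate=2.0, t_stat=4.0)
print(f"approximate 95% interval: ({low:.2f}, {high:.2f})")
```

With small samples the 1.96 multiplier is too optimistic, so the interval obtained this way should be read as a rough lower bound on the uncertainty.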

As far as utility/loss functions and prior functions, a big problem is that when you try to put the actual, in reality, utility/loss function or prior into a manageable mathematical equation, you have to massively simplify and approximate the complicated messy reality. So this simplification, which is often enormous, can lead to being way off, really misleading. So again, the great importance of smart interpretation to reality. You just don’t take the Bayesian result literally, you have to interpret it smartly, to ask how might the reality differ from the utility/loss function, or prior, I used, and how might that affect the results in material ways. How should I compensate for that in my interpretation to reality, using my flexible high-level intelligence.

So essentially, whether you use Bayesian or frequentist, you’re still going to have to use your flexible high-level intelligence in your interpretation to reality, if you really want to make intelligent and effective conclusions to reality, and avoid saying some really silly and/or harmful things.

• Corey says:

I believe the saying is actually “There are three kinds of lies: lies, damned lies, and statistics.”

6. Lewis says:

Slightly off topic, but this is a thread about hypothesis testing, so I wanted to bring this to your attention Andrew — another prime candidate for your push to get the NYT to correct very basic factual errors. An op-ed from this morning’s Times:

http://www.nytimes.com/2015/01/04/opinion/sunday/playing-dumb-on-climate-change.html?ref=opinion

The author (not a journalist, but a Harvard history of science professor), in an attempt to explain “correlation does not equal causation,” goes ahead and makes the correlation equals causation mistake herself and completely mangles her explanation of confidence intervals and hypothesis testing in the process: she actually argues that the purpose of a hypothesis test is that once a correlation passes the 95% threshold, it can be interpreted as evidence of a causal relationship, and then goes down further from there. Maybe worth a post…

• Robert Grant says:

Not only that, but odds and risk get mixed up in that article too.

• Keith O'Rourke says:

Robert:

Nathan Schachtman wrote the post and I think NormalDeviate nailed the real issue “you [not Nathan in particular] missed my point which was in the following sentence. She (the NY times writer) was confusing a sampling issue (significance) with the problem of causation vs correlation.”

I tried to amplify it

> [Nathan] “In stead, the problem with this epidemiologic example is the internal and external validity of the studies involved, substantial bias and confounding concerns”

> [Nathan] “I am not terribly familiar with the models or the data in climate science.”

I will agree with the above and the below

> [Nathan] “idealized model never exists in observational epidemiology, but there are papers with well-controlled logistic regressions, or with propensity score controls, which require us to take the reported associations seriously”

with the caveat rarely in my experience to reducing the uncertainty to that which is quantified in type 1 and 2 errors.

So why not make folks clearly aware of this?

There is also an interesting discussion between Nathan and Sander Greenland on expert witnesses in litigation that I think is well worth reading.

• Martha says:

The following item posted on ASA Connect by Steve Pierson of ASA is relevant to Lewis’s comment:

“Ron Wasserstein’s recent blog entry on http://www.Stats.org explains the difference between odds and probability (www.stats.org/?p=1442). While the audience could be anyone, the column focuses on reporters. We’d like to have additional pieces aimed towards improving the accuracy of writing about statistical concepts.

Do you have any ideas for such a column? Have you recently seen statistical misperceptions or errors in the media? If so, please share them with me”

7. “This is interesting because in Bayesian inference, decision analysis comes automatically (I’d say “pretty much for free” but that’s not quite right because it can take effort to define a reasonable utility function. You could say that this is effort worth taking, and I’d pretty much agree with that, but it is effort.) so it doesn’t need any special name. To do Bayesian decision analysis you don’t need any null and alternative hypotheses, you just lay out the costs and benefits and go from there.”

In decision analysis, as you mention, one has to “lay out the costs and benefits and go from there”. It’s free *after* that work is done, so it’s not really free (and you have yourself pointed out earlier that you never said it was easy). The crux of the problem is that there seems to be no way to define costs and benefits without setting conventions that everyone would agree on, just like the alpha of 0.05.

I think you are referring to loss functions like these:

Let the hypothesis H be: $\theta \in T$. We can work out the posterior probability of H being true given the data: $P(H\mid x)=\int_T f(\theta\mid x)\, d\theta$.

The loss function is now:

\begin{table}[ht]
\centering
\begin{tabular}{rrr}
\hline
Decision & H true & H false \\
$d_1$ H true & $0$ & $l_1$ \\
$d_2$ H false & $l_2$ & $0$ \\
\hline
\end{tabular}
\end{table}

This leads us to the statements of the expected losses, writing $p = P(H\mid x)$:

a. Accept H: $p\times 0 + (1-p)l_1 = (1-p)l_1$

b. Reject H: $p\times l_2 + (1-p)\times 0 = p\,l_2$

If $(a)<(b)$, accept H. I.e., accept if:

$(1-p)l_1 < p\,l_2 \iff p > \frac{l_1}{l_1+l_2}$

Is this what you’re talking about? If yes, now we just have to think about what costs $l_1$ and $l_2$ should be. If I decide that $l_1=0.95$ and $l_2=0.05$, then my posterior probability has to be greater than 0.95.

Is this a reasonable way to proceed?
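
As a rough sketch of the rule being asked about, here is the comparison in Python. It assumes $l_1$ is the loss from accepting H when H is false and $l_2$ the loss from rejecting H when H is true, so that H is accepted when the expected loss of accepting, $(1-p)l_1$, is smaller than that of rejecting, $p\,l_2$. The function names are invented for illustration and do not come from any library.

```python
def acceptance_threshold(l1, l2):
    """Posterior probability above which accepting H has the lower expected loss."""
    return l1 / (l1 + l2)


def decide(p, l1, l2):
    """Return 'accept' or 'reject' for H given posterior probability p = P(H | x)."""
    loss_accept = (1 - p) * l1  # we accept H, but H is false with probability 1 - p
    loss_reject = p * l2        # we reject H, but H is true with probability p
    return "accept" if loss_accept < loss_reject else "reject"


# The losses from the comment: l1 = 0.95, l2 = 0.05.
print(acceptance_threshold(0.95, 0.05))
print(decide(0.97, 0.95, 0.05), decide(0.90, 0.95, 0.05))
```

Written this way, it is clear that the threshold depends only on the ratio of the two losses, which is why choosing them amounts to choosing a convention much like alpha.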

• Andrew says:

Shravan:

I’m glad you asked this because it indicates a complete failure of communication. When I talk about decision analysis, I’m talking about actual decisions, for example whether to measure your house for radon, or whether to perform a certain medical test, or how much to pay people to participate in a survey. (These are the three examples in the decision analysis chapter of Bayesian Data Analysis.)

When I talk about the sort of decision analysis I would do or recommend, it is always about actual decisions, never never never never about accepting or rejecting hypotheses, about true and false hypotheses, and all that crap. I see all that as a valiant but hopelessly flawed approach to handling uncertainty.

• Bill Jefferys says:

I agree with Andrew. Decision analysis is about making decisions to take some concrete action, given a choice of several actions that one may take. Simply saying “I think the null hypothesis is true (or false),” although it may be a sort of “action,” is pretty vacuous. The real question is, what are you going to DO?

Also, in decision analysis there may be (and often are) more than two actions that are under consideration. Take, for example, a jury (hypothetically) using decision analysis to decide what to do in a capital case. One decision could be to acquit; the defendant goes free, but if the defendant was actually guilty, there is a potential loss if he reoffends. Another could be to convict and sentence the defendant to life in prison (and “with or without possibility of parole” offers two different actions), with a potential loss of unjustly convicting someone who was actually innocent (and the guilty party remains at large and may commit another crime). Or, the jury could decide to convict and sentence the defendant to death (even greater injustice if the defendant is actually innocent, and possibly a greater potential loss if the actual criminal commits another crime, which may be a greater crime if the jury thinks that the crime was heinous enough to hand down the death penalty).

[This last for those who live in death penalty states. I’m fortunate to have escaped Texas and moved to Vermont when I retired. In Texas when I posed this to classes, occasionally the class would arrive at loss functions that could result in the death penalty. That has not happened in similar classes I’ve taught at the University of Vermont. I suspect cultural differences may have influenced this.]

It’s true that no actual jury is going to use decision theory to come to a verdict; nonetheless, a conscientious jury is surely going to do an informal analysis that considers the harm that may be done by making the decision amongst several choices (e.g., acquit, convict with penalty A, convict with penalty B,…) under different states of nature (e.g., guilty, innocent).

And I would argue that real-world decision problems often have more than two actions and/or two states of nature. So, in my opinion, the strictly binary hypothesis-testing model is hopelessly inadequate.
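
To make the multi-action point concrete, here is a minimal sketch in Python of choosing among several actions by expected loss. Everything specific in it is an assumption made for illustration: the loss numbers are invented (they come from no real case and are not values proposed in this thread), and the action and state labels simply mirror the "acquit, convict with penalty A, convict with penalty B" and "guilty, innocent" wording above.

```python
# Hypothetical losses in arbitrary units, invented purely for illustration.
# Rows are actions, columns are states of nature.
LOSSES = {
    "acquit":            {"guilty": 10.0, "innocent": 0.0},
    "convict_penalty_A": {"guilty": 2.0,  "innocent": 50.0},
    "convict_penalty_B": {"guilty": 0.0,  "innocent": 500.0},
}


def expected_losses(p_guilty, losses=LOSSES):
    """Expected loss of each action given the probability that the state is 'guilty'."""
    probs = {"guilty": p_guilty, "innocent": 1.0 - p_guilty}
    return {
        action: sum(probs[state] * row[state] for state in probs)
        for action, row in losses.items()
    }


def best_action(p_guilty, losses=LOSSES):
    """Action with the smallest expected loss."""
    exp = expected_losses(p_guilty, losses)
    return min(exp, key=exp.get)


for p in (0.5, 0.9, 0.999):
    print(p, best_action(p))
```

With these made-up numbers the preferred action moves from acquittal to penalty A to penalty B as the probability of guilt rises, which a single accept/reject threshold cannot express. Adding more states of nature only means adding columns to the table.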

• Well, I see what you’re saying and it all makes sense. But it has nothing to do with what psychologists, linguists, and suchlike people are doing. We are really studying specific research hypotheses. What we are going to “do” is not commit someone to death. At most the action consists of saying that we believe that X holds, where X is some hypothesis. Why does an analysis have to lead to doing something physically?

My point is that when one is criticizing the hypothesis testing community, so to speak, you at least have to address what they are trying to achieve. That doesn’t seem to be the case here.

• Andrew says:

Shravan:

I agree with you that no physical action is needed. I just don’t see the need to place scientific learning in a hypothesis-testing framework. I’ve done a lot of applied research in social and environmental sciences, and I’ve evaluated a lot of hypotheses, and it just about never comes up that I want to formally say that I believe that X holds.

• Maybe you should work with me on a research problem some day. I’ll bring you to the moment where you have to make a commitment as to whether X holds ;)

• Bill Jefferys says:

I would say that the actions actually involve doing more than saying that you think X holds or does not hold.

Suppose you come to such a conclusion. What are you going to do with it?

Presumably, you are going to publish (or not publish) a paper, at least.

Which in itself involves losses/utilities.

If you publish a paper that turns out not to be supported when investigated by others, as happens much too often, as you know (see the famous PLoS Medicine paper by John Ioannidis, for example, and many others), then you may end up with egg on your face. This might not enhance your promotion chances, or your chances of getting funding for a new project.

But if you don’t publish and it turns out that your conclusion would have been correct (supposing that the hypothesis gets investigated by others at a later date) then you’ll miss out on getting credit for your research.

Both of these are losses.

They have to be balanced by estimating the loss (or utility) of publishing something that is correct.

Furthermore, the loss/utility will depend on the hypothesis being tested. How important is it to study the color of dresses worn by women at different stages in their ovulatory cycle? Does establishing an unusual and important result (like showing that a new drug is actually effective against a particular condition with minor side effects) rank as more important than learning that women wear red more often at particular points in their cycle? Are the losses/utilities going to be greater or less in the drug situation than in the red dress situation? Which hypothesis has the greater chance of getting you elected to the National Academy, for example, if you publish it, or of significantly damaging your reputation if you publish what later turns out to be a wrong conclusion?

So I conclude that this fixation on p<0.05 (or 0.01 or any other conventional number) is entirely misplaced since it ignores losses/utilities that are obviously there, at least to my eyes.

• psyoskeptic says:

The problem is that there is virtually no cost at all to publishing a finding that isn’t replicable as long as the original finding is supported with p < 0.05. The decision is always to publish because the loss is almost always 0.

8. Thanks. But Andrew, then you are talking at cross-purposes. The people you are criticizing (or generally criticize) do planned (OK, maybe post-hoc) comparisons to test whether $\theta \in T$, where $T$ is some range of values. This is what a huge number of people do in their research. What should these people do if not hypothesis testing? Their main business is testing hypotheses. What is a non-flawed approach to handling uncertainty for these people?

• Andrew says:

Shravan:

You write, “their main business is testing hypotheses.” I disagree. Their main business is doing science, or engineering. For example, those ovulation researchers are interested in connections between the ovulatory cycle and behavior. For this purpose, I think they should do within-person studies with measurements that are as accurate as possible. They can summarize their inferences in various ways, but I don’t think they’ll go far if they view their business as testing the null hypothesis that ovulation has zero correlation with behavior.

In other settings, researchers are actively interested in making a decision: for example, how much to spend on different sorts of advertising, or how large a sample size they want for a new survey. In this case I think they should balance costs and benefits, perhaps formally using a probabilistic decision analysis or informally by setting various goals.

Virtually never do I see it as an appropriate research goal to declare, “Hypothesis A is correct” or “Hypothesis B is correct.”

• “Virtually never do I see it as an appropriate research goal to declare, “Hypothesis A is correct” or “Hypothesis B is correct.””

Here is an example where the goal is to declare hypothesis A or B as correct:

When we fixate on a word during reading, do we, through parafoveal processing, acquire information (beyond low-lexical information) about the next word (that we haven’t fixated on yet)? Reading research in psychology has tried to find evidence for hypothesis A (no, we don’t acquire high level information parafoveally; basically arguing for the null) and opponents of that view have argued for hypothesis B (yes, we do acquire information parafoveally).

Whether A or B is correct has implications for the architecture of the eye-movement control system. You would probably agree that building that architecture is doing science. But doing that science presupposes having figured out whether A or B is correct.

Even in the ovulation and wearing red study, what if they had done the study right, with a within subjects design, and with better measurement of ovulation etc? I am not defending their use of null hypothesis testing. Suppose they fit a Bayesian hierarchical linear model to their data. They still have to make a decision about whether their hypothesis (that ovulating women tend to wear red) is supported or not. They are doing science, and the advance in scientific understanding depends on whether this hypothesis is supported or not.

• Andrew says:

Shravan:

I’m on shaky ground talking with you about your own research, so let me just say something like, if I were studying this, I’d try to estimate the amount of information acquired in that way. I have no objection to doing a study and concluding that, yes, the null hypothesis could explain what you see. Or concluding that there are aspects of the data that don’t seem to be explainable by the null hypothesis. This is model checking and I have no problem with it. I don’t think it’s a matter of saying A or B is correct; rather, it’s possibly a matter of saying that model A is sufficient to explain the data that we see.

For the ovulation study, the point is that of course behavior varies throughout the cycle, and this is worth studying. Behavior also varies by day of week, day of month, etc etc etc. There are lots of things that can be studied here. I don’t see “ovulating women wear red” as a particularly interesting hypothesis, and apparently the researchers in that field don’t think it’s so interesting either (I say this because they seem to have expressed no interest in actually looking at days of ovulation or peak fecundity as agreed upon by the medical community). But that’s ok, the point is that any pattern here is potentially interesting. For that example I think it’s a big mistake to take a general question (about ovulation and behavior, or fecundity and clothing choice or whatever) and frame it uninterestingly in terms of one specific comparison.

• Martha says:

This discussion between Shravan and Andrew brings to mind something that had slipped to the back of my mind: A few months ago, I came across a post entitled “the simpleminded & the muddleheaded” on Simine Vazire’s blog (http://sometimesimwrong.typepad.com/wrong/2014/07/the-simpleminded-and-the-muddleheaded.html). The title refers to a distinction made by Whitehead and Russell but picked up by Meehl, quoted as
“The two opposite errors to which psychologists, especially clinical psychologists, are tempted are the simpleminded and the muddleheaded”

Vazire gave her own take on the distinction as
“another way to describe these groups is that the simpleminded are terrified of type I error while the muddleheaded are terrified of type II error. pick your paranoia,”
elaborating with discussion including,
“one thing that has struck me as i’ve observed discussions is that people at both extremes have incredibly strong intuitions about which is the bigger problem.”

After more discussion, she said,
“so, now i come to a question i can’t answer: what would the muddleheaded think is a fair test of their intuitions, and what evidence would cause them to reconsider those assumptions?”

There were numerous comments, but none seemed to address this question until one commenter pointed this out, adding,
“this is, to me, the fundamental question that each side has to answer. we should make our beliefs/intuitions falsifiable, by making concrete empirical predictions about what the world would look like if our intuitions are correct.”
This pretty much seemed to stop the conversation.

I pondered making a comment – but didn’t really see how I could add to the discussion, since the whole idea of having strong intuitions/beliefs of the types discussed and thinking about what might “falsify” any beliefs I might have was a framework that I didn’t know how to fit my thinking into.

The point, I guess, is that so often when I read things by psychologists, they just seem to have a different way of thinking than I do. As a mathematician, I didn’t start out with, “I believe this and am trying to prove it is true.” Instead, I investigated to try to figure out what could be established by rigorous proof and what could be refuted by counterexample.

I approach science and statistics in the same manner – except that the “evidence” cannot be expected to be as strong as mathematical proof or counterexample, so I often have to leave things at “plausible,” or “consistent with the evidence,” or “I really have no idea.” But I’m not out to prove or disprove anything in particular (nor to substantiate or discredit). It’s just a matter of trying to figure out what is and is not true – and to remain uncertain to the extent that the evidence warrants.

• I want to repeat that my comments are not about using null hypothesis testing. I myself use posterior distributions to think about what conclusions the data lead me to. I still don’t know what changes for me if I abandon null hypothesis significance testing.

Martha, I think you underestimate psychologists and the like. You’re attributing to them a lack of sophistication that I just don’t see in general (there are certainly specific cases). I don’t know anyone (except maybe beginning students) whose work would fit your mischaracterization below:

“I approach science and statistics in the same manner – except that the “evidence” cannot be expected to be as strong as mathematical proof or counterexample, so I often have to leave things at “plausible,” or “consistent with the evidence,” or “I really have no idea.” But I’m not out to prove or disprove anything in particular (nor to substantiate or discredit).”

The conclusions from a study in these non-math but experimental fields really *are* about leaving things as “plausible” or “consistent with the evidence”. Work with conclusions like “I have no idea” usually doesn’t get published (a separate problem not under discussion here right now).

I do concede that papers are written with overly strong General Discussion sections. But this has absolutely nothing to do with null hypothesis significance testing, which is the topic of this post.

If the whole field of psychology etc. were to suddenly shift to fitting Bayesian models tomorrow, we would still have that problem. It’s a cultural thing to present a “strong case”, even though one should be properly circumspect and say only that the data are consistent with hypothesis A (or that A is enough to explain the data). This overly strong conclusion is not limited to psychology etc. Physics has it too (there was a discussion on NPR’s SciFri once about how, during a discussion with physicists, Max Planck pulled a theory out of thin air just to avoid the humiliation of admitting that he was wrong about something). In arguments about theoretical positions, people (even physicists) often don’t want to back down from their own position even in the face of data, just because it involves loss of face. Physicists may eventually concede the point more often than psychologists, but the tendency is there.

So, what I take from this discussion is that one should be more circumspect about one’s conclusions than one is. At most say, “hypothesis A is sufficient to explain the data” or something that clearly marks your uncertainty. Fine, that makes sense. But this point has nothing to do with NHST.

• Elin says:

I looked at those articles when you discussed them before and they are pretty clear they are interested in hormone levels, not ovulation, because it is the hormone levels that cause the chimp genitals to turn red. To me the whole problem with that line of study is the idea that somehow something that is purely physical in chimps morphs into something about human decision making through some unspecified bio-psychological model.

• Martha says:

Elin,
Thanks for pointing this out. It explains why they focused on red — but I agree with you that the line of reasoning leading to their hypothesis seems tenuous.

• Anonymous says:

“Whether A or B is correct has implications for the architecture of the eye-movement control system. You would probably agree that building that architecture is doing science. But doing that science presupposes having figured out whether A or B is correct.”

Can you make a direct measurement of the eye-movement control system? What is the implication of A or B being correct with respect to the architecture of the control system? If you could articulate the relationship clearly then you can make progress with a Bayesian model.

One thing that’s endemic to NHST is that, because it considers such systemic relationships piecemeal, it’s always one (usually confounded) correlation at a time hand-waved into the bigger “architecture” of the problem, and there’s never any real progress on the architecture that’s supposedly motivating (and funding) the research in the first place.

• Kyle C says:

I wish I’d saved that great quote — anyone know what I’m referring to? — by the academic psychologist who wrote that one can make an entire career out of testing and refining and retesting hypotheses about discrete branches of a research paradigm without making any headway on determining the validity of the overall framework.

• Apparently the blog crash caused my comment to be lost. It was fairly long, but one important point was that I doubt very much that anyone believes that *absolutely no* information is acquired by parafoveal processing… I often find myself thinking “gee how far am I through this muddy paragraph?” especially when reading boring material. I believe that I acquire information about the relative position in the page, and therefore whether there exist further words to read, and approximately how many there are… all in “peripheral” vision. So at least I suspect we’re acquiring *existence* information parafoveally.

Which is to say, the real question isn’t “is this quantity zero or nonzero” but rather “approximately how much information do we acquire?”

There are vast fields of science where I think the main thing is really “how the heck can we even approximately measure what we’re interested in” and the hypothesis testing methodology typically employed is totally inappropriate for such questions. The Bayesian approach where we’re trying to put weight on different values to constrain the plausible value of an indirect measurement is the only thing that makes sense in this context.

Andrew, I recently watched your talk at the Simons Foundation. Very interesting and recommended to all readers here. I was so shocked however to hear you dismiss (at 47:35) the Type I and II decision-making paradigm with the comment “It doesn’t really work.”

I interpret your words “doesn’t really work” as meaning “it’s always incorrect and never appropriate or a correct approach”. As a humble layperson, I would like to respectfully ask whether perhaps your claim as restated may be too strong?

As a counter-argument, can we please consider diagnostic tests predicated on large-N samples, where the standard of judgment is essentially “preponderance of the evidence” and the goal is to make a statistically valid, clinical binary decision (preliminary positive diagnosis vs. almost-certainly negative)?

Certainly I’ll grant you that advanced Bayesian frameworks appear to provide state-of-the-art decision support. James Simons himself is famously said to have hired IBM’s entire Bayesian team to augment his already-phenomenal research and trading operation.

But for the pedestrian data analysts of the world, including myself, I just don’t see how it “doesn’t really work” to apply a conservative Type I/II methodology? Speaking as a modestly successful algorithmic trader many orders of magnitude below JS’s achievement, Type I / II definitely works for me in the real world of (to keep it simple) deciding to buy or not. I look forward to someday successfully incorporating Bayesian methods into my own little machinery. At present, classical NHST seems to be truly adding value.

In short, I would think that teaching NHST as a “first-order” analytical toolset continues to make sense.

• I think you should reinterpret “it doesn’t really work” as “it doesn’t do what it claims to do” not essentially “it has no value”.

1) Many classical procedures are just Bayesian procedures in disguise with flat priors
2) If you really are repeatedly “playing a game” over and over, the long-term frequency approach can be perfectly valid, even though the Bayesian approach *can* be more efficient or effective. Even Bayesians are sometimes in the situation where all they have for a “state of knowledge” is a histogram of past results, so they use a probability distribution that is the same as a (past) frequency distribution and hope that the future looks like the past.
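To make point 1 concrete, here is a minimal sketch for one textbook case only: estimating a normal mean when the measurement standard deviation is known. The function names, the numbers, and the choice of a normal prior are illustrative assumptions, not anything from the comment above. As the prior standard deviation grows (a nearly flat prior), the 95% posterior interval approaches the classical 95% confidence interval.

```python
import math
from statistics import NormalDist


def _z(level):
    """Two-sided standard normal quantile for the given coverage level."""
    return NormalDist().inv_cdf(0.5 + level / 2)


def confidence_interval(xbar, sigma, n, level=0.95):
    """Classical interval for a normal mean with known sigma."""
    se = sigma / math.sqrt(n)
    z = _z(level)
    return (xbar - z * se, xbar + z * se)


def credible_interval(xbar, sigma, n, prior_mean, prior_sd, level=0.95):
    """Posterior interval for the same mean under a normal prior (conjugate update)."""
    prior_precision = 1 / prior_sd**2
    data_precision = n / sigma**2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * xbar)
    z = _z(level)
    half_width = z * math.sqrt(post_var)
    return (post_mean - half_width, post_mean + half_width)


# Hypothetical data summary: sample mean 2.0, known sigma 3.0, n = 25.
print("classical:        ", confidence_interval(2.0, 3.0, 25))
print("nearly flat prior:", credible_interval(2.0, 3.0, 25, prior_mean=0.0, prior_sd=1e6))
print("informative prior:", credible_interval(2.0, 3.0, 25, prior_mean=0.0, prior_sd=1.0))
```

The agreement holds for this conjugate normal case; it should not be read as a claim that every classical procedure has an exact flat-prior counterpart.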

Daniel, thanks for your comments. I’m sorry, I don’t understand the significance of your distinction between “it doesn’t do what it claims to do” and “it has no value”? Please also see my response to Anonymous below.

• Anonymous says:

In the real world, a strictly type I or strictly type II paradigm has limited value. Why? Because in the real world, people want to know about _the_ error rate (and by extension, the utility), not just the type I or type II. Depending on your base rate of true positives / negatives, a given level of type I error rate can be acceptable or disastrous. However base rates are completely ignored in NHST analyses (or worse, we paper over them under the guise of multiple comparison adjustments).
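The dependence on the base rate can be worked out with a few lines of arithmetic. The sketch below is illustrative only: the function name is made up, and the 80% power and the example base rates are assumptions, not figures from the comment. It computes the share of “significant” results that are false positives when the Type I error rate is held at 0.05.

```python
def false_discovery_share(base_rate, alpha=0.05, power=0.8):
    """Share of significant results that are false positives.

    base_rate: assumed fraction of tested hypotheses where the effect is real.
    alpha:     Type I error rate of the test.
    power:     assumed probability of detecting a real effect.
    """
    if not 0 < base_rate < 1:
        raise ValueError("base_rate must be strictly between 0 and 1")
    false_positives = alpha * (1 - base_rate)
    true_positives = power * base_rate
    return false_positives / (false_positives + true_positives)


# Same alpha, very different error characteristics as the base rate changes.
for base_rate in (0.5, 0.1, 0.001):
    share = false_discovery_share(base_rate)
    print(f"base rate {base_rate}: {share:.3f} of significant results are false")
```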

>In the real world, a strictly type I or strictly type II paradigm has limited value.

I’m sorry, I’m not following you. Are you describing an artificial, deliberately blinded approach to data analysis? In any case, the pre-Bayesian, 20th century NHST paradigm seems to be given a fair degree of credit for dramatically advancing health and safety standards:

“According to medical historian Harry Marks, the modern controlled clinical trial is largely an American invention as statistically-based clinical trials became a critically important part of evidence-based medicine in the U.S. following WWII.” [1]

“The randomized clinical trial and associated NHST are the mainstays of certain safety and efficacy approaches, such as the FDA drug trials described later..” [2]

[2] “Infant Formula: Evaluating the Safety of New Ingredients” Accessed via :

• Andrew says:

This has come up a few times on the blog and in my recent articles. In short, I think that classical null-hypothesis-significance-testing methods can work well when studying large, stable effects in the context of small and well-controlled variation (the sorts of problems that Pearson, Fisher, etc., worked on), but not so well when studying small effects that are highly context dependent and with messy measurements.

Hi Andrew, thank you for clarifying. Having followed your blog for a while now, and having read several of your papers, I see that what you just said is what you meant in your Simons talk. With respect, though, I do feel maybe you could have qualified yourself. E.g. “It doesn’t really work in every situation.”

The impression I get is that you and others are trying to promote a paradigm shift slightly analogous to the shift from classical to relativistic mechanics. Yet from today’s context, wouldn’t it sound slightly silly to say “Classical mechanics doesn’t really work”? Yes, maybe not when your subjects are super-massive and/or moving close to the speed of light. But if the problem is hoisting a piano using a compound pulley, then the classical equations are indeed fit-for-purpose, IIRC.

Bringing this back to your blog post topic, perhaps there’s something to be learned from Physics pedagogy? Namely, teach the NHST approach as a “classical” first approximation which works well under simple, ideal conditions, as you just stated.

• Anonymous says:

I’m sorry, the classical mechanics analogy doesn’t work because classical mechanics is a limiting case of relativistic mechanics (and QM for that matter). Its behavior is nested within the more detailed models.

Arguably frequentist statistics is nested as a subset of Bayesian statistics, but NHST is not. NHST is a completely orthogonal model in the sense that what it aims to model is contradictory to Bayesian data analysis. It is only post hoc that people like Greenland have gone back and figured out coincidental points of agreement out of practicality.

I would add that the only reason NHST has been partially usable for clinical trials is that the base rate probabilities in a clinical trial are somewhat regularized by the process/cost going into them. The application of NHST where that’s not the case (e.g. genomics, large-scale inference, increasingly larger populations of mediocre researchers) made its systemic flaws much more obvious.

> NHST is a completely orthogonal model in the sense that what it aims to model is contradictory to Bayesian data analysis.

Do you consider statistical power to be an element of NHST? If so, I don’t see how it can be completely orthogonal / contradictory to Bayesian logic, given that both express conditional probabilities?

>NHST has been partially usable for clinical trials because the base rate probabilities in a clinical trial are somewhat regularized by the process/cost going into them

Would you mind please saying a bit more about “regularized by process/cost”?

• Anonymous says:

Well, Bayesian logic accounts for the base rate; NHST (as it is typically practiced) does not. Even in a power calculation, the calculation is entirely conditional on the null hypothesis being false. Anyway, power is a bit of a separate issue. For a good illustration of how Bayesian logic and NHST are fundamentally modeling different things, see Kruschke’s visualization here:
http://doingbayesiandataanalysis.blogspot.com/2013/12/icons-for-essence-of-bayesian-and.html

>Would you mind please saying a bit more about “regularized by process/cost”?

Sure, my premise is that the base rate probability of the null hypothesis being false in RCTs for drugs falls into a restricted range (for example, it’s not 1e-7). Why? Because there’s a huge investment just to bring a drug to a phase 3 clinical trial. If drug companies were bringing drugs into phase 3 clinical trials which had a base rate of 1e-10 or 1e-12, they would be going out of business much faster than they are.

Note the same “procedures” applied to things like ‘omics and observational analyses of complex systems such as epidemiology & social sciences have been far less reproducible and productive. Why is this? After all, the NHST procedure is mathematically the same, right? Although there are a multitude of reasons, including confounding and lack of randomization, one big issue is that the difference in base rate means that the same type I error control cutoff has very different error characteristics.

If the base rate of true positives is 1e-9999, it’s insanity to use the same .05 cutoff. Often the base rate is patched up with ad hoc “multiple comparison adjustments”, but when you ultimately care about the frequency of being right or wrong, it’s the base rate, not the number of comparisons, that’s the relevant adjustment.
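A rough sketch of why a multiple-comparison adjustment and a base-rate adjustment are different things. Everything specific here is an assumption made for illustration: a one-sided z-test, a real effect that sits 3 standard errors from zero, 20 comparisons, and the helper names. A Bonferroni cutoff depends only on the number of comparisons, so it is identical in both scenarios below, while the share of wrong “discoveries” is driven by the base rate.

```python
from statistics import NormalDist

_STD = NormalDist()


def power_one_sided(alpha, effect_in_se):
    """Power of a one-sided z-test when the true effect is effect_in_se standard errors."""
    critical = _STD.inv_cdf(1 - alpha)
    return 1 - _STD.cdf(critical - effect_in_se)


def share_of_false_claims(alpha, base_rate, effect_in_se=3.0):
    """Fraction of rejections that are false, given the base rate of real effects."""
    false_positives = alpha * (1 - base_rate)
    true_positives = power_one_sided(alpha, effect_in_se) * base_rate
    return false_positives / (false_positives + true_positives)


def bonferroni_cutoff(alpha, n_comparisons):
    """Per-test cutoff under a Bonferroni adjustment; note it ignores the base rate."""
    return alpha / n_comparisons


cutoff = bonferroni_cutoff(0.05, 20)  # same cutoff in both scenarios
for base_rate in (0.5, 1e-4):
    share = share_of_false_claims(cutoff, base_rate)
    print(f"cutoff {cutoff}, base rate {base_rate}: {share:.4f} of claims are false")
```

Under these assumptions the adjusted cutoff keeps false claims rare when half the tested effects are real, and still leaves most claims false when real effects are one in ten thousand.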

• Andrew says:

One can use prior information in non-Bayesian ways (as in my recent paper with Carlin on type S and type M errors), but it gets clunky and it can take a lot of work.

In my experience, Bayes is much more straightforward and, as Rubin puts it, lets you spend more time on the science. For example, time spent on setting up an informative prior should not be time wasted, as it involves interpreting the literature. And time spent on setting up a decision analysis should not be time wasted as it involves real considerations of costs and benefits. But time spent on setting an appropriate alpha-level or power or loss function, that’s much more indirect and harder to match to generalizable scientific ideas.

Still, I think the value of prior information is so important that I’m willing to put in the effort to express the ideas in non-Bayesian ways, as in my paper with Carlin.
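For readers unfamiliar with type S (sign) and type M (magnitude) errors, here is a minimal simulation sketch of the idea. It is not code from the paper mentioned above; the function name, the two-sided z-test setup, and the example effect sizes are assumptions for illustration. Given an assumed true effect and standard error, it reports the power, the probability that a significant estimate has the wrong sign, and the average factor by which significant estimates overstate the true effect.

```python
import random
from statistics import NormalDist, fmean


def sign_and_magnitude_errors(true_effect, se, alpha=0.05, n_sims=100_000, seed=0):
    """Return (power, type_s_rate, exaggeration_ratio) for a two-sided z-test.

    true_effect: assumed true effect size (must be positive here).
    se:          standard error of the estimate.
    """
    if true_effect <= 0 or se <= 0:
        raise ValueError("true_effect and se must be positive")
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)

    # Probability of a significant estimate with the right and the wrong sign.
    p_right_sign = 1 - std.cdf(z - true_effect / se)
    p_wrong_sign = std.cdf(-z - true_effect / se)
    power = p_right_sign + p_wrong_sign
    type_s = p_wrong_sign / power

    # Simulate estimates and keep the magnitudes of the significant ones.
    rng = random.Random(seed)
    estimates = (rng.gauss(true_effect, se) for _ in range(n_sims))
    significant = [abs(est) for est in estimates if abs(est) > z * se]
    exaggeration = fmean(significant) / true_effect if significant else float("nan")
    return power, type_s, exaggeration


# A small effect measured noisily versus a large effect measured well.
print("small, noisy:", sign_and_magnitude_errors(true_effect=0.1, se=1.0))
print("large, clean:", sign_and_magnitude_errors(true_effect=5.0, se=1.0))
```

The assumed true effect plays the role of prior information: it has to come from outside the data at hand, for example from the literature.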

• Keith O'Rourke says:

Anonymous:

Nice clear arguments here.

I would add a reluctance or even taboo against being clear that no experiment can stand mostly on its own but must depend on background knowledge (or as Peirce would put it, we can’t start inquiry from anywhere but where we initially find ourselves).

The arguments are then situated on whether these background considerations are best dealt with using priors (as Andrew argues) or ad hoc maneuvers by really smart knighted statisticians who seem to never fully explain themselves (Fisher?)

(OK could not resist that last tongue in cheek comment.)

10. Sherman Dorn says:

Sometimes you teach something so that students will know what they’re reading when they encounter the literature. Those in most Ph.D. programs need to be able to read a body of literature with hypothesis testing. So they need to know what it is, right or wrong, commonly abused, p-hacking and all.

On the other hand, it’s an extra cognitive load to say, “Learn this; it’s wrong, but you need to know it,” and then there’s the central question in building a course: what’s truly essential?

• Anonymous says:

‘… it’s an extra cognitive load to say, “Learn this; it’s wrong, but you need to know it,”…’

Stating so outright is better (less cognitive load) than being taught something that doesn’t make sense.

11. […] the problem of hypothesis testing. The dialogue between Vasishth and Gelman particularly crystallises the issue for practising […]

12. […] very important part, validate that outcome against ALL OTHER POSSIBLE ALTERNATIVE HYPOTHESIS.   Hypothesis testing has come under consistent and larger attack as time has gone […]
