Someone who wishes to remain anonymous writes in:

Last week, I was looking forward to a blog post titled “Why continue to teach and use hypothesis testing?” I presume that this scheduled post was simply preempted by more timely posts. But I am still interested in reading the exchange that will follow.

My feeling is that we might have strong reservations about the utility of NHST [null hypothesis significance testing], but realize that it isn’t going away anytime soon. So it is important for students to understand what information other folks are trying to convey when they report their p-values, even if we would like to encourage them to use other frameworks (e.g. a fully Bayesian decision-theoretic approach) in their own decision making.

So I guess the next question is, what then should we teach about hypothesis testing? What proportion of the time in a one semester upper level course in Mathematical Statistics should be spent on the theory and how much should be spent on the nuance and warnings about misapplication of the theory? These are questions I’d be interested to hear opinions about from you and your thoughtful readership.

A related question I have is on the “garden of forking paths” or “researcher degrees of freedom”. In applied research, do you think that “tainted” p-values are the norm, and that editors, referees, and readers basically assume some level of impurity of reported p-values?

I wonder, because it seems, if applied statistics textbooks are any guide, that the first recommendation in a data analysis is often: plot your data. And I suspect that many folks might do this *before* settling on the model they are going to fit. E.g., if they see nonlinearity, they will then consider a transformation that they wouldn’t have considered before. So whether or not they actually make the transformation, they *might have*, and that alone affects the interpretability of p-values and whatnot. Perhaps I am being an extremist. Pre-registration, replication studies, or simply splitting a data set into training and testing sets may solve this problem, of course.

So to tie these two questions together, shouldn’t our textbooks do a better job in this regard, perhaps in making clear a distinction between two types of statistical analysis: a data analysis, which is intended to elicit the questions and perhaps build a model, and a confirmatory analysis which is the “pure” estimation and prediction from a pre-registered model, from which a p-value might retain some of its true meaning?

My reply: I’ve been thinking about this a lot recently because Eric Loken, Ben Goodrich, and I have been designing an introductory statistics course, and we have to address these issues. One way I’ve been thinking about it is that statistical significance is more of a negative than a positive property:

Traditionally we say: If we find statistical significance, we’ve learned something, but if a comparison is not statistically significant, we can’t say much. (We can “reject” but not “accept” a hypothesis.)

But I’d like to flip it around and say: If we see something statistically significant (in a non-preregistered study), we can’t say much, because garden of forking paths. But if a comparison is not statistically significant, we’ve learned that the noise is too large to distinguish any signal, and that can be important.
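One way to see why a significant result from a noisy, non-preregistered study “can’t say much” is through a toy simulation (invented numbers, not from any real study): when the true effect is small relative to the noise, the estimates that happen to reach significance greatly exaggerate the truth.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy numbers (not from any real study): a small true effect buried in noise.
true_effect, se, sims = 0.2, 1.0, 100_000

# Each draw is one study's estimate of the effect.
est = rng.normal(true_effect, se, sims)

# Keep only the draws that reach two-sided significance at the 5% level.
sig = np.abs(est) > 1.96 * se

# The significant estimates are, by construction, far above the true 0.2:
# a "significant" result from a noisy design mostly tells you it got lucky.
print(sig.mean(), np.abs(est[sig]).mean())
```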

This seems too extreme a reaction to me. If you know Type II errors are more costly than Type I errors, then it makes sense to reverse the logic. But I don’t think there is any general case to be made one way or the other. So, if we emphasize the cases where noise overwhelms the signal, aren’t we just committing the same type of mistake (as focusing on cases where the admittedly biased signal dominates) but in the other direction? Just because practitioners, reviewers, editors, researchers, and the public don’t appreciate the shortcomings of “significant” findings, doesn’t mean that overcompensation is the answer. Surely there is a more balanced way to approach this?

I’ve been struggling with related issues while I try to modify the protocols for Registered Reports, which are primarily intended for purely experimental work, to a field that includes a lot of hand-collected data archives. I’ve been thinking of it this way: a theory-testing empirical paper contains three basic contributions:

1. Collecting data that can yield patterns that can potentially distinguish the proposed theory from alternatives.

2. Conducting hypothesis tests based on the resulting data.

3. Conducting additional exploratory analysis that helps readers understand the robustness of the results and provides fodder for new theory.

Those who advocate registered reports demand not only that authors spell out hypotheses before collecting/seeing the data, but also that they spell out their “analysis pipeline” to reduce researchers’ degrees of freedom. How can you trust their conclusions if they don’t spell out in advance what types of tests they will use, how they will deal with outliers, subjects who don’t pass manipulation checks, etc.?

However, I am getting a lot of pushback from my colleagues. Some are arguing that there simply isn’t much value in (2). If they collect useful data (1), and then use all the information they can get from looking at the data (3), what does (2) do for you? Won’t the analysis typically be awful, because they haven’t considered all of the useful forking driven by the realization of the data?

Bringing it back to the OP, if my colleagues are correct, what is the value of teaching hypothesis testing in the first place? If those who want to see a prespecified analysis pipeline are right, doesn’t it mean that we should be focusing more on teaching methods of hypothesis testing that have built in means of navigating the garden of forking paths? (E.g., teach robust regression so that authors can spell out in advance how they are going to deal with outliers.)

Oh, and does it matter whether data is gathered through a simple 2×2 experiment or by hand collecting naturally-occurring data? Either way, there seems to be great value in allowing registered reports that value (1) and (3) enough to grant an in-principle acceptance that does not depend on the realization of the data. You have to gather enough data intelligently enough, and analyze it thoroughly enough, but no one cares whether the data support or reject the theory. I have this sense that (2) is more valuable for experiments than observational data, but I’m not sure why.

But I’d like to flip it around and say: If we see something statistically significant (in a non-preregistered study), we can’t say much, because garden of forking paths. But if a comparison is not statistically significant, we’ve learned that the noise is too large to distinguish any signal, and that can be important.

I like this idea very much, especially when dealing with public policy questions. The tendency to run off half-cocked at the drop of a p-value is too great. While I appreciate the comments from the more expert here, I would point out that the vast majority of those who use statistical methods in all fields never reach the level of sophistication of an upper-level or graduate discussion on this matter. In all fields, I’ve come to think we need to follow Hippocrates—First, do no harm.

“the vast majority of those who use statistical methods in all fields never reach the level of sophistication of an upper-level or graduate discussion on this matter”

I think this is a point that deserves a lot more attention than it usually gets. I like Bayesian data analysis quite a bit, and I use it regularly in my own research, but it’s difficult to imagine that we wouldn’t be seeing pretty much all the same problems we’re seeing now if we completely replaced standard frequentist statistical tools with Bayesian tools.

This idea (“But if a comparison is not statistically significant, we’ve learned that the noise is too large to distinguish any signal, and that can be important.”) accords quite well with a point I want to make in a paper, where I’ve got a ton of data, I look for a simple effect (in this case, it’s a psych study with social media data, and I’m looking for differences by day-of-week) and the effect sizes are tiny and not statistically significant. I was a bit uncomfortable making this case based on p-values alone, but seeing as I’m the one who went down the garden of forking paths and I’ve got a huge sample, it seems like a rigorous conclusion to simply say: nope, there’s simply no effect there! Does this make sense to others? What’s a more Bayesian argument for this point? (I’m just using OLS at the moment, but again, huge samples.)
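A minimal sketch of the situation being described (all numbers invented, not the actual study): with a huge sample and no true day-of-week effect, the estimate comes with a very tight confidence interval, which is what lets you bound the effect near zero rather than merely fail to reject.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a large sample with no true day-of-week effect.
n = 200_000
day = rng.integers(0, 7, size=n)      # day of week, 0-6
y = rng.normal(0.0, 1.0, size=n)      # outcome with zero true effect

# One simple contrast (of many possible): weekend vs. weekday means.
weekend = day >= 5
diff = y[weekend].mean() - y[~weekend].mean()
se = np.sqrt(y[weekend].var(ddof=1) / weekend.sum()
             + y[~weekend].var(ddof=1) / (~weekend).sum())

# With n this large, the 95% interval is tight around zero, so the
# conclusion is more than "not significant": any effect is bounded as tiny.
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(diff, ci)
```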

Seth:

Don’t say there’s no effect. Better to say the effect is not statistically distinguishable from zero. If you want more, you could fit a hierarchical model—but if there’s really nothing interesting going on, maybe it’s not worth the effort to go there.

Since “statistical significance” is arbitrary and case-dependent, that is a strange attitude to take. Moreover, if there is a strong underlying model based on components that are well proved, it is simply wrong. Where it gets really tough is where there is no possibility of expanding a study to provide a sample with more power, because of the Heisenberg effect (e.g. you change the system by measuring it).

When you say “statistically indistinguishable from zero”, does that mean there might be other, non-statistical, ways of teasing out the effect?

Ok, I’m going to say it. This garden of forked paths stuff is a credibility destroying joke. It’s not possible to take a Statistician seriously when they claim the validity of an inference turns on whether the researcher would have the same psychological state in a different universe that yielded different data.

Look, for any data set if you try hard enough you can “successfully” (by statisticians’ standards) model the data. This is true no matter where the data came from, how it was generated, what caused it, or even what the numbers mean. “Successfully” modeling the data thus doesn’t tell you whether you’ll see the same patterns in the future, or different contexts, or other populations.

The validity of those inferences depends entirely on whether the physical system modeled has the stability properties needed so that patterns do hold in the future, or other contexts, or other populations. It has absolutely nothing to do with the fact that you tried too hard to model the data.

This is a practical difference. If you’re right, then the solution is to use things like “pre-registration” to prevent modelers from trying too hard. If I’m right, then the solution is for researchers to independently check that the physical systems they’re modeling have the kinds of stability properties they’re claiming and relying on. Creating a “valid” or “checked” statistical model by itself doesn’t check this.

This whole episode does illustrate one principle though. Anyone taking Frequentist ideas seriously enough eventually has to invoke radically subjective theories like the Garden of Forked Paths in order to explain its failures.

Anonymous,

What makes you think that there are “physical systems” underlying social processes? Let’s at least admit that, if there really are “physical systems” underlying social behavior, we have no idea what they look like. So no causal inference on, say, voting behavior or discriminatory hiring practices or most of the things social scientists are interested in can really stem from an epistemology based on deduction from equations representing physical systems. It has to come from somewhere else.

In most social sciences, I think the epistemology of future causal inference comes from reverse causal thinking and an implicit rejection of the most severe kinds of Hume-ian skepticism – in the past, this kind of change A has led to this kind of effect B, and so in the future, if other conditions are still similar, the effect of A should still be B.

And where is the grounding regarding the reverse-causal arguments? It is not in the truth of the physical system being modeled and estimated. In fact, in Economics, we have been intentionally running further and further away from even trying to model the “true underlying process.” We are fitting models that are less and less interested in the total “physical process” and more and more interested in latching on to the right kinds of variation (implicitly making the right kinds of comparisons) in the world.

So for example – consider the (occasionally maligned by Andrew) RDD design. There is some running variable X, and outcome Y is naturally smooth across X. Then there is some threshold value X* such that on one side of X* people get policy Z and on the other side they don’t (so, getting into a “gifted” high school based on a test score X). We estimate some smooth function of X, and allow it to be discontinuous at X*, and measure the difference.

What is this but a terrible model of the “underlying physical process”? If we want to model the returns to elite education, predict elite education with only one test score being above/below X*, and use that prediction to estimate the returns to education, that is an intentionally terrible model of the physical process. But it gets us the estimate we want, because it compares very similar people who were or were not exposed to elite education, and compares their wages down the line.
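The RDD comparison described above can be sketched in a few lines (a toy simulation with invented numbers: running variable, cutoff at zero, and a true jump of 2.0). The model is deliberately a poor description of the “physical process,” yet the discontinuity coefficient recovers the effect because it makes the right comparison near the cutoff.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy RDD: running variable X (e.g. a test score), cutoff X* = 0,
# treatment Z (elite school) for X >= 0, true jump in Y of 2.0 at the cutoff.
n = 5000
x = rng.uniform(-1, 1, size=n)
z = (x >= 0).astype(float)
y = 1.0 + 0.5 * x + 2.0 * z + rng.normal(0, 0.5, size=n)

# A "terrible model of the physical process" but the right comparison:
# regress Y on X (linear control) plus an indicator for being past the cutoff.
X = np.column_stack([np.ones(n), x, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[2])   # estimated jump at the cutoff, close to the true 2.0
```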

I see very little value in getting the “right” physical process described. I see tons of value in making the “right” comparisons.

I also think your interpretation of the Garden is wrong, but I still struggle with it myself, so maybe I’m being too generous to Andrew. But in the above example, suppose you were to plot your data along X, see a linear-ish cloud, and decide to run one “analysis” – an RD with a linear control. Well, if you had seen something else in the raw data, you’d have selected a different analysis – perhaps using a cubic polynomial over X. I think that reasoning right there is sufficient to make the Garden a legitimate problem, with no “psychological state of the researcher” involved at all.

JRC,

You’re way over thinking this. Suppose you claim there’s some biological parameter mu which is the same for the 20 year old coeds you tested at University of Who Cares as it is for 80 year old women in India, and moreover it doesn’t change in time.

The validity of your inferences from your 20 year old coed sample to the women around the world, or to future women depends on mu actually existing and having the properties you assumed. But here’s the thing: all the statistical model checking and verification done on that 20 year old coeds data doesn’t verify this. Not even a little. But people believe if they’ve got a “good” statistically “verified” model for the data that they’re ok.

This is why physicists have been so much more successful. Their usual practice would be to (1) verify mu exists and has the desired stability properties, (2) measure it carefully and (3) create a theory for it. Social scientists usually don’t do (1) even after the fact and then stand back in shock when their theories have no predictive value.

+1 about struggling with the Garden of Forking paths.

I’ve never really understood that position. Sometimes I think it’s just fishing but then Andrew says no, that’s different. At other times I think of it as an ultra-wide, all-encompassing nihilistic statement about not believing any theory / analysis at all. Since obviously the person may have thought differently under a different set of circumstances.

@Anonymous: You have excellently summed it up as

“the validity of an inference turns on whether the researcher would have the same psychological state in a different universe that yielded different data.”

I’m not even sure how to deal with a claim like that.

“the solution is for researchers to independently check the physical systems they’re modeling have the kinds of stability properties”.

This is the frequentist solution too. Frequentist methods are used only when they afford stability properties: the stable relative frequencies are often generated in simulations, as in resampling. That is, statistics enters when you can’t independently check that the domain-specific physical systems have these kinds of stable properties. They may enable checking that specially designed methods that interact with aspects of the phenomenon do. That’s where statistics enters to parallel what’s done in theory rich sciences.

I like to always do a simulation study first so, even before the main analysis has been done, we know the effect size that can be resolved by the design.

After that, when the main result is that we could not reject H0, I conclude that the true effect size is bounded within the range -a to a.
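The design-first simulation described above can be sketched as follows (a hedged toy version, assuming a two-sample t-test design with invented sample sizes and effect sizes): before any data are collected, simulate the design to see which true effect sizes it can reliably detect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def power(effect, n, sigma=1.0, alpha=0.05, sims=2000):
    """Simulated power of a two-sample t-test for a given true effect size."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, sigma, n)
        b = rng.normal(effect, sigma, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / sims

# Before collecting data: which effect sizes can this design resolve?
# Small effects (0.1) are essentially invisible; larger ones (0.5) are not.
for effect in (0.1, 0.3, 0.5):
    print(effect, power(effect, n=100))
```

A non-rejection from this design then bounds the true effect roughly within the range the simulation showed to be undetectable.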

Why not try teaching tests correctly rather than building on half-baked erroneous animals like NHST that ignore power, were never part of any official testing methodology, and fallaciously allege you can go from a statistical to a substantive inference, ignoring long-known fallacies of rejection? The pretense by psychologists that statistical tests have no way to interpret negative results is especially egregious, given that Cohen labored amongst them to inculcate power analysis. The text from my son’s high school AP class in statistics beats out the discussions I see daily, and they even include tests of assumptions! Here are a couple of things, the first from our recent seminar.

http://errorstatistics.com/phil6334-s14-mayo-and-spanos/phil-6334-slides/phil-6334-day-6-slides/ (see second set)

http://www.phil.vt.edu/dmayo/personal_website/2006Mayo_Spanos_severe_testing.pdf

Hey Mayo, you’re smart on the evils of p-hacking and are known for distinguishing “real” from “nominal” p-values. You claim that many (if not most) of these problems are caused by people quoting small “nominal” p-values when the “real” p-value is much higher. I have a question which your expertise might be able to answer.

Suppose there are two scientists working on the same project in the same lab, and they both get the same data at the same time. They’re partners and will publish one paper together, but they do their analyses separately. Scientist A is only interested in one thing and so does one hypothesis test for that single effect. He finds it statistically significant, since the p-value is less than .01.

Scientist B has broad interests, so he takes the same data and conducts 500 hypothesis tests, including, as it happens, the same one that Scientist A performed. All the tests fail except for the one that both A and B did. Scientist B obviously finds A’s hypothesis statistically significant at the same level as well (it was the exact same test using the exact same data).

So my question is: when they write their joint paper together, is their p-value “nominal” or “real”? That is to say, should they report a statistically significant effect, or should they dismiss the effect they found as merely the result of p-hacking?
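The multiplicity behind this question is easy to make concrete with a quick simulation (assuming, for illustration, that all 500 tests are independent and all nulls are true): Scientist B is almost guaranteed to find at least one “significant” result at the nominal .01 level.

```python
import numpy as np

rng = np.random.default_rng(3)

# If all 500 nulls are true, how often does Scientist B see at least one
# "significant" result at the nominal .01 level?
sims, tests, alpha = 10_000, 500, 0.01

# Under a true null, p-values are uniform on [0, 1].
p = rng.uniform(size=(sims, tests))
at_least_one = (p.min(axis=1) < alpha).mean()
print(at_least_one)   # close to 1 - 0.99**500, about 0.993
```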

Anon:

I think Scientist B has the right idea, but I think he should fit a hierarchical model to make better use of all the data, as discussed in my paper with Hill and Yajima.

Andrew, here’s a Garden of Forked Paths question for you:

Two scientists working side by side, with the same data and same assumptions. Scientist X only has expertise in one thing, so he can only do one hypothesis test, and gets a statistically significant result from it. Scientist Y has greater expertise but is lazy and perhaps unscrupulous. Although Y doesn’t communicate his goal to X, he intends to keep conducting tests until he gets a stat sig result. It turns out, though, that Y does the same initial calculation as X and gets the same answer as X. So Y quits performing tests, satisfied with that one test.

Now according to you, Scientist X isn’t guilty of the Garden of Forked Paths, but Scientist Y is guilty as sin. Mayo would undoubtedly agree with you, because Y is guilty of a kind of optional stopping, and she takes it as a fundamental principle that this alters the legitimacy of Y’s calculation.

So my question is what happens when they write a joint paper? Is their published result legitimate or not?

I stress they both physically did the exact same thing with the exact same data, on the exact same experiment, and got the exact same numbers, the only difference between them is their personal psychological intentions.

Anon:

The garden of forking paths is a serious concern when results are summarized by a p-value on a single comparison. It doesn’t come up so much when a more comprehensive hierarchical model is fit, that includes a large number of comparisons of interest.

Ultimately, if I’m interested in a particular science or engineering question and I want inference based on some data, I’ll prefer to do a Bayesian analysis, and the p-value for a null-hypothesis-significance test is irrelevant. For example, in that ovulation-and-voting study, I’d do a Bayesian analysis, throw in all the data, and find out that there’s no evidence of anything going on.

The issue is that, in many cases, researchers publish papers without all the data. All they present are some comparisons and p-values. In that case, yes, as an outsider I need to know how these particular comparisons were selected. I need to know the selection rule.

Let me say it again: Conditional on all the data in the experiment, I would analyze X and Y’s data in the same way. But if all I get to read is a published article, and this article reports only one comparison, then it is important to me to know how the comparison was selected.

Once again: if they publish one paper, which they easily can do because they literally did the exact same thing, only their intentions differed, then what’s the proper interpretation of their result?

Anon:

If they published papers with appendixes with all the data, I think the proper interpretation is to analyze all the data in the appendixes, making use of whatever prior information is available.

If they published papers only with one particular comparison, then the proper interpretation depends on the selection rule. It’s the same as in the Monty Hall problem or in any sampling situation: the data don’t tell the story on their own, you also need the likelihood, which includes the probability of reporting whatever has been reported. We discuss this in chapter 8 of BDA3.

There is no unique likelihood! In essence, each researcher has their own likelihood. One says good things, the other says bad things. So when they publish their joint results, how are we, the consumers of research journals, to interpret their result?

Why is it so difficult to get a straight answer? You’re the one who claims their motivations affect the result. So when two researchers have incompatible motivations but publish a joint paper, what’s the proper interpretation, according to you, of their work?

Anon:

You write: “One says good things the other says bad things. So when they publish their joint results how are we the consumer of research journals to interpret their result?” My response is that the best thing is to analyze all potential comparisons in the data in a hierarchical model, not to look at just one comparison and worry about selection.

If, for whatever reason, one is restricted to only analyze one comparison or one small set of comparisons, then we need to model the selection involved in the choice of comparison. The issue isn’t motivation, it’s selection.

One researcher has good motivations, which legitimize the result. One has bad motivations, which de-legitimize the result. Since they got the same result, they are able to publish a joint paper stating that result. Should we consider the result legitimate or not?

It’s a simple question. I’d really appreciate a straight answer.

Andrew,

The selection has been analyzed! One used a good selection rule, the other used a bad one.

They got the same result and published it together. What are we to make of their paper?

Rereading this thread, it’s clear that the best thing to do is analyze the data correctly in some sense. That’s not the issue. The issue is: given the same “foundation” you used to make claims about p-hacking and the Garden of Forked Paths, how should we interpret that joint paper (according to your “foundational” principles)?

Still no answer.

Could it be that there is no answer because the Garden of Forked Paths had a crappy foundation and is nonsense?

Responding to Anon’s criticism:

The attitude you should take with regard to published findings doesn’t depend on what’s in the head of the author, but on whether you’d expect such findings to be published only if they’re true. Since the “garden of forking paths” is a pervasive problem in many fields (and presumably in the hypothetical field in question), you would be wise not to trust the finding. Even if the “good” researcher happened to be the sole author, you probably have no way to know for sure that she was “good”.

What if you personally knew the “good” researcher, and had reason to believe that she was “good”, and could somehow correct for your own bias as her friend? In that case, the question is: is it more likely that she just got lucky, or that the effect is real? Even in that case, the question mostly depends on how much of a basis she had for posing her hypothesis. If it’s just a random guess, then I’d say it’s still more likely that she was lucky. I don’t know what Andrew would say on this point.

But no, taking TGOFP as a valid critique does not mean you suddenly need to be able to read the author’s mind in order to decide what to make of a study.

Quinn:

I propose the “Fundamental Law of Philosophy of Statistics” which says,

“If your understanding of probabilities makes you think scientific failures are due to the fact that researchers would have acted differently in a different universe that yielded different data, then you need to abandon your philosophy of statistics wholesale and start over from scratch.”

Actually let me spell this out plainer. The basic implication is:

(Garden of Forked paths) implies (lots of failed science)

you could substitute any number of other things for “garden of forked paths”, but I’ll take that as an example.

Now everyone goes out and observes lots of failed science. They then draw two fallacious conclusions: (1) that their philosophy of statistics, which pointed them to GoFP, is right, and (2) that GoFP is an important ingredient in all this failed science. Neither of those conclusions follows.

Then having failed to correctly identify why the science is turning out so bad, they base science reforms on these conclusions. Instead of addressing what’s actually causing the science to be bad, they start talking about things like pre-registration, as if the date the scientific question is asked affects the truth of the science.

Both scientists find the same evidence regarding the hypothesis that is in common, as they both had the same relevant data and the evidence is in the data. Their procedures have different error rates associated with them if they make a decision because the frequentist error rates are associated with the method.

Scientist B has the more efficient approach to finding evidence but the more risky approach in terms of false positive errors. Both scientists should know that the best way to know if their evidence is reliable is to do another experiment to test that hypothesis again.

I reckon that this stuff is confusing because we tend to conflate the evidence with the reliability of the procedure. The label NHST is a warning sign that the conflation is going on.

Michael and Andrew,

Fine points, but Mayo thinks she can save frequentist hypothesis tests by recognizing p-hacking. At the very least this requires an objective definition of p-hacking. So in the example cited, is the p-value reported in their joint paper illegitimate or not? Is it the result of p-hacking or not? Mayo says “no” if it were A alone, and “yes” if it were B alone. When they work together, what’s the objective answer? What if there were 40 scientists working together, as often happens?

Here’s the larger point: both A and B calculate the same numbers from the same data and the same assumptions. The only difference between the two is their psychological state. Mayo wants to use that subjective psychological distinction to weed out bad abuses of p-values. Like I said before, anyone who takes Frequentism seriously enough for long enough eventually proposes a radically subjective theory in order to try and fix it.

I’m not trying to save error statistical tests; if they disappear, or continue to wane as they are, they will only have to be reinvented if we’re to obtain knowledge from statistics (it’s only a minority source of knowledge, for special areas). They wasted over a decade on microarrays failing to randomize, and they’ll waste more by ignoring reliable methods.

It’s not psychology when you have a methodology that picks up on differences. Savage and others declared optional stopping a mere matter of intentions, locked up in someone’s head, only because they lacked the means to pick up on the very real difference in error probabilities of the two methods. I suppose the unreliability we see is all in our imaginations too; maybe we’re just dreaming that Anil Potti could actually defend his method of model “validation” using priors.

In any event, the effects of data-dependent selections, post data subgroups, barn hunting, while real, can at times have their threats removed with subsequent information. No thanks to the method, but other info can license the result. Moreover, in many other cases, selection effects, data-dependent hypotheses, double counting are legit and satisfy error probabilities, e.g., in testing assumptions and many other areas. Part of my work has been to make these distinctions. It grows out of a long-standing problem in philosophy about ad hoc hypotheses and novel predictions.

You didn’t answer the question. According to Error Statistical philosophy how should we interpret the paper the two scientist write together. Is the p-value they report real or nominal? That’s a distinction you harp on a lot about and insist is crucial to dealing with these problems. So why can’t you just give a straight forward answer?

One scientist does one test. We can even assume this scientist has neither the ability nor the interest to ever do any other test. The other does the identical test along with 499 others. They write up the one identical test they both performed in a paper and publish it. Is the p-value reported for that test “real” or “nominal”, according to you?

It’s a simple question. Why can’t your philosophy answer it simply?

Anonymous, you are asking an unanswerable question. It’s not just Mayo’s philosophy that cannot answer it.

The P-value in the paper will be real with respect to the evidential component of the data, in that it points to the correct likelihood function. If the experimental protocol was affected by the multiplicity of testing then there would be an increase in the unreliability of the evidence, but no change in the strength of the evidence or the parameter values that it favours. However, I don’t think we can decide whether multiplicity matters to the methods used.

Mayo claims avoiding p-hacking is going to save frequentist statistics from a whole mess of embarrassing “research”, but now you’re saying that the question “was this p-hacking?” is unanswerable in general, even when you have all the facts in hand.

“They wasted over a decade on microarrays failing to randomize, and they’ll waste more by ignoring reliable methods.”

The need for randomization in microarrays has nothing to do with the neglect of statistical tests and everything to do with systematic error/bias/confounding. In fact analysts were very anal about not only using p < .05 hypothesis tests, but 'carefully' incorporating various multiple comparison adjustments into their analyses to keep type I / FDR error control 'valid'. Lot of good that did.

How is Bayesian any better? If you are critiquing on the grounds of subjectivity there’s tons of that in the Bayesian approach.

What if there were a third coauthor, Scientist C? This guy only did the one test, as did Scientist A, but he intended to do hundreds more. After seeing the positive result on the first test, however, he quit and never did the others. Lazy snot.

Now when all three of these guys coauthor the same paper, with the exact same numerical results using the same data/assumptions, what is the status of their reported p-value? Is it legitimate or is it an example of p-hacking?

What if in an alternate universe all these guys in your hypothetical example were Bayesian? How would that improve the situation?

There are two problems in reality. One is that classical hypothesis testing implicitly contradicts the sum and product rules of probability. If you do real Bayes, and not an ad-hoc combination of frequentist and Bayesian methods, then that automatically eliminates those problems.

A second problem is that frequentist philosophy causes researchers to believe they’ve “verified” certain physical claims when they haven’t. Bayes doesn’t fix this directly, but at least if you combine Bayesian methods with a Bayesian foundation (most Bayesians actually use a frequentist foundation in practice) then you won’t be fooled and you’ll know there is still something that needs to be verified.

@Anon:

I don’t see your confusion. Scientist A is reporting the real p-value & has a legit effect.

B, if he reports this one p-value, is cheating. That would be fishing.

I don’t see the contradiction. The effect is real. One test finds the effect. The other approach wasn’t strong enough to detect the effect.

The reasonable course might be for B to treat this analysis as exploratory in the light of what he observed. Then collect another data set with the express goal of only testing for that specific hypothesis. And if he again gets p<0.01 then publish.

I agree with Rahul. This is the kind of example tediously repeated as if it were preferable to have an account that couldn’t make, or care about, any distinction due to selection effects, or address the problem. Two people could also get the same clinical data/subjects, one by randomization, one by judgment selection, and we’d be able to say there’s a difference in the capability of the two methods.

It’s not an example. It’s a question, and it is telling in the extreme that none of you can answer it. The two scientists write a single paper and report the p-value which they both separately, but identically, calculated; the only difference is that one scientist did only one test while the other did an additional 499.

Is that p-value and its associated inferences legitimate or not?

Hey Rahul, you’ve given me an idea. We should push to only allow single author research papers. That way we never have to face the possibility that one author gets the numbers from p-hacking while the other gets the identical numbers legitimately.

So here’s the list of Statistical Reforms that are going to save Statistical practice!

(1) Pre-registration. Because truth doesn’t depend on the answer so much as when you decided to ask the question.

(2) The Garden of Forking Paths promise. Make researchers swear on their mother’s grave that if they had lived in a parallel universe, they would have done the identical analysis. Those that don’t swear this will have their research papers burned.

(3) Single-author papers. That way we only have to divine the psychological state of one researcher and never have to worry about multiple researchers having incompatible psychological states.

That should clear up all the problems. Sounds like you super geniuses are on top of the situation!

Mayo:

Teaching tests correctly is fine—I think I do a pretty good job of it in chapter 6 of BDA! As you note, part of teaching tests correctly is to recognize that hypothesis testing solves some problems but not others, and that it is not appropriate to use rejection of a specific statistical null hypothesis A to assert the truth of a general scientific hypothesis B.

I also think there’s a matter of attitude. If you teach methods and roll your eyes, it’s quite different from having them taught by a Lehmann, a Cox, or a zillion and one frequentists I can name. It may not be possible to be immersed in one kind of method in one’s research and instruct in a very different kind of methodology with the same depth of enthusiasm and commitment as someone whose research is in the latter. It would be one thing if the divide were more like different theories in other fields, but here there’s nearly always a deeply felt bias, conscious or not, that invariably seeps into instruction. I don’t know that this is always happening, but with leaders in the field especially, I think it does. Perhaps statistics should be entirely taught by pre-packaged courses designed by researchers in the respective approaches.

Mayo:

Nobody’s talking about rolling their eyes. The issue is that students’ time is limited and we have to choose what to cover in a course. There are a zillion things I’d love to cover if I had the time. I would not spend any time at all on type 1 and type 2 errors—except that these terms do float around in scientific circles and so it’s probably a good idea, at least for more advanced students, to explain what they mean.

Regarding Lehmann, Cox, etc.: What can I say? Laplace was a brilliant mathematician, more brilliant than Lehmann, Cox, and me put together. When Lehmann and Cox wrote books that included only glancing mentions of Bayes or that presented arguments as to why they preferred non-Bayesian approaches, were they “rolling their eyes” at the great Laplace? No, they (Lehmann and Cox) were just presenting their own perspectives, to the best of their abilities. That’s what I do too.

Regarding your last sentence, all I can say is that researchers in all sorts of different fields come to me to learn Bayesian statistics. Statisticians do research on statistical methods, and there’s a reason why non-statisticians want to learn from us.

> there’s a reason why non-statisticians want to learn from us.

What might not be obvious to non-statisticians is that it is hard even for statisticians to grasp statistics well enough to apply it thoughtfully in empirical research. Most of that has to be learned post-PhD, actually working in research.

Perhaps many statisticians don’t get to that level, from a nominal understanding of statistics to a pragmatic or purposeful one (to use Peirce’s vocabulary). Given the effort and time involved in first getting that nominal understanding of statistics, few non-statisticians are likely to get that far and beyond (although I do know some).

+1

_The issue is that students’ time is limited and we have to choose what to cover in a course._

Yes. Some may perceive the typical presentation of statistical testing in many textbooks as lacking, merely leaving students with an over-confidence that they understand testing simply because they can carry out the calculations. This may lead them to worry that there just isn’t enough time to get all of the students beyond the “knowing just enough to be dangerous” stage with hypothesis testing.

You’re missing the point, which I thought had to do with the question I initially answered (your title): how should we teach (frequentist) hypothesis testing? I wasn’t talking about the value of going to Gelman to learn Bayesian statistics. I’ve no doubt one learns a lot from your classes.

Mayo:

In answer to the question, What should we teach about hypothesis testing?, my response is that we should teach the _underlying goals_ of assessing model fit and of using data to adjudicate between competing scientific hypotheses, and we should also teach _classical concepts_ of null hypothesis significance testing, type 1, and type 2 errors, but we should also explain why these concepts typically don’t address those goals. And then I’d move to type M errors, type S errors, and multilevel models. How much of all this to cover? It depends on the level of the course. In the simple, no-math course that Eric Loken, Ben Goodrich, and I just created, we talked about statistical significance and the null hypothesis but we did not talk about p-values or type 1 and type 2 errors. In a more advanced course I’d explain more.

Statistical tests aren’t aimed at adjudicating between substantive scientific hypotheses. The tests are designed to assess and control the probabilities of erroneous interpretations of data that are modeled in terms of aspects of statistical distributions. N-P certainly intended type 1 and 2 errors to assess magnitudes and directional errors, and Fisher also covered what you call M and S errors (see Cox’s typology of significance tests). So I think it’s incorrect to say the tests can’t address these goals.

Students should learn how tests and estimation methods enable going from inaccurate to more accurate data and using specially designed data collection to uncover features of what generated the data–along with precision and accuracy assessments. To teach methods and say they don’t accomplish your goals is, in my judgment, misleading. Let the tests be taught authentically without viewing them through a particular philosophy e.g.

http://errorstatistics.com/2013/12/19/a-spanos-lecture-on-frequentist-hypothesis-testing/

Mayo:

I think the problem is that many researchers want to use statistical significance to prove that some hypothesis is true (indirectly, by proving that some other, “null,” hypothesis is false), but I think statistical significance is most helpful when it is absent, to inform people that the data alone are too weak to tell us much about some question being asked. This is what I meant in my above post when I said that statistical significance is more of a negative than a positive property.

But once I take this position, this implies a huge change in how statistical significance, p-values, and hypothesis testing are taught. Cos in the standard treatment, it’s all about getting a low p-value and getting statistical significance and learning something important.

Andrew:

I really like this statement :”we should teach the underlying goals of assessing model fit and of using data to adjudicate between competing scientific hypotheses” I think that is really what we want to do in terms of teaching students to think like researchers.

When you say you did not teach about p values or types of errors, but you did teach significance testing what exactly do you mean that you did? How do you talk about statistical significance in the absence of a value for alpha at least?

Elin:

I say that, if the model is correct, the 95% confidence interval will contain the true value 95% of the time, and that a comparison or estimate is called statistically significant if the 95% interval excludes zero, that is, if the estimate is more than 2 standard deviations from zero. I say that if the estimate is _not_ statistically significant, that is, if it is less than 2 standard deviations from zero, then the difference could reasonably be explained by chance.

But I don’t talk about one-sided or two-sided tests, or p-values, or the area under the curve, or the probability of seeing something at least as extreme as the data. Again, in a more advanced course, I’d need to cover some of this, if for no other reason than to connect with other things the students might hear from other sources—but I’d try to cover it in a holistic way, discussing p-values etc. as valiant but generally flawed ways of trying to get at the uncertainties involved in scientific inference.
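The rule described above (call an estimate significant when the 95% interval, taken as the estimate plus or minus 2 standard errors, excludes zero) fits in a couple of lines. A sketch, with function names of my own invention:

```python
def two_se_interval(estimate, se):
    """Rough 95% interval: estimate +/- 2 standard errors."""
    return (estimate - 2 * se, estimate + 2 * se)

def is_statistically_significant(estimate, se):
    """'Statistically significant' in the sense above:
    the 2-se interval excludes zero."""
    lo, hi = two_se_interval(estimate, se)
    return lo > 0 or hi < 0

print(is_statistically_significant(0.30, 0.10))  # True: interval is about (0.10, 0.50)
print(is_statistically_significant(0.15, 0.10))  # False: interval covers zero
```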

Andrew: to reply to your remark below, teaching that an “estimate is called statistically significant if the 95% interval excludes zero” is to encourage highly dichotomous construals of the sort that even a p-value avoids.

Andrew:

It sounds like you do it somewhat like I do (no areas under curves, no one sided tests, more of a focus on a verbal explanation from both me and the students (“explain in your own words what this means”)). Because I spend a lot of time on sampling, we do talk about possibility that you could get a sample with no relationship even though there is one in the population and the possibility that you get a sample with a relationship even when there is none in the population. You are right that a lot of it is about making choices about how to use limited class time and what makes sense for audience. I have a lot of social work students in my classes and for them it is very important that they consider the possibility that what they are doing is having either 0 effect or the opposite of the intended effect (and I’m trying to convince them to use data rather than rely on instincts), and I think those are somewhat different things than M and S. On the other hand for political science analyzing poll data, they aren’t doing anything that impacts people directly.

I really think a lot of this discussion goes around in circles because people are thinking about differing kinds of data, purposes of analysis, and types of students.

Elin:

I agree with you. In my teaching and textbooks I’ve tended to focus on methods and models. Discussions such as this one have led me to believe that I should be starting with the scientific and engineering _goals_ (for example, generalizing from sample to population, estimating comparisons, discovering problems with theories, adjudicating between theories) and then connecting these goals with the methods. Rather than saying simply that I don’t like p-values or I don’t like null hypothesis significance testing, it would be better for me to acknowledge the real goals that lead researchers to develop and use these sorts of methods, and then explain what I don’t like about the methods, and how I think the methods don’t address the goals as directly as people might think.

But, as you and I both note, it is necessary to make some hard choices about what to cover in a 13-week course, given all the above, not to mention the limitations of many of our students when it comes to mathematics, programming, and experience doing scientific or quantitative research.

And, on top of all that, we all know that “wrong” methods can still be useful. Bayesian methods with bad priors can still give good answers; classical moment-based and likelihood methods can do just fine even when they are based on probability models that don’t make sense; regularization methods such as the lasso and BIC can be useful research tools even if their implicit models of complexity tradeoffs don’t make sense in the examples where they are applied; etc., etc.

So, teaching statistics is tough. Part of the problem is that we are teaching principles, methods, and practicalities all at once, and they all get mixed up in students’ (and instructors’) minds.

Brian McGill wrote a really nice post about the value of exploratory statistics, and how it ought to be acceptable to be upfront about this – both in papers and even grant proposals. Just don’t pretend to hypothesis testing – admit to hypothesis searching.

https://dynamicecology.wordpress.com/2013/10/16/in-praise-of-exploratory-statistics/

I would very much like to see the jargon “statistical significance” to be abandoned altogether, and would prefer to teach and think about hypothesis testing from the viewpoint of model checking or consistency. That is, if I were to teach hypothesis testing, I would say something like:

“Let’s write down a probability model for (potentially observable) data in an experiment and then use this model to calculate the probability of observing something at least as extreme as we did in the actual experiment we conducted. If it turns out that this probability is very small, then perhaps we can’t trust the probability model we wrote down to describe the process that generated the data. Furthermore, note that we still have to think about what it means if this probability is not small — at no point can we accept or reject anything.”

It seems that throwing in jargon like “significance” only contributes to imprecise thinking. I don’t think the concept of hypothesis testing was ever meant to be packaged as a black-box method with “significance levels,” as it seems to be so widely used today, as opposed to a general methodology for checking whether a probability model is reasonable for describing a set of data.

Additionally, it seems to me that the p-value has no intrinsic meaning: it is just some unitless measure of extremity where extremity is defined according to a pre-specified model. Perhaps its only salient property is that it is uniformly distributed under the null model and can therefore be used as a universal test statistic (in the sense that the uniform distribution of the p-value is invariant to the actual test statistic one uses). But of course we still don’t escape the fundamental question: what does it mean for a data set/collection of statistics to be consistent with a probability model (in the p-value case, a uniform)?
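The uniform-under-the-null property mentioned above is easy to check by simulation. A sketch using a one-sample z-test with known variance (pure Python; the setup and names are mine):

```python
import math
import random

random.seed(2)

def p_value(z):
    """Two-sided p-value for a standard-normal test statistic."""
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

n, n_sims = 25, 20_000
pvals = []
for _ in range(n_sims):
    sample = [random.gauss(0, 1) for _ in range(n)]  # the null model is true
    z = (sum(sample) / n) / (1 / math.sqrt(n))       # sample mean / its s.e.
    pvals.append(p_value(z))

# Uniformity: roughly a fraction t of p-values falls below any threshold t.
for t in (0.05, 0.25, 0.5):
    print(t, sum(p < t for p in pvals) / n_sims)
```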

Giri, you are right to feel that the P-value has limitations, the primary one of which is its model-dependence. However, it has a richer meaning than you suppose: it is a pointer to the relevant likelihood function. Within a model there is a one to one relationship between P-values and likelihood functions for the parameter of interest.

I have explored this idea in more depth, as well as answering some arguments against the utility of P-values, in a paper on ArXiv: arxiv.org/abs/1311.0081

The likelihood function gives a more complete picture of the evidence in the data regarding the parameter of interest, but it is specified by the P-value and it is model-dependent. If you explore multiple models then you will get multiple P-values and multiple likelihood functions from the single set of data. (Some of the models may yield the likelihood functions more easily than others, depending on the relationships among the model parameters.)

” use this model to calculate the probability of observing something at least as extreme as we did in the actual experiment we conducted. If it turns out that this probability is very small, then perhaps we can’t trust the probability model we wrote down to describe the process that generated the data.”

It would be a disaster to thereby go back to very early Fisher and ignore the crucial role of the alternative and power function. What would you infer after finding a small probability? Invariably, some alternative model, and this would be quite unwarranted because its errors haven’t been well probed by merely finding the small p-value. Further, there are many different distance measures to use, giving incompatible p-values. And why use the tail area? In short, you’d be better off starting from N-P testing with alternatives and constraints on what you may infer—if you were to keep to a single method. I have my own preferred interpretation of N-P methods, but the methodology at least (along with corresponding CIs) is superior to those rudimentary tests.

Hi Mayo:

I do not believe an alternative hypothesis nor a concept of a power function prevents ambiguities inherent in hypothesis testing, an approach which I am not convinced is sound whether viewed from the N-P, Fisherian, or any other viewpoint. (But I do see the value of teaching hypothesis testing from a variety of perspectives for pedagogical purposes.)

To more precisely address your points above:

1. I am not sure what one can appropriately infer if a small p-value is obtained, frankly. I hesitate to concede that this allows me to make any inferential statement at all, including “invariably, some alternative model”.

2. When the alternative is not a point value or one-sided (where the NP-Lemma tells us the LRT is UMP), for instance in a test of a coefficient in a (generalized) linear model being 0 versus not 0 (which is sometimes loosely interpreted as a factor having an effect or not), it is not clear what alternative model I am accepting if I reject the null hypothesis. Furthermore, if we are testing a simple null versus a simple alternative (a scenario in which it is quite clear what model I accept if I reject the null hypothesis), it seems we are injecting strong prior information into the problem by restricting our attention to precisely two parameter values.

3. I do not understand what is meant by different distance measures to use — are you referring to metrics in the real analysis sense of the word?

4. I am not convinced that N-P methodology is superior to “those rudimentary tests” based on the above comment alone, but understandably a short blog comment may not be the appropriate medium to convey the requisite justification. It would be interesting to read your work on the supposed superiority of N-P methods; could you recommend particular work of yours on this topic I could read? Thanks in advance.

I would be very curious to see what goes into this introductory statistics course. One question I have about whether or not Bayesian statistics will fix anything just has to do with the fact that researchers who currently use Bayesian statistics are a self-selected bunch of quantitatively sophisticated individuals. I suspect that if Bayesian statistics had never been invented, those people would by and large still be doing quality analysis. Similarly, as performing Bayesian analysis becomes more accessible, there will be lots of new and exciting ways to perform lazy and sloppy research under the Bayesian framework. I say all of this as someone who does not actually practice Bayesian statistics, so maybe this is just me being defensive, but it seems entirely plausible to me that if Bayes replaces frequentism as the dominant framework, we’ll still have slightly different flavors of the same problem.

I completely agree, and I’m someone who uses Bayesian statistics quite a bit. There are a large number of non-statistical issues at play in (social and medical) science right now, most (maybe all) of which wouldn’t be affected by widespread adoption of Bayesian statistics.

I suspect if Bayesian statistics had never been invented, those people would by and large be doing quality analysis.

(Of course, Bayesian statistics is much older than frequentist. Frequentism seems to have arisen partly in reaction to the perceived subjectivity introduced by the prior and tries – unsuccessfully in my view – to be completely “objective”. I have a hard time thinking of frequentist statistics as having come on de novo without the influence of its precursor.)

Bill: History could be seen as initially treating the prior as just a nuisance to be minimized (up until at least the 1880s, when Galton made priors proper and used them purposefully), and then with Fisher and Neyman finding mathematical means to avoid them completely.

I like to think of Frequentist methods as valiant attempts to avoid any explicit use of a prior.

Easiest to explain in terms of interval estimation.

Bayesian: Draw a parameter from the prior, simulate the data, calculate a posterior interval and record the frequency of covering the truth.

Frequentist: The average coverage above (over randomly drawn parameters from the prior) involves the prior _unless_ the rate of coverage is constant for each and every possible parameter value (or, as Normal Deviate once clarified, >= the claimed coverage). You must find a prior that does this or a mathematical means to achieve the constant coverage for all parameters. (That can be hard to verify, but as David Cox used to point out, in any given application you can use simulation over a suitable range of possible parameter values.)

(Not everyone is happy with evaluating Bayesian intervals by their average coverage.)
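The interval-estimation comparison sketched above can be simulated directly for the normal-mean model with known variance: the classical interval covers at a constant 95% rate at every fixed parameter value, so its average coverage stays at 95% no matter which prior generated the parameter. A hypothetical sketch (names and settings are mine):

```python
import math
import random

random.seed(3)

def classical_interval(ybar, n, z=1.96):
    """Standard 95% interval for a normal mean with known unit variance."""
    se = 1 / math.sqrt(n)
    return (ybar - z * se, ybar + z * se)

def coverage(prior_sd, n=10, n_sims=20_000):
    """Average coverage when the true parameter is drawn from a prior."""
    covered = 0
    for _ in range(n_sims):
        theta = random.gauss(0, prior_sd)             # draw the truth from the prior
        ybar = random.gauss(theta, 1 / math.sqrt(n))  # simulate the sample mean
        lo, hi = classical_interval(ybar, n)
        covered += lo <= theta <= hi
    return covered / n_sims

# ~0.95 whichever prior generated theta, because coverage is
# constant (95%) at every fixed parameter value.
print(coverage(prior_sd=0.5))
print(coverage(prior_sd=5.0))
```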

What about when we see something statistically significant in a *preregistered* study?

Why not preregister all studies? Treat every non-preregistered study as a strictly exploratory enterprise, i.e., it can be used for generating plausible hypotheses, but not for even a weakly confirmatory finding.

> But if a comparison is not statistically significant, we’ve learned that the noise is too large to distinguish any signal

Well, we’ve learned that in that particular case, no signal was observed, but this fact may or may not hold in general. Even if there is a signal, unless it’s enormous relative to the noise, there’s some probability that we’ll miss it in any given study.

I am pretty sure that “failing to reject a null hypothesis” is in no way at all telling us that there is no signal. It is telling us that the signal-to-noise ratio is low (which you allude to in the second sentence, but which I think the first sentence gets wrong).

jrc: you can determine how low by using attained power or severity.

How would that work without assigning a prior probability to the null hypothesis?

no priors needed.

A power calculation alone is not going to tell me how likely I am to make a false inference.

For example, I may have high power to detect a difference, not see a statistically significant difference, and yet this outcome may still be more likely to be an unusual result under the alternative hypothesis than a typical result under the null.

We can pretend like base rates don’t exist, but that means being wrong an awful lot. Ioannidis, like him or not has made a career out of pointing out this obvious fact (based on a frequentist definition of probability even).
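The base-rate point can be made with the screening arithmetic that Ioannidis popularized: the positive predictive value of a significant result depends on the prevalence of real effects, not just on alpha and power. A sketch with made-up numbers:

```python
def ppv(prior, power, alpha=0.05):
    """P(effect is real | significant result), treating testing as screening.

    prior: base rate of real effects among hypotheses tested;
    power: P(significant | effect real); alpha: P(significant | no effect).
    """
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

# Even with high power, a low base rate of real effects means many
# "significant" findings are false positives.
print(ppv(prior=0.5, power=0.8))   # ~0.94
print(ppv(prior=0.1, power=0.8))   # ~0.64
print(ppv(prior=0.01, power=0.8))  # ~0.14
```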

By power analysis in a one-sided test you would rule out values against which the test had high power, as with an upper confidence bound.

On Ioannidis: insofar as one is interested in controlling PPVs in screening drugs for future follow-up, or over all or many sciences, the computation assumes, as he admits, that you publish after a single small p-value and are allowed to bias via cherry-picking and such (that’s how he punishes the priors). Largely outside the realm of assessing how well tested any given hypothesis is.

“Severity” doesn’t achieve anything of the sort except in instances where it’s identical or nearly identical to the Bayesian Posterior–as anyone who isn’t a fanatical Frequentist and is competent in mathematics can determine for themselves very easily.

“Severity” has no theoretical or empirical justification. It’s based on nothing more than the opinions of one philosopher who’s never done any applied statistics, doesn’t have the math background to thoroughly investigate the technical details, and has never made an original scientific inference in their entire career.

Given that creating “severity” was the high point of said career, I don’t expect that philosopher to back down and stop peddling it now. I just wanted them to know that regardless of anybody’s opinions, hopes, and wishes, the truth of it all will come out eventually if the wider statistical community ever takes “severity” seriously.

please keep the personal motivations out of the discussion.

No.

When I see unscrupulous academics I say something (what do you do?). Last month when Andrew commented on a far more qualified statistician

(see here http://statmodeling.stat.columbia.edu/2014/12/13/dont-dont-dont-dont-brothers-mind-unblind/ )

the entire discussion thread was full of “personal motivations”. Same when Gelman comments on plagiarists and so on.

An introductory course for what kind of students?

Much of the discussion here seems like the type of discussion of topics which are often called “paradoxes,” so I’ll point out that the origin of “paradox” is “contrary to (or beyond) opinion (or received opinion).” I think this fits here, if the opinion in question is the commonly held belief that things should be simple and straightforward.

I’ve also been thinking a lot about an intro stats course, and it’s not an easy question.

With any kind of teaching I think you have to consider what your goals are, and that relates to who the students are. Are they math students, students taking a stats course to fill a general education requirement, social science students who need to think about some kinds of sampling issues that other people don’t, finance students, data science students who will be working with live streams of data about anything from cash register checkouts to electrical usage or whatever else, criminal investigators or lawyers who need to understand what the data from fingerprints or DNA evidence do or don’t mean, teachers who will be flooded with tons of scales and subscales about their students thanks to the regime of standardized testing, or structural engineers?

Have they had any statistics before?

I think for any of them who have not had statistics before or who have only had some formulae, you should take as your overarching goal the idea that they need to learn to think about data and variables, starting with what they are and where they come from, but then fundamentally how to reason about them.

Then I think you might want to say: my goal is for them to understand how experienced people think about this problem, and also to understand that not everyone agrees on what is the best way to think about it. That—the idea that there is debate and passionate disagreement about a topic like statistics—is in itself a revelation to most students. Now in some disciplines this is kind of the normal state of things, people have different theoretical perspectives etc., but for other disciplines this is extremely unusual except if it is portrayed as some kind of linear replacement of earlier less good ideas with newer more good ideas. So this is not something you get to bring up once; you really have to revisit it repeatedly. Then you can say: I have my position, but you as students need to have a reasonable understanding of the main ideas of and arguments for and against each. And you can also share the happy news that in the very basic cases typically covered in an introductory course (again I don’t know your audience) the practical results on the ground are not so very different. (But they should know that this won’t always be the case if they go forward to learn more after the introductory class.)

Not knowing the level of the audience, I’ll just say that lets you then dive into both analysis and the problem of inference and really discuss, well how do frequentists approach this issue, how would you do this from a Bayesian point of view, how would you do this from other perspectives. I’d cover many fewer topics in exchange for going into that depth. Just pick comparing two means or a simple linear model and go at it.

My first statistics course was the stats for social and life sciences that G. Iverson taught, and looking back it was kind of an amazing class; even though I left it able to talk about and do NHST, I sure had thought about a lot of other ideas too. (I still also think that the one-volume edition of Loether and McTavish that the sociology students used (yes, he had different textbooks for different majors) is a fantastic book.)

Anyway, that’s my take, think about what will make your students be smart and capable when they are called on to read and to do analyses. Then build the course from there.

Elin:

I just created an intro stat course and we didn’t cover p-values at all. We did discuss the null hypothesis, and we discussed statistical significance. But no p-values. In a longer or more advanced course, I’d cover p-values and Type 1 and Type 2 errors, just to explain why I think they’re generally a bad way to frame statistical inference problems.

This is a discussion we’ve been having; I could definitely see organizing a class that way. In roughly 14 weeks there is not really that much time, and I think there are other things more worth using the time for. I asked our planning group: when are students who don’t go to sociology grad school ever going to be using anything but a convenience sample, or a sample that is all of their students or clients (if they become social workers), or a case study? I do think that it’s worth talking about reasoning and the kinds of mistakes we can make (false positive, false negative), because if we’re looking at, say, data on everyone who came into our store today or on spelling tests from 30 5th graders, we should have the idea in our minds that even though we are looking at data we could be making some kind of mistake in our conclusions. I think that has to be part of the numeracy that you are trying to help undergraduates have, but that is different than teaching formal inference.

Now, that said, I’m old enough to have been in high school when there were still seniors getting draft numbers. I well remember reading the article in Statistics: A Guide to the Unknown about how what was supposed to be random was not really random, and thinking … that could have impacted people I knew. I think it’s wrong to conclude that because survey data are not nice random samples we shouldn’t talk about randomness; lots of people buy lottery tickets.

Now that said, I think it is pretty common for people to confront publications with statistical-significance content: not only articles they read in classes, but every drug ad and lots of other materials. I think in terms of numeracy it’s good for adults to know that if you have a really big sample you can get a “significant” result for a tiny effect, and that claims using a small sample with a big effect size are something to be dubious about.
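The big-sample point is easy to demonstrate. Here is a quick sketch (my own illustration, not from the thread), using a two-sample z test with known standard deviation: a difference of 1% of a standard deviation is nowhere near significant at n = 100 per group, but overwhelmingly “significant” at n = 1,000,000 per group.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def z_for_difference(diff, n_per_group, sd=1.0):
    """z statistic for a two-sample comparison of means with known sd."""
    se = sd * math.sqrt(2.0 / n_per_group)
    return diff / se

tiny_effect = 0.01  # 1% of a standard deviation: practically negligible

p_small_n = two_sided_p(z_for_difference(tiny_effect, n_per_group=100))
p_huge_n = two_sided_p(z_for_difference(tiny_effect, n_per_group=1_000_000))

print(f"n=100 per group:       p = {p_small_n:.3f}")   # not even close to significant
print(f"n=1,000,000 per group: p = {p_huge_n:.2e}")    # tiny p, trivial effect
```

Same effect, same sd; only the sample size changed the verdict.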

I definitely think it’s a waste of time to teach dozens of tests and probably would not prioritize formal testing at all, but I think it’s good to say that these are issues we should be thinking about. I also will say that even in my research class, where it’s more about data collection, I’ll talk a bit about what the “plus or minus x percent” that you see when people talk about polls means. I do that when I’m talking about different approaches to sampling.

Elin:

I agree that it’s worth talking about reasoning and the kinds of mistakes we can make. I just don’t think that “false positive” and “false negative” correspond to the sorts of mistakes that people generally make in statistics. I think Type M and Type S errors make more sense (although those categories are not perfect either).

Andrew:

Yes, I like the M and S approach a lot, though I think that, especially in applied settings, a lot of times people really do think that effects are 0 (this program is going to do nothing). The S is really important too because of backfire effects: more harm than good is a more common program outcome than people like to think.
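To make the M and S idea concrete, here is a small simulation in the spirit of Gelman and Carlin’s design calculations (all numbers are made up for illustration): a small true effect estimated with a large standard error, keeping only the “significant” results.

```python
import random

random.seed(1)

true_effect = 0.1   # small true effect (assumed, for illustration)
se = 1.0            # standard error of the estimate: a badly underpowered design

sig_estimates = []
for _ in range(200_000):
    est = random.gauss(true_effect, se)
    if abs(est) / se > 1.96:          # "statistically significant" at the 5% level
        sig_estimates.append(est)

# Type M: how much do the significant estimates exaggerate the true effect?
exaggeration = (sum(abs(e) for e in sig_estimates) / len(sig_estimates)) / true_effect
# Type S: how often do the significant estimates have the wrong sign?
sign_errors = sum(1 for e in sig_estimates if e < 0) / len(sig_estimates)

print(f"Type M (exaggeration ratio): about {exaggeration:.0f}x")
print(f"Type S (wrong-sign rate among significant results): {sign_errors:.0%}")
```

With these made-up numbers, the published-if-significant estimates overstate the effect many times over, and a substantial fraction point the wrong way, which is the backfire worry in miniature.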

I think it is unfair to students not to cover a concept like p-values at all. Won’t they enter industry etc. and be put into an awkward situation where someone mentions p-values and they have no clue?

Wouldn’t it be rather better to at least introduce p-values & point out how others use them, albeit improperly?

Wouldn’t it be rather better to hire someone who has taken more than a single introductory stats course?

I can’t pass this one up – wouldn’t it be better to hire a PhD qualified statistician – better yet, one trained in Bayesian analysis and competent as well? This seems to reflect a common view that data analysis should only be performed by the few. I don’t agree with this and believe it is an impractical view as well. Increasingly, people are working with data, and many of them will have inadequate backgrounds in the appropriate analysis. Rather than lament this fact, I welcome it. It is a good thing for people to want to look at data – even people with little background in its analysis.

Which brings us back to the question – what should an introductory course include? I think it should be robust enough to provide a student that finishes the course with enough ability to conduct analysis of real data, have an appreciation for the most important concepts, and also an appreciation for the limits of what they know based on their limited training.

I am increasingly coming to the view that Bayesian approaches must be included in the initial statistics course, though that was not my training (other than a section on Bayes’ rule in the probability unit, which did nothing for my understanding of what it meant for data analysis).

It seems to me that hypothesis testing, decision analysis, Bayes, ecological correlation, and multilevel analysis are all related concepts and all belong in the first statistics course – but at a more applied level (at least for applied courses such as I teach – business statistics). I am not sure how to do this and would like to hear more ideas on that. I personally believe in the use of good examples – one or two good data sets that show how these ideas are related and demonstrate strengths and limitations of alternative analytical techniques. It is actually very hard to find such good examples – very quickly the “examples” become mathematical statements rather than applied to data and/or reliant on specific software implementations. It would be nice to have a couple of examples, using real data, to explore how multilevel analysis (perhaps in several varieties) and frequentist hypothesis testing (perhaps in several varieties – such as with and without interaction effects) compare.

I think the best solution is to introduce students to both approaches. If you see practical usage, it’s not right to ignore teaching either p-values or Bayesian approaches.

Admittedly only suitable for students who can already code, but if you’re willing to delay the first course in statistics until that point this book has great ideas:

http://www.greenteapress.com/thinkbayes/

konrad – I’m a bit surprised people around here aren’t more critical of Downey’s writings, given somewhat silly/naive ruminations like http://allendowney.blogspot.com/2014/04/bayess-theorem-and-logistic-regression.html

Anon:

I think we have to give people a bit of slack when they are self-taught. As long as people like Downey are open to realizing where they’re wrong, I’m not bothered if they go off in some offbeat wrong directions. That sort of exploration is one way that we learn.

Just to take a different example: what bothered me about the himmicanes people was not that they made mistakes—we all do that—but that, when their mistakes were pointed out, they doubled down instead of just admitting they’d been confused.

Well…, I was just trying to troll Rahul, because I find his comments the most consistently interesting/inflammatory on the blog and was trying to provoke him; however…

“This seems to reflect a common view that data analysis should only be performed by the few.” … Not necessarily the few, but it should be performed by the competent.

“It is a good thing for people to want to look at data – even people with little background in its analysis.” On the one hand sure: Data should drive everything and curiosity is a good thing. But you seem to be suggesting that it’s a good thing that people who don’t know what they are doing keep doing things they know little about. I contend it’s a horrible thing for people with no background in ‘appropriate analysis’ to do anything at all w/out first gaining the appropriate background. When I feel ill, I can Google symptoms and self-diagnose. This doesn’t make me a competent doctor. I think it’s great my 6 year old son wants to drive my motorcycle, I also think it would be insane to let him.

Good analytic methodology and statistical understanding isn’t something you can pick up from a single course. There’s nothing “unpractical” about expecting people to work hard and take time to understand difficult things. This seems much more reasonable to me than to continue to proliferate the number of people who believe something akin to “I obtained p<.05, therefore I’m right” because that is all they took away from their only applied statistics course. I know the preceding sounds judgmental, but the more I have studied statistics (or even just tried to follow this blog and its commenters), the more I'm convinced of how incredibly little I know. No matter how amazing an applied stats course may be, I think we do students, and their potential labs/employers a disservice if students leave these classes believing they have the tools to conduct rigorous analysis. I know I thought I was prepared after my first course, and now also know I wasn't prepared at all.

Of course, this is just anecdotal evidence and a sample size of 1…

I agree with you, it makes perfect sense, but (and you saw this but coming) it seems unrealistic to expect people to know that they don’t know enough. People self-medicate, and I think the WHO even recommends a little self-medication, because it reduces the burden on the public health system. We all know there are trade-offs, and frequently the alternative to a bad data analysis is no analysis at all, just gut feeling. What is better, a hacked p-value (and with it overconfidence) or the overconfidence of gut feeling? And let’s not fool ourselves pretending people don’t have overconfidence based on their gut feeling alone. All in all, I really don’t know what to say or do. Maybe it’s unfixable and it’s better to spend our time doing another thing than trying to improve what is not improvable? Btw, in this case, I should delete this comment, right? = )

On one level, I agree completely – but I’ll have to disagree at a more important level. No subject can be properly understood after one course. I was trained as an economist and I assure you that people who have had one course have no business thinking they actually understand anything about the economy. The world is a complex place, and statistics is perhaps more complex than most things.

But, you are living in a dream world if you think that the solution is to insist people study more than one course. The trend is precisely the opposite – whether you or I like it or not. Increasingly, people are “studying” short courses – much less than a typical university course – and then using what they have “learned” in actual practice.

You can lament this as can I, but I think it is more productive to recognize that it is happening and figure out how to best adapt to it (even while complaining that people should have more preparation). I personally think our approach to college courses is all wrong. We have virtually abandoned believing that undergraduate education is worth anything and given up on introductory courses as a serious means to learn anything. Those courses serve more to ensure continuing employment of academics in traditional tenure track roles.

I interpret Andrew’s post as a real attempt to ask what the introductory course in statistics should really include – to revisit its purpose and content. Since I teach business statistics, I have a jaded view, since that is usually the only course business students take that remotely deals with analytics. Large companies, of course, hire sophisticated analysts with much more training. But many organizations rely on people with inadequate training – and most of the decision makers even in large organizations have had insufficient training. Surely we can get across the importance of uncertainty, appreciation of learning and evolution of theories, and assessment of evidence in a single course. We have to give something up, however, and Andrew is suggesting that it might be hypothesis testing. I’m still not sure, but I am starting to believe that it is not worth the effort it takes (possibly doing more damage than good) to cover in the first (and possibly only) course.

To be an elementary school teacher or a marketing manager? No, I really don’t think it would be better.

If that’s the audience you are teaching to, nothing much matters anyways.

You don’t have children or have never worked in a business or been in a hospital that had nurses? Who do you think is in most undergraduate introduction to statistics courses?

+1 to Elin’s response.

@Elin

I’d love to see someone survey elementary school teachers’ baseline understanding of p-values.

There have been lots of surveys of statistical literacy. Most teachers don’t use p-values and wouldn’t need them. They need to know: how to plot and interpret data they collect from all the students in their class; how to interpret the individual-level test scores that they get (including understanding uncertainty in measurement); enough to understand the way some statistics-running people want to get them fired on the basis of value-added models, so that they can respond effectively; and, most important, enough to be able to teach the students in their classes how to work with data in an age-appropriate way, e.g. basic graphing, appropriate use of measures of central tendency, data collection in social studies as well as in natural science, and so on.

My experience is that very few students come out of an introductory statistics course with good understanding of sampling distributions, let alone of the concepts that depend on them (p-values, confidence intervals). More catch on the second time around — at least, if the instructor doesn’t assume the students already understand.

Although I’ve never done this, I’m beginning to think that introductory courses ought to be (almost entirely) Bayesian, but including strong cautions that this approach is not the norm in many areas. Then relegate frequentist statistics to a second course, with emphasis on comparing and contrasting with the Bayesian approach.

Why do you think that is better than the status quo which seems to lean towards frequentist followed by Bayesian?

It’s an opinion that I have been very slow to come to; some things contributing to it:

1. The (repeated) experience of having few students understand the frequentist approach the first time around.

2. In particular, having so many students “intuitively” interpret p-values and confidence intervals as if they involve Bayesian probabilities rather than frequentist.

3. The enthusiastic response I’ve gotten to Bayesian ideas when I taught a course for secondary math teachers in a master’s program. They had all had a standard introductory course in statistics. I included a brief review of sampling distributions, p-values, and frequentist confidence intervals, but the students understood frequentist confidence intervals much better after we had done some Bayesian statistics and could compare and contrast frequentist confidence intervals and Bayesian credible intervals. Many of them also said that the Bayesian approach made more sense to them than the standard frequentist approach.

4. Hearing a number of scientists and engineers say they prefer the Bayesian approach because it fits better with how scientists really think.

I know you mentioned texts the other day, but do you have a text in mind for that?

Elin,

Not sure exactly what you’re asking. Is it a text for the course for the master’s program for secondary math teachers that I mentioned in item 3? If so — I did not use a text for that course. I could probably dig up some relevant handouts, if you’re interested.

That’s fine, yes I was just wondering if you gave the students something specific to read or you completely did your own.

Martha:

“but the students understood frequentist confidence intervals much better after we had done some Bayesian statistics and could compare and contrast frequentist confidence intervals and Bayesian credible intervals”

How did you take them through that?

(I commented above how I did it, very interested, as Elin seems, in others’ approaches.)

http://statmodeling.stat.columbia.edu/2015/01/24/teach-hypothesis-testing/#comment-208377

“How did you take them through that?”

My approach (literally) to Bayesian statistics in that course was probably unusual: I considered the course to be one in probability and statistics, not just statistics. So we did some basic probability, including, of course, Bayes theorem for discrete probability, using that to consider the fairly standard type of problems involving base rates, sensitivity, and specificity for a test for a disease to figure out “probability has disease given tests positive.”

Then I had them do more involved problems such as “Under what conditions would it be better to raise the sensitivity than the specificity?” (This used their math background — e.g., understanding partial derivatives and being willing to deal with somewhat messy stuff). The students found this very interesting and engaging.
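The diagnostic-testing calculation Martha describes can be sketched in a few lines (the prevalence/sensitivity/specificity numbers here are illustrative, not from her course):

```python
def p_disease_given_positive(prevalence, sensitivity, specificity):
    """Bayes' theorem for the standard diagnostic-testing problem."""
    true_pos = prevalence * sensitivity          # P(disease and test positive)
    false_pos = (1 - prevalence) * (1 - specificity)  # P(no disease and test positive)
    return true_pos / (true_pos + false_pos)

# Illustrative numbers: a rare condition (1% prevalence), a decent test.
p = p_disease_given_positive(prevalence=0.01, sensitivity=0.90, specificity=0.95)
print(f"P(disease | positive test) = {p:.1%}")  # about 15%, despite the "accurate" test
```

The surprise for students is usually that the base rate dominates: even a 90%-sensitive, 95%-specific test leaves most positives as false positives when the condition is rare, which sets up the prior/model/posterior framing nicely.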

After that, it was easy for them to tackle a problem I found somewhere (Journal of Stat Ed, I think?) that had three possible conditions and outcomes (I forget the exact details), that was a little more complicated in use of Bayesian statistics. That then set things up for the idea of prior, model, and posterior.

I was actually amazed at how well the approach worked, at least for this audience.

I’ve decided to go ahead and start posting some of the handouts from the course for teachers mentioned above, in case others are interested. (I’ve linked the page to my name in this comment.) One thing that prompted me to do this was that Elin mentioned (in another comment below) teaching social work students. Some of the problems in the second two handouts I have posted so far actually stem from an article in a social work journal. They aren’t about hypothesis testing, but about cautions in interpreting diagnoses (including the “Would it be better to raise sensitivity or specificity” question mentioned above) — and also help get the students comfortable with Bayes theorem, hence leading up to Bayesian statistics.

Martha:

“2. In particular, having so many students “intuitively” interpret p-values and confidence intervals as if they involve Bayesian probabilities rather than frequentist.”

My experience rather is that on top of what you write here (with which I agree), these Bayesian probabilities are misinterpreted as something far more “objective” than what they actually are, with no proper understanding of the dependence on priors, where the priors come from, and how well (or not) they can be justified. Furthermore, people seem to interpret the sampling model in a frequentist way and the Bayesian parameter posterior as epistemic, which means that frequentist and Bayesian interpretations are mixed up in a way that is not really justified by any interpretation of probability applied consistently.

So I honestly don’t believe that the outcome is a progress compared with starting off frequentist and at least postponing the issue “what does the prior mean and where does it come from” to later.

“The enthusiastic response I’ve gotten to Bayesian ideas…”

I have seen a particularly enthusiastic response to Bayes from scientists who wanted a “probability for a hypothesis being true” without taking the responsibility to think about a good prior themselves, but who rather expected that this comes from some kind of objective machinery that could do its work without requiring any input. In other words, people who want the Bayesian “output format” coupled with a (pseudo-)frequentist sense of objectivity/”all work is done by the data alone”.

Christian: Agree, and that has almost uniformly been my experience with how folks do Bayesian analysis in actual practice (even some with a lot of training in Bayesian theory and methods).

Part of the challenge might be that not many statisticians teach and then later work in settings where they can experience what was actually learned, rather than what was thought to be learned.

On the other hand, I think one can start with “what does the prior mean and where does it come from” and show what happens both when the prior happened to be right and when it happened to be wrong. Maybe just with graduate students.

Xian put it nicely on slide 128 here https://xianblog.wordpress.com/2015/01/26/a-week-in-oxford/

“Ungrounded prior distributions produce unjustified posterior inference”.

Yes, I agree with Christian H. and Keith’s comments about how Bayesian statistics can also be misused and misunderstood. I think that we need to teach in ways that point out the possible confusions and caution against them.

Rahul,

I’ve never conducted a formal survey, but my teaching experience informs my opinion that students find the Bayesian setting more consistent with their pre-existing worldview and biases: that random variables can be used to model uncertain things. So if parameters are uncertain, let’s use random variables to model them.

Then, there seems to be a significantly lower barrier to convincing students of the usefulness of statistics as a field of science when you calculate a credible interval and do not have to perform what I’m sure many students initially perceive as pedantic semantic gymnastics in order to interpret said interval.

From what I’ve seen at many universities, the status quo isn’t frequentist followed by Bayesian. Undergraduates are lucky if they *ever* get exposed to any Bayesian concepts.

And this seems strange to me… Perhaps Mayo would be willing to comment on this… Because from what I understand about the Philosophical foundations of probability theory, it isn’t like the question of “frequentist probability” vs. “subjective probability” is a settled issue. There seem to be philosophers that continue to argue that subjective probability is the *only* coherent concept of probability… and this debate continues, long after de Finetti and Savage.

Here’s a link to how I have described probability to students for at least the past ten or so years:

http://www.ma.utexas.edu/users/mks/statmistakes/probability.html

As you can see, it’s not a question of “frequentist probability” vs “subjective probability,” but more of “interpretations” of probability, with an interpretation being valid if and only if it satisfies the probability axioms. (For a version within the context of a continuing education course, see pp. 9-14 of http://www.ma.utexas.edu/users/mks/CommonMistakes2011/Day1Slides2011.pdf)

Rahul: Few philosophers do philosophy of statistics these days so I can’t generalize. Most of the philosophers I know who do formal epistemology are allergic to anything subjective; the older ones like Howson are still subjective. This is largely a holdover of logical positivism. The positivists denied causes, generalizations–all as metaphysics. We’re locked “in here” in a solipsistic world of beliefs; only simple observations pass the “verifiability” criterion. Few go back to the early treatises that marked out the views that underly the famous, central subjective texts.

I think you can convince anyone of the importance of statistics if you start from a very primitive idea that we (or at least we skeptics) use every day: was that a real test? was that horribly tested? Are you interested in holding accountable the “experts” who tell you there’s good evidence of something when they’ve cherry-picked and mixed in their opinions, biases, and political interests? How uncertain you are about things, sure, that has a role, and sometimes (in very special cases) you can use probability models to capture this. That’s probability. But ‘how good a job did they do at testing your drug, genes, radiation risk?’ are questions of retaining fundamental freedoms from what those in power want you to believe or buy. If you want to know how incapable their methods were at discerning exposure to radiation, you are asking after an error probability of a method. That’s what inference rests upon.

@Mayo:

Thanks for the comment, although most of the philosophical, epistemological business is beyond me. Perhaps because I lack formal statistical training.

+1 to Martha’s and JD’s comments. I’ve also (slowly) come to the conclusion that scientists and engineers most naturally think in terms of ‘modeling’ rather than ‘tests’, and so the Bayesian approach of directly modeling the parameters as uncertain is the most natural way to extend their thinking to probability/statistics.

I think ‘tests’ – whether frequentist or bayesian – and ‘error properties of methods’ are more likely to appeal to philosophers and logicians than scientists and engineers. In some sense the current shifts towards modeling-oriented approaches seem like a sort of ‘bottom-up’ rebellion by scientists and the like, who from my experience almost never report having enjoyed statistics*. This might explain the enthusiasm for these approaches outside of traditional statistics settings. The massive differences in attitude towards these vs. traditional methods is really quite astounding, and worth capitalizing on rather than fighting, IMO.

Having finally actually read (some of) Andrew et al.’s book even I’ve found it a genuinely refreshing approach, despite my occasional snarky comments here.

* I should add that this is not to my knowledge driven by a lack of technical ability, but by a feeling that the approach does not naturally fit with what they are learning in other courses, with the vast majority focusing on models and theories rather than ‘tests’. Perhaps this unnaturalness might decrease with time, but Andrew’s approach of discussing in the context of model checking as one part of the modeling and analysis cycle again seems a better approach (for beginners at least).

I think that the biggest problem with hypothesis testing is that people tend to over-interpret their results, mainly because of a drive to get stronger and more spectacular scientific claims with little work. This problem applies to Bayesian statistics, too.

In a simple situation, let’s say a two-sample comparison in a proper randomized experiment with a single measurement to look at, a hypothesis test is a fairly simple, intuitive, and clever idea (“simple and intuitive” are relative, of course; it requires understanding probability distributions in the first place, which is tough indeed). Even then, people tend to over-interpret the result as proving either the alternative or the null, and this is where the trouble begins. People want strong statements and they don’t want to come out of an analysis saying “we have now done a bit and learned a bit but there is still much uncertainty”. I meet students who come with an attitude from school that if a result is not significant, something has gone wrong and they need to do something in order to get a “better” result. Obviously something goes wrong quite seriously, but it’s the attitude, not the method. With this mindset, there won’t be a good Bayesian analysis either.

Christian:

I agree. I think null hypothesis significance testing is problematic, whether it’s done using p-values or Bayes factors.

“People want strong statements and they don’t want to come out of an analysis saying “we have now done a bit and learned a bit but there is still much uncertainty”.”

I agree. Indeed, when I give a continuing ed course or guest lecture on Common Mistakes in Using Statistics, I list “Expecting too much certainty” at the top of the list. (See the website linked to my name for elaboration.)

http://xkcd.com/1478/

+1

And then there’s “approaching significance,” which conjures up images of a little significance creature crawling up to the significance line.

Martha: yes they should ban the term along with “trending” to significance. It’s rather trending insignificant.

I think that it would be a mistake to just not teach standard frequentist methods. They are the standard currently, and we cannot graduate students who don’t know anything about these methods (the situation is bad enough as it is, do we want people to understand frequentist tools even less than they do?). It’s better to work slowly and steadily to teach Bayesian methodology (which involves much more than running one line of code, unlike lm(), lmer(), or aov()). Teaching and learning Bayesian methods is a huge investment of time for everyone involved.

Hopefully there will be a phase transition once enough knowledgeable users enter the field. Stan has been a game changer at least in psycholinguistics (if I had been forced to use WinBUGS, and if Martyn Plummer hadn’t created JAGS, I would never have gotten into Bayesian methods). It will take time for Stan to percolate down.

People like Wagenmakers are relentlessly demonstrating the problems with frequentist methods, and that’s good and useful; that’s also creating awareness. But actual change should be implemented from the bottom-up, not as a top-down directive; that would never work.

Bayesian methods are also going to lose a lot of potential end-users due to the insistence of Andrew to not do hypothesis testing, and his calling things like Bayes factors “crap”. Andrew’s criticisms would have some force if he were an active experimental researcher doing planned experiments. But as things stand, with Andrew never having done a planned experiment in his life (well, maybe he has, but not many), criticizing researchers who are doing planned experiments to test specific hypotheses from the sidelines actually damages the broader goal of increasing awareness of and real, practical facility in using Bayesian methods.

> Bayes factors “crap”. Andrew’s criticisms would have some force if he were an active experimental researcher doing planned experiments.

Well, I have been involved in experiments of various kinds for longer than I want to admit, and I think the problems with Bayesian testing are serious (e.g. priors on nuisance parameters) and not solved to any wide consensus yet. Because of this, I stick with intervals when doing/talking about Bayesian analysis. Unlike Andrew, though (and I think like Don Rubin), I believe Fisher’s strict null test makes sense in well-done experimental settings.

I believe it’s wrong to promote Bayesian testing _because it’s Bayesian_ and folks want to do testing, until the problems have been sorted out and this is settled in a wide enough consensus.

Keith, what do you think of Kruschke’s interval estimation + region of practical equivalence framework as an alternative to testing?
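For readers unfamiliar with it, Kruschke’s idea is roughly: compute a posterior interval for the effect and compare it to a region of practical equivalence (ROPE) around zero. A minimal sketch, using made-up posterior draws and a central interval as a stand-in for Kruschke’s HDI:

```python
import random

random.seed(7)

# Made-up posterior draws for an effect (Normal(0.4, 0.1); illustration only)
draws = sorted(random.gauss(0.4, 0.1) for _ in range(10_000))

def central_interval(sorted_draws, mass=0.95):
    """Central posterior interval from sorted draws (a stand-in for the HDI)."""
    n = len(sorted_draws)
    lo = sorted_draws[int(n * (1 - mass) / 2)]
    hi = sorted_draws[int(n * (1 + mass) / 2) - 1]
    return lo, hi

def rope_decision(interval, rope=(-0.1, 0.1)):
    """Kruschke-style decision: compare the interval to the ROPE."""
    lo, hi = interval
    if hi < rope[0] or lo > rope[1]:
        return "reject null (interval entirely outside ROPE)"
    if rope[0] <= lo and hi <= rope[1]:
        return "accept null (interval entirely inside ROPE)"
    return "undecided (interval overlaps ROPE)"

interval = central_interval(draws)
print(f"95% interval: ({interval[0]:.2f}, {interval[1]:.2f})")
print(rope_decision(interval))
```

The appeal is that estimation does the work and the “test” is just a readout of the interval; the open question Keith raises, about where the ROPE bounds and the priors come from, is of course not settled by the mechanics.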
