Frank Harrell shares this horror story:

In speaking with PhD students at various graduate programs, it has become clear that those who are not exposed to Bayesian or the likelihoodist schools simply do not understand type I error, p-values, and hypothesis tests.

I asked a group of biostat PhD students at a famous program: “What is the conclusion of a clinical trial where p=0.6?” and they answered “the treatment doesn’t work.”

We, the statistics profession, have to take responsibility for this disastrous state of understanding.

Is that p = .6 or p = .06? I’d guess that in most decently powered trials, p=.6 would at least strongly suggest a much smaller effect than you think is worth testing for.

It’s p = 0.6, just like it says in the post!

Even if “well powered” to detect some minimal effect of interest, the calculations are usually highly optimistic, so a well-powered study may not be as well powered as thought. I am often surprised by people’s obsession with a pre-study probability when they can look to the interval estimate to see how much precision there is. All p = 0.6 means is that the data are not particularly surprising and are somewhat consistent with the assumptions of the model.

“….assumptions of the model.”

but which model?

The model used to compute p and the interval estimate, together with the assumptions that accompany them: no systematic error (sparse-data bias, programming errors), the test hypothesis (often the null) being true, a random mechanism of some sort at play, etc. See the paper by Greenland and me for discussion of this:

https://arxiv.org/pdf/1909.08583.pdf

The purpose of my comment was to point out that in many cases people don’t even know what this model is (it’s hidden underneath some statistical software) or that they accept that the model is a relevant and important thing to be referencing without really thinking about it.

I can make *every* clinical trial a p = 0.6 scenario if we test the mean against normal(0, 322e9).

Why would we test against that model? Well, in many ways we should ask the same question of *any* model we test against: why did we test against *that* model? There should be a specific reason other than “the people who wrote the FOO package thought these assumptions were good.”
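The normal(0, 322e9) point is easy to make concrete with a small sketch (the observed mean below is invented, and 322e9 is taken as the null’s variance): against a null that diffuse, any realistic estimate looks unsurprising, so the p-value is driven toward 1 no matter what the data say.

```python
# Sketch: a two-sided z-test of an observed mean against an absurdly
# diffuse "null" in which the mean is distributed normal(0, 322e9).
from math import erf, sqrt

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # standard normal CDF at |z|
    return 2 * (1 - phi)

observed_mean = 12.7        # made-up trial estimate
null_sd = sqrt(322e9)       # sd implied by the normal(0, 322e9) null, ~567,000
z = observed_mean / null_sd # essentially zero
p = two_sided_p(z)
print(round(p, 3))          # effectively 1.0: "no surprise" by construction
```

The choice of null does all the work here, which is the point: a p-value is only as meaningful as the model it was computed under.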

Yes, I agree with you here. Unfortunately, much of this stuff is never mentioned in standard textbooks for statistics, and especially not in those for other disciplines. Most definitions of p-values make absolutely no mention of a statistical model and its other assumptions, just that the main and only assumption is that the null hypothesis is true, so these students think that p-values are only about the null hypothesis and whether it has been rejected by a test or not.

Of course, p values *are* about “null hypothesis and whether they’ve been rejected via a test or not” (I mean, if you define whatever sampling model you have as a “null” then the p value tells you whether it’s unlikely to have generated the data or not).

The issue is that everyone is implicitly taught that there is a *very specific* well-known “null” hypothesis which is a well defined thing in all cases, automatically has relevance to the research question, and can be definitively checked by reference to the NIST Handbook of Official Tests of Null Hypotheses.

Daniel, yes, p-values are associated with the null hypothesis because of history and tradition, but they are not exclusively limited to testing the null; they can be computed for any alternative hypothesis of interest. That is why Greenland et al. 2016 was so careful in wording it as “probability of a test statistic at least as extreme as observed, assuming that every model assumption were correct,” and why in Chow & Greenland part I we write the same as above but with the addition of “assuming that the target test hypothesis and every model assumption were correct.” This is what Sander and I have been trying to push: computing p-values for alternatives, reframing them as bits of information, and using graphical functions like confidence/consonance distributions to see all p-values and the compatibility of parameter values.

I agree that we should stop calling things “null” and just call them “hypothetical random number generators” or something.

Not surprised. Last year I mentioned to 5 recent new recruits with biology grad degrees that it was a common misconception that the p-value is the probability the null hypothesis is true. All 5, from different universities, quickly responded: that’s what our stats prof told us it was!

This raises two possibilities: either the new recruits did not understand what their stats profs said (in which case their stats profs didn’t test the students well enough), or the stats profs didn’t understand what they were talking about (which might have been because they were not really statisticians, but biologists who either knew more about stats or were more self-confident teaching it than their colleagues). Another possible contributing factor is that the stats textbook was poor quality (a lot of them are, and often the worst ones are the most popular because “they make it easy” — but by making it wrong).

I think there is also something about the symmetry of “low p-value (i.e., low relative to .05) is evidence for the alternative hypothesis, high p-value is evidence for the null hypothesis” that is just too tempting for people’s minds. There may be a strong tendency to prefer (seemingly) simpler explanations, which are also easier to remember.

“There may be a strong tendency to prefer (seemingly) simpler explanations “

I think this is a very common part of human nature. Unfortunately, if something involves statistical inference, it involves complexity — and, in particular, uncertainty. A lot of people are complexity-averse, and a lot of people are uncertainty-averse. So teaching (and learning) statistical inference involves going against those two strong “instincts”. Mother Nature is not on our side.

True, but I do know that one of the universities had that on the answer key for the final exam in intro stats, and that the textbook at that university introduced the p-value with a correct definition that then morphed into something like, essentially, “the p-value is the probability the null hypothesis is true.”

This seems like a much worse mistake than the original one. The difference between Pr(H_0) and Pr(Data|H_0) is conceptually rather large.

Worth noting, perhaps, that Pr(H_0) is Bayesian, because it says that a hypothesis has a probability. Also, Pr(Data|H_0) is not right either, because non-data are also included, as a rule. And, as has been mentioned, H_0 is not the only hypothesis in the givens.

What did they say when you told them it’s an even more common misconception that p-values are generally useful?

I’d bet the Nobel prize winners using p-values would disagree with you.

http://www.statisticool.com/nobelprize.htm

Justin

I guess the conclusion is “Good luck getting that sucker published”.

During the course of my PhD I received zero training on statistics, frequentist or otherwise. Zero.

It was either assumed that my undergraduate degree had covered this sufficiently (it hadn’t) or that I would pick up the right tools as I went along.

I agree that this is disastrous, but is it at all surprising? Maybe you all talk to far fewer life sciences grad students (or faculty) than I do…

I don’t think it has to do with lack of exposure to Bayesian or “likelihoodist” (I don’t know what that means) schools, but rather with a lack of exposure to actually *thinking* about statistics rather than memorizing silly rules. It’s sad to see how many bright biology students, and how much potentially good science, have been damaged by inept teaching.

+1

I attended a biology “journal club” type seminar for a few years, and of course often brought up objections to misuses of statistics, so got put on a hiring committee for someone who could teach bio stats. One candidate was just miserable in his understanding of basic stats. But at least he realized quickly that he was flunking the job interview.

What makes all of this so shocking is that their careers *literally* depend on the p-value concept, yet very few of them even know what it means.

Good point.

Their career depends on them not knowing or at least acting the same way as everyone else, who act as though the ritual has meaning. Is it any wonder most just accept the ritual?

My understanding is that many in the hard sciences don’t use inferential statistics, per se.

As is mine. My comment referred to the life sciences (and, I think, social sciences) mentioned in Raghuveer’s post.

Right. I also just realized that the main post referred to biostat PhD’s, not biology PhD’s, so your observation definitely applies!

Are climate scientists hard scientists? See http://www.realclimate.org/index.php/archives/2020/03/why-are-so-many-solar-climate-papers-flawed/

There are some for sure. High energy physics comes to mind

http://www.pp.rhul.ac.uk/~cowan/stat/cowan_munich16.pdf

As does some work on quantum computing and work by Nobel prize winners

http://www.statisticool.com/quantumcomputing.htm

http://www.statisticool.com/nobelprize.htm

Justin

So, what is the conclusion of a clinical trial where p = 0.6?

Daniel:

One conclusion is that the data provide very little information about the sign of the effect. Another conclusion is that the data are consistent with zero effect.
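Both readings can be seen in a quick back-of-envelope sketch: for a two-sided z-test, p = 0.6 pins down the test statistic, and the implied 95% interval (expressed in standard-error units, not any particular trial’s scale) straddles zero widely.

```python
# Sketch: what a two-sided p of 0.6 implies about the estimate, in SE units.
from statistics import NormalDist

nd = NormalDist()
p = 0.6
z = nd.inv_cdf(1 - p / 2)          # |z| implied by a two-sided p of 0.6
lo, hi = z - 1.96, z + 1.96        # 95% interval for the effect, in SE units

print(round(z, 3))                 # ~0.524: estimate about half an SE from zero
print(round(lo, 2), round(hi, 2))  # ~(-1.44, 2.48): sign and size both unsettled
```

The interval runs from a sizable negative effect to a positive effect almost five times the point estimate, which is exactly “very little information about the sign” and “consistent with zero” at once.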

If someone asked me that question at a random moment, I imagine my top-of-head answer, verbally, would be something like “No evidence of a treatment effect”.

They’d have to give me about three or four minutes and a keyboard to type on if they wanted the standard 70-word fully qualified manuscript-speak.

They failed to reject the test hypothesis, assuming that alpha was set lower than 0.6. Seems fairly simple to me; it just takes a lot of practice to understand it well enough to come up with correct short responses that aren’t bad shortcuts.

Brent:

I think “no evidence of a treatment effect” is a reasonable answer.

I think a lot of people (including me) have a hard time seeing a meaningful difference in language between

“The treatment doesn’t work”

and

“No evidence of a treatment effect”

I believe the statisticians have let us down in this regard, by having such nuanced language that only experts can understand it.

Statistics isn’t the only field guilty of this – but it’s the one we are talking about in this post/blog.

“I believe the statisticians have let us down in this regard, by having such nuanced language that only experts can understand it.”

A big part of the problem is that statistics (like many technical fields) is full of subtle concepts that are easily misunderstood. So it’s not so much a matter of “having nuanced language that only experts can understand” as it is of convincing students that subtle differences are very, very important in technical subjects. Teachers of statistics and other technical subjects need to devote a lot of attention to this.

Unfortunately, people (including students, teachers, and users of statistics) often try to “explain simply” things that can’t be explained simply, because they aren’t simple. In particular, statistical inference is something you do when the context inherently has uncertainty, so you can’t expect to have an answer that is certain. The best you can do is to try to narrow down the degree of uncertainty somewhat (and just how much you can do this depends on the specific context).

An additional complication is that ordinary words are often used in technical subjects to have a technical meaning — understandably causing confusion for students and users. So teachers of technical subjects need to be careful to point this out. My experience (teaching math for many years, then statistics for several years) is that this needs to be pointed out again and again — the tendency for people to think they understand when they don’t is part of human nature.

I once heard a computer science professional (professor?) at a symposium lament that physicists and physicians had gotten this right, while the CS field had not. When faced with a new concept or thing, physicists and physicians give it a new name derived from Latin or perhaps Greek roots. That name, being new, brings with it no connotations, so people have to read and use the definition.

By contrast, CS appropriates existing words, and then we confuse ourselves by drawing on the original meaning of the words.

I once had a two-day fight with my boss over part of a study, which went so far that I wondered if I would resign. It turned out that his terminology (economist) and mine (psychologist) were totally different but meant the same thing.

Language is important.

I think the biggest issue statistics faces is that most people using and interpreting statistics are not themselves statisticians.

I am not a statistician, though I do have an interest in the field (obviously, I’m reading a statistics blog, after all), but I use statistics in my field almost every day, either running the analyses myself or interpreting someone else’s. I can’t really think of another field like this (English departments, maybe?). There are far more scientists who use statistics than graduates of statistics programs at universities, and most science teams I’m familiar with do not have a single MS- or PhD-holding statistician in their midst.

“I believe the statisticians have let us down in this regard, by having such nuanced language that only experts can understand it. “

I disagree here with Martha. The distinction here is important and not a subtle concept for technical experts. The failure is a failure of the education of the general public. What we basically have is quite a fundamental difference. The former is a positive statement about the treatment. The latter is a statement about the nature of the evidence.

To translate it into a different context:

“He’s guilty”

vs

“The evidence doesn’t exonerate him”

These two statements are obviously massively different.

I agree that’s the difference between the statements as I understand it. But isn’t it true that, if the test is well-designed and therefore powerful enough to detect relevant effect-sizes, that not only does the evidence not exonerate him, but it is in fact also evidence of his guilt? Is the point that is being made here (in this post and discussion) that p=0.6 means that the evidence indicates that either the treatment doesn’t work, or the test isn’t well designed to detect that? And that we should not lose track of the second possibility?

We don’t entirely disagree. I agree that there are also problems with the education of the general public, but when we teach technical subjects, we need to go beyond what is in the education of the general public.

For example, I believe that one thing that teachers of statistics need to do is to compare their grammatical usage with similar grammatical usage in a different context — because use of technical terms can make recognizing the grammatical aspects more difficult, especially at first. So, for example, a good teacher of statistics should do precisely the type of “translation” that Zhong Fan gives: point out the grammatical similarity of the contrast between

“The treatment doesn’t work” and “No evidence of a treatment effect”

and the contrast between

“He’s guilty” and “The evidence doesn’t exonerate him”

Mathijs said:

“…. But isn’t it true that, if the test is well-designed and therefore powerful enough to detect relevant effect-sizes, that not only does the evidence not exonerate him, but it is in fact also evidence of his guilt? Is the point that is being made here (in this post and discussion) that p=0.6 means that the evidence indicates that either the treatment doesn’t work, or the test isn’t well designed to detect that? And that we should not lose track of the second possibility?”

I think you’re on the right track — but it’s important to recognize and emphasize that “well-designed” cannot be understood to mean “perfect”. Experimental design is inherently limited by our lack of knowledge of the real world. We can do our best to design and carefully carry out an experiment, but our best still involves assumptions, so our results are always contingent on those assumptions. In short, uncertainty is always with us.

I have a feeling that there is a nice (and perhaps memorable) way to teach someone about these nuances:

Suppose I have a gift box with a nice ribbon tied around it. Now I ask you if you think there is a present inside. You say you don’t know (or perhaps you guess, and if you do and you can articulate a reason for your guess, that’s a nice example of a prior and we talk about that some more).

Next, I give the box to you and ask again, do you think there is a present inside? Most people would probably shake the box (i.e., do an experiment). Now suppose that the box feels very light and when shaken you don’t hear or feel anything moving around in it.

At this point, it would be wrong for you to say that you know that there is no gift in the box. If you say that, we can talk for a little while about how you think you can know this and how certain you can be about your statement. This may help to appreciate the difference between “X doesn’t work / There is no gift” and “No evidence for a gift”.

In the end, I allow you to open the gift box. It reveals a nice and very light silk handkerchief, which is yours to keep.

You can state it more concretely as “The data are too sparse or too noisy to say anything” or “The effect is too small to detect given the level of noise in the data.”

And conversely, p < 0.05 means something like: “There is a barely discernible signal in the noise, but we don’t know how big it is; could it just be a small measurement bias?”

Or, as Mayo would say, the hypothesis has not been severely tested, so it has received only weak support.

Tom:

I’m happy working “severe testing” into any discussion of p-values. When discussing a statistical method, it’s important to think about not just its mathematical foundations and not just about what it does in practice, but also about the real-world problem it’s intended to address. In the case of p-values, severe testing is part of the goal.

Tom:

If this is a genuine high P-value, there’s poor evidence for a genuine effect. It may warrant inferring an upper bound for the population effect (with severity equal to the corresponding confidence level).

> If this is a genuine high P-value, there’s poor evidence for a genuine effect.

A “genuine high p-value”? Every p-value is genuine assuming the assumptions the calculation was derived from are all true, and not genuine if any are false. Almost all uses fall in the second category.

By “genuine” do you mean “free from p-hacking”?

This is the problem I have had all along with severe testing. As a general concept (covering measurement issues, model assumptions, statistical analyses, etc.) I think it is great: there are many ways a claim can be said to have failed to undergo “severe testing.” However, when confined to a specific analysis resulting in a particular p-value, I have been less enthusiastic, and the examples in Mayo’s book seem like the latter to me.

Once many of the myriad issues are assumed away, the remaining use of the p-value seems to comport with its standard meaning (the one generally accepted by those who use p-values), and severe testing just seems like a reiteration of the textbook meaning of p-values (and I don’t have a problem using p-values, though I reject the use of any particular significance cutoff level). But the whole appeal of severe testing to me is the idea that the concept can cover the spectrum of issues associated with the collection of data and its analysis. What I found wanting in Mayo’s book was the lack of any specific guidance covering those general issues.

So, if a p-value is 0.6, I’d like to ask whether it has been severely tested. If any of the issues we think about (as in Andrew’s comment just above) are relevant, then I am reluctant to infer much (if anything) from the p-value. I think those who would say the p-value of 0.6 tells us something must be making a number of assumptions about the study that resulted in that value. And it is precisely those assumptions that seem to concern much of our attention.
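Mayo’s suggestion above, inferring an upper bound for the population effect with severity matching the corresponding confidence level, can be sketched for a normal mean (my reading of severity for the one-sided claim mu < mu1; the estimate and SE below are invented, chosen so the two-sided p is about 0.6):

```python
# Sketch: severity for "mu < mu1" after a non-significant result,
# under a normal model with known standard error.
from statistics import NormalDist

nd = NormalDist()
xbar, se = 0.52, 1.0   # estimate about 0.52 SE from zero, roughly p = 0.6

def severity_mu_below(mu1):
    """Severity for the claim mu < mu1, given the observed xbar and se."""
    return nd.cdf((mu1 - xbar) / se)

upper = xbar + nd.inv_cdf(0.95) * se  # bound warranted with severity 0.95
print(round(severity_mu_below(upper), 3))  # 0.95 by construction
```

The numerical coincidence with the one-sided 95% confidence bound is exactly the “severity = the corresponding confidence level” equivalence, which is also why some see this as a reiteration of textbook machinery rather than something new.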

Andrew, already you are assuming that the test is a test of difference in means, compared to a null of 0…

What if the test was a chi-squared test of goodness of fit of a theoretical histogram to an actual histogram… We’d conclude that our drug is performing exactly as expected and is very effective!
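The chi-squared scenario is worth spelling out with made-up counts (the survival function is written in closed form for even degrees of freedom to keep this stdlib-only): in a goodness-of-fit test, a large p-value says the observed histogram matches the theoretical one well, so “p = 0.6” would be read as success, not failure.

```python
# Sketch: chi-squared goodness-of-fit of an observed histogram to a
# hypothetical theoretical model, where a HIGH p-value means "good fit".
from math import exp, factorial

def chi2_sf_even_df(x, df):
    """Survival function of a chi-squared variable with even df (closed form)."""
    assert df % 2 == 0
    return exp(-x / 2) * sum((x / 2) ** i / factorial(i) for i in range(df // 2))

observed = [18, 22, 21, 19, 20]   # invented histogram of outcomes
expected = [20] * 5               # counts predicted by the theoretical model
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p = chi2_sf_even_df(stat, df=len(observed) - 1)
print(round(p, 3))                # ~0.974: the data fit the model comfortably
```

Same number, opposite interpretation, which is why “p = 0.6” has no conclusion at all until you know which model and which test statistic produced it.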

An innocent question from a non-statistician: why is “It doesn’t work” such a bad summary of “The data are consistent with zero effect”? The most meaningful distinction I can see is the second statement still allows “The data are consistent with a large effect also”. I suppose this cannot be directly decided from the p-value, but, if I am not mistaken, if the test is well designed so that it is powerful enough to detect the effect (size) of interest, then the data with such a high p-value are not consistent with a large effect.

To put it another way, what kind of mistake am I allowing myself to make or am I more likely to make if I use statements like “It doesn’t work”?

Mathijs:

1. The treatment can work, it just might be that the experiment is too noisy to show it.

2. The treatment can work for some people but not others.

See also this article.

That’s of course just a subset of situations. There’s also “there’s 200 other studies showing the treatment working, this one is most likely a fluke”.

Also:

3. The treatment can work for virtually everyone in the population, and the study and measures can be well-designed, and we can get p = .6 by chance.
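Point 3 is easy to check by simulation (a sketch with invented numbers: the true standardized effect below corresponds to roughly 80% power at alpha = 0.05):

```python
# Sketch: even a real effect in a well-powered trial occasionally
# produces a large p-value purely by sampling variation.
import random
from statistics import NormalDist

nd = NormalDist()
random.seed(1)

true_z = 2.8          # mean of the z-statistic when the effect is real
n_trials = 100_000
big_p = sum(
    1 for _ in range(n_trials)
    if 2 * (1 - nd.cdf(abs(random.gauss(true_z, 1)))) >= 0.6
)
rate = big_p / n_trials
print(rate)           # roughly 0.01: rare, but it happens by chance alone
```

About one trial in a hundred here lands at p >= 0.6 despite a genuine, well-powered effect, so a single high p-value can never, on its own, establish that a treatment doesn’t work.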

Another non-statistician here.

So, would a simple substitution of “didn’t” for “doesn’t” help at all here?

“The treatment doesn’t work” suggests a conclusion with some degree of finality, and implies it will not work in the future.

“The treatment didn’t work” only speaks to treatment effect in that specific clinical trial, which (assuming sufficient power and competence of design/implementation of the clinical trial) does not necessarily speak to future trials.

This is a very good question, and (as noted elsewhere here also) gets to what I think is the key issue: there’s a very big difference between “it doesn’t work” and “we can’t tell if it works or not.” Saying the former when one should say the latter seems minor, but it has a huge effect on how people think. Time after time, I’ve seen people pounce on small effects with small p-values as “the mechanism” behind something, ignoring and even dismissing large effects with p>0.05 rather than taking that as a cue to better measure the latter.

But the question we’re really “testing” isn’t “Does it work or not?”, but something more like, “Does it work a large enough percentage of the time?” or “Does it work for a large enough percentage of the population in question,” etc.

Raghu:

This is an excellent point, which was not raised elsewhere in this discussion.

Declaring an effect as zero can be taken as evidence that some alternative explanation is true. This is a huge problem, and it happens all the time in so-called robustness studies. The placebo-control analysis is not statistically significant, hence the researcher concludes that the preferred story is correct.

So far we have a statement misinterpreting the p-value, “The treatment doesn’t work,” and a correct, though brief, interpretation, “No evidence of a treatment effect.” In my experience, most interpretations of significance in the literature are written as “There is no treatment effect.” Hence, are we saying that if we add “evidence of a” in between the “no” and “treatment” then we’re in the clear? Is the distinction between “good” and “bad” interpretations that subtle? Based on my reading of this blog and papers mentioned herein, I had thought it was critical to add the caveat that Mathijs mentions: “The data are consistent with a large effect also” or, at least, researchers should keep the caveat in mind when writing their results and discussion sections.

To clarify my above point about adding the caveat “The data are consistent with a large effect also”. I am assuming we have the parameter estimate and confidence intervals to be able to say something about the range of effect sizes. I understand that a p-value alone tells us nothing about effect sizes.

To move from “we cannot tell from the results in this study that such and such works”

to

“it doesn’t work”

is a glaring fallacy. For one thing, the study may have had no capability of uncovering that it works even if it does.

Instead of “the treatment doesn’t work” you could say “the treatment doesn’t seem to work”. Things are not always what they seem and keeping the distinction in mind can be helpful.

The conclusion is It Depends.

Exactly. And what does it depend on? About 99.7% of all people don’t know.

The premise that the p-value is sufficient information is already so wrong. It isn’t even relevant information.

I’ve been totally ignoring all p-values for years and seem to have a better eye for interesting data than the people publishing it…

One issue I have seen in teaching is that some textbooks, even those written by statisticians, contain unclear descriptions of confidence intervals and/or significance tests. For instance, the standard language “we are 95% confident the true value is within this interval” makes the statement seem to be about the calculated interval rather than about a long-run property of the estimator. I suppose the excuse for this is that the language is just shorthand for the admittedly long-winded full description, but students do not know this.

One thing I have learned from my own teaching is that if the instructor deviates from the textbook, this causes some anxiety among the students. I gave students my own description of the meaning of a confidence interval, and many students wrote on assignments and exams “the textbook says X, the professor says Y” so as to hedge their bets.

So, one big thing we need to do is to put pressure on the textbooks to reform. When a new textbook comes out, whether written by a statistician or a scholar in another field, there should be published reviews that pay special attention to its terminological strengths and weaknesses. One possibility going forward is that the ASA, CSA, RSS, etc. could set up a repository of reviews of introductory statistics textbooks as a resource for instructors. The harder problem would be raising awareness of such things among those who don’t follow the statistics world closely.
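The “long-run property of the estimator” reading can be demonstrated in class with a short simulation (invented normal data with known sd): the 95% describes how often the procedure’s intervals cover the truth over many repetitions, not the probability that any one computed interval contains it.

```python
# Sketch: long-run coverage of the usual 95% confidence interval
# for a normal mean with known standard deviation.
import random
from math import sqrt

random.seed(7)
true_mean, sd, n = 10.0, 2.0, 25
se = sd / sqrt(n)

n_reps = 20_000
covered = 0
for _ in range(n_reps):
    xbar = random.gauss(true_mean, se)           # draw one study's sample mean
    lo, hi = xbar - 1.96 * se, xbar + 1.96 * se  # that study's 95% interval
    if lo <= true_mean <= hi:
        covered += 1

rate = covered / n_reps
print(rate)                                      # close to 0.95
```

Each individual interval either covers the true mean or it doesn’t; only the procedure, run many times, earns the 95%.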

The example is particularly dismaying to me as it comes from biostats students. If this were a program in economics, psychology, biology, or any number of sciences, this result, while still dismaying, would not be surprising. I would have thought that biostatisticians would be a bit more careful.

Your comments here fit well with my three comments above: even when biologists are used to technical meanings of terms in biology, they may be oblivious to the necessity of using and understanding technical meanings in statistics. (I would guess that part of this is that biological names are often in Latin, or are not words that have “ordinary” meanings, although there are still a fair number of situations where an “ordinary” word also has a technical scientific meaning.)

Might be difficult getting that review published. And regularly publishing them would probably make enemies. You may be, with noble intent, putting the textual cart before the paradigmatic horse.

That’s true, though it might be easier to do this for the books published by non-statisticians. I would hope these are where more of the errors are, and that it will be easier to convince statisticians to clean up the language. I think the issue is that we are so accustomed to shorthand jargon and its longer meaning that we have sometimes forgotten that the correct description is not obvious. In any case, if this problem extends even to biostatistics, then it is worse than I had thought.

In any case, practically all introductory statistics courses are geared toward getting students running tests and drawing conclusions by the most direct way possible. In this way, we reinforce the idea that statistics is for making decisions rather than modeling uncertainty. It seems to me that a properly taught statistics course has as much in common with a composition course as with a mathematics course. Precision and correctness of language is no less essential than correct calculation procedures. In this regard, another thing working against us is that evaluating the former takes a great deal more time than the latter which can easily be automated.

Agreed. Even many stats professors teach statistics to students in other sciences as a set of software features. In your terms, the statistics in a paper are not the numbers; they’re the paper.

John:

Official guides and best practice manuals are increasingly getting basic definitions wrong.

https://errorstatistics.com/2019/09/30/national-academies-of-science-please-correct-your-definitions-of-p-values/

To their credit, the National Academies of Science did fix the most glaring cases of erroneously defined P-values, after I wrote to them.

I would ask for the set of responses and the question actually asked. This seems like a twitter quote (not linked for some reason)?

The whole point of this blog has seemed to me to be to talk about poor usage of statistics in general, specific, and/or common usage, but you take a 4-word quote from a second party, summarizing the response of a third, undefined and anonymized group, as the foundation for a post. Cool, cool, fine, but I’ll take it with the same mountain of salt I take every 5-word summary.

Show me the question he asked. Show me the answers. That windmill will still be there if you do so and we will tilt it together.

Arbortonne:

It’s not a Twitter quote. There’s no link because it came from an email. It’s an anecdote; make of it what you will. The point of this blog is to discuss statistical modeling, causal inference, and social science. Sometimes an anecdote can lead to a good discussion.

Somewhat relatedly, the other day I pointed my students to a study ( https://osf.io/z7kq2/ ) that found 89% of intro psychology textbooks define or describe statistical significance testing incorrectly. I personally doubt that fully 89% of the authors of psychology textbooks do not know the precise definitions of those terms or were too lazy to look the definitions up in a reputable source. Maybe 50%. But some non-negligible percentage are writing things in their textbooks that they know are not precisely true, because they (perhaps rightly) do not think their audience can handle the actual definitions. So they say something in shorthand based on how p-values are used in practice rather than how they are supposed to be interpreted, and this perpetuates the bad practices. But I imagine it is hard to write (or at least to sell) an intro psych textbook that ignores significance testing, and it is even harder to sell an intro psych textbook that treats it rigorously.

“But I imagine it is hard to write (or at least to sell) an intro psych textbook that ignores significance testing and it is even harder to sell an intro psych textbook that treats it rigorously.”

+1

And that may also be true for intro stat books for other fields.

Maybe we need to let the other fields think that *they* discovered the problem with p-values, showed those statisticians what’s what.

Maybe they know or sort of know the technical definition, but they also know that in practice p-values are used in the way they describe (even if they shouldn’t be), and so just use the latter description?

Also, is there any quick way for me to guesstimate how often this incorrect understanding would lead to “bad decisions”, assuming some reasonable operationalizing of “treatment doesn’t work”? It feels like there should be if I make a couple quick assumptions, but I’m still new to stats.

I think when you refer to making “bad decisions” you mean “false conclusions” or “unacceptable conclusions.” The probability of the former is 1 – .6 = .4, or 40% of the time, assuming you accept the simplification that treatments either do or don’t work. The probability of the latter depends on your audience and what they are willing to accept. I suspect the institution funding your research would object to your conclusions 0% of the time.

It means that your paper’s title should begin with “Scientist’s new sure-fire way to…”

Under what circumstances would concluding “The treatment didn’t work” be an appropriate summary of clinical trial results?

Surely there do exist “treatments” which are entirely ineffective. If one of these were meant to be studied, shouldn’t there be some sort of experiment or trial which could convincingly demonstrate that true non-effect?

A philosopher, a statistician, and a biologist were travelling through Scotland when they saw a black sheep through the window of the train.

‘Aha,’ says the biologist, ‘I see that Scottish sheep are black.’

‘Hmm,’ says the statistician, ‘You mean that some Scottish sheep are black.’

‘No,’ says the philosopher. ‘All we know is that there is at least one sheep in Scotland, and that at least one side of that one sheep is black!’

Biologist: writes many papers, obtains fame for discovering new endangered species

Statistician: inspired to start a blog about how poorly scientists understand statistics, recognized as expert, obtains some notoriety

Philosopher: dies in a homeless camp under freeway in abject poverty

Jim:

Just to be clear, I’m much more interested in how to do statistics well than in how to do it poorly. But it can be useful to study how it is done poorly, in order to understand how to do it well. I’ve learned a lot from all these discussions and bad examples over the years. Even in this thread, I learned something; see here.

I’ve also spent lots of time over the years responding to misconceptions in blog comments. That might be a waste of time, but I think of these discussions as practice for more formal explications.

“I’m much more interested in how to do statistics well than in how to do it poorly. “

yes, that is true, the above was facetious.

“I’ve also spent lots of time over the years responding to misconceptions in blog comments. That might be a waste of time”

Definitely not a waste of time. I’ve learned a lot here. I like the debate. There are many others too whose contributions I look forward to reading.

That reminds me of a scene in Heinlein’s “Stranger in a Strange Land”. One character is what is called a “Fair Witness” – a highly trained observer whose statements about what he/she witnessed can be taken as fact in a court of law.

She is asked “What color is that house over there?”. The answer – “It’s white on this side.”

Just stating what the data show, no more.

Nice!

An interesting question. I think you’d require several components:

1. A good (predefined) definition of what works/doesn’t work means, for example a definition of practically significant effects.

2. A defined acceptable level of uncertainty. For example “if in the posterior we think the value is in the null interval with 90% probability we think it’s ineffective” – or some frequentist analogue.

3. Rigorous checking to rule out model misspecification, confounders, etc.

4. A plausible scientific theory *why* the treatment might not work.

Others might consider more elements…
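A minimal sketch of point 2 above, assuming posterior draws of the treatment effect are already in hand (here they are simulated stand-ins, and the null interval and 90% bar are illustrative choices, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in posterior draws of the treatment effect. In practice these
# would come from a fitted model; the N(0.05, 0.1) shape is illustrative.
effect_draws = rng.normal(0.05, 0.1, size=20_000)

# A pre-defined null interval of practically negligible effects (assumed).
null_lo, null_hi = -0.2, 0.2

# Posterior probability that the effect is practically negligible.
p_null = np.mean((effect_draws > null_lo) & (effect_draws < null_hi))
print("P(effect in null interval) =", p_null)

# Declare "ineffective" only if that probability clears the chosen bar.
print("call it ineffective:", p_null >= 0.90)
```

The null interval and the 90% threshold are exactly the kind of things that, per point 1, should be pre-specified and defended, not picked after seeing the data.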

Brent:

You can never convincingly demonstrate a zero effect from the data alone. At best you could convincingly demonstrate that the data are inconsistent with any large effect, or any nontrivial effect. To do this you’d want an uncertainty interval that only contains trivial effect sizes. The p-value by itself doesn’t tell you this.
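One way to operationalize “an uncertainty interval that only contains trivial effect sizes” is to check whether the entire confidence interval falls inside a pre-specified zone of negligible effects. A sketch with made-up data and an assumed triviality bound of ±0.3:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical trial: treatment-minus-control differences (made-up numbers;
# the true effect here is a negligible 0.02).
diffs = rng.normal(loc=0.02, scale=1.0, size=200)

n = len(diffs)
mean = diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)

# 95% confidence interval for the mean difference.
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (mean - t_crit * se, mean + t_crit * se)

# Pre-specified triviality bound: effects inside (-0.3, 0.3) are treated
# as clinically negligible (an assumption for illustration).
only_trivial = -0.3 < ci[0] and ci[1] < 0.3
print("95% CI:", ci)
print("interval contains only trivial effects:", only_trivial)
```

Note that the p-value against a zero null could be large here while the interval still straddles non-trivial effects, which is why the interval, not the p-value, carries the relevant information.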

So taking it beyond the question of deriving conclusions from a specific set of data…

What soup-to-nuts protocol would be required in order to do a “trial” demonstrating convincingly that a certain ointment does not prevent pimples?

Or rather I mean to ask, is there any conceivable experiment that this group feels can convincingly answer a question of that form with an answer of “it does not work”?

Specify what you mean by “it works”, which should be something like “the overall benefit as measured by U(x) is positive”. Then specify the protocol to be tested, and the statistical assumptions behind the data analysis model in terms of a mechanism, a measurement protocol, a prior for the unknown parameters in the mechanism and measurement protocol, and a probability of observing some data given the parameters, mechanism, and measurement protocol… Then compute the posterior distribution of the parameters, simulate the expected outcome in future uses (the posterior predictive) and show a variety of things like the expected U(x) and the posterior probability that U(x) is positive, as well as the distribution of U(x).

At this point, assuming Expected(U(x)) is positive, declare “as far as we know, it’s worth using”; if p(U(x) > 0) is large (say, 95%), declare “the evidence shows that it probably works”.
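Here is a stripped-down version of that pipeline, using a conjugate normal model and a toy utility. The prior, the known sigma, and the fixed cost inside U are all illustrative assumptions, not part of any real protocol:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up trial outcomes; every number here is an illustrative choice.
y = rng.normal(loc=0.5, scale=1.0, size=50)
sigma = 1.0                      # measurement sd, assumed known

# Conjugate normal prior on the true effect mu.
prior_mean, prior_sd = 0.0, 2.0

# Standard normal-normal posterior update for mu.
n = len(y)
post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
post_mean = post_var * (prior_mean / prior_sd**2 + y.sum() / sigma**2)

# Posterior predictive: simulate outcomes for future uses.
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=10_000)
x_future = rng.normal(mu_draws, sigma)

# Toy utility: benefit proportional to the outcome minus a fixed cost of use.
def U(x):
    return x - 0.1

print("E[U(x)] =", U(x_future).mean())
print("P(U(x) > 0) =", (U(x_future) > 0).mean())
```

The point of the exercise is that the declared conclusion hangs on E[U(x)] and P(U(x) > 0), quantities a p-value alone never delivers.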

But isn’t that true for any effect? You can only make probabilistic statements about drug working or not working, whatever the definition of “working” might be, based on trial data. But in any normal conversation it’s impossible to talk that way.

Michal:

I don’t see why I can’t talk that way in normal conversation. For example, when discussing “power pose” on this blog, I’ve never said that it “doesn’t work” or that it has “no effect.” I’ve consistently said that I expect any effects it has to be highly variable across people and situations, and that various claimed effects in the literature were not supported by the data.

Whether something works is a theoretical consideration, whereas clinical trials address only practical considerations. To show that a purely biological treatment works, you have to break down its mechanisms into known physical processes and observe/record those processes or their definite proxies. To show that a treatment “influences behavior in a particular way with high probability” or some other imprecise construct, you have to take extant knowledge, compose a theory, test the assumptions of the theory, test the predictions of the theory, revise the theory, argue with people who have evidence for other theories–for a few decades–and then wait for everyone you didn’t convince to die off.

Science!

What was the effect size?

There is no need for mental acrobatics over a question with a false premise.

How about this answer: “In the parlance of early-to-mid 20th century statisticians, there is no evidence of a treatment effect. In reality, someone computed an unnecessary statistic, either because they wanted an inflated level of confidence in their conclusions or else because other people expected them to report it.”

To statistical novices: All you need to know about “type I error, p-values, and hypothesis tests” is that a) they add little to no useful information beyond the information already contained in the point estimate and standard error, and b) trying to interpret p-values as if they do contain special information tends to be more misleading than it is informative.

To statistical experts: Our field’s deep need for improved pedagogy is, in this case, a red herring. If you are teaching these things at all, it should be as historical context and/or as a launching point for explaining the replication crisis and other ongoing problems. I remember learning in high school about Lamarckian evolutionary theory as a stepping stone to Darwinism. I remember very little about it, but I hold no grudges against the teacher or the curriculum or the field of professional biologists, as my ignorance does not weaken my practical understanding of natural selection. But if I had been taught Lamarckianism, in equal detail as Darwinism, because “although it’s something many argue you shouldn’t use because it’s wrong, you still have to learn it and be able to apply it because you will be asked to provide it by reviewers and may be required to teach it by your future academic department,” that would have been a huge impediment to my learning Darwinism and keeping it straight.

I agree on the nature of the disease, but I think the most promising cure is not so much in teaching different (e.g. Bayesian) methodology, but rather in teaching statisticians how to make scientific arguments “soup to nuts”. Better methodology is great (and I’m all for teaching more Bayesian methods), but unless we learn to engage more broadly in scientific argumentation, I’m not sure how effective these battles against the “p < 0.05" culture will be.

I've been trying to develop my ideas around this in the following blog posts (apologies if this self-promotion is poor form, but I assume linking over is better than trying to re-state my arguments here).

https://news.metrumrg.com/blog/a-significant-tilling-of-the-statistical-soil-the-asa-statement-on-statistical-significance-and-p-values-1

https://news.metrumrg.com/blog/statistical-significance-is-a-result-not-a-conclusion

https://news.metrumrg.com/blog/significance-and-directional-inference

With respect to Harrell, the question is very silly.

Far from being specific to p-values, such a silly question could also be asked about any single statistic such as a Bayes factor, posterior probability, or anything else. “What is the conclusion of a clinical trial where BF=2.99?”

Who interprets anything, especially a complex clinical trial, based on just one statistic?

But to toss a modified question back at Harrell or anyone else: what if, say, four experiments were sound, with no QRPs, and we obtained the resulting p-values from them: .63, .23, .4, and .54? Then we may be justified in saying there is no difference in the treatments in the population.

Justin

http://www.statisticool.com/objectionstofrequentism.htm

I don’t feel like you put much thought into this. The p-value is a function of sample size and the precision of your measurements. E.g., those could be calculated from four n=3 Mechanical Turk surveys.

But anyway, no one actually cares whether there is exactly zero difference to begin with.
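The sample-size point is easy to see in simulation: four tiny studies of a perfectly real effect will routinely produce large p-values. All numbers here are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Four tiny studies of a treatment whose true effect is a solid 0.5 sd.
# With n = 3 per study, large p-values are routine despite the real effect.
pvals = []
for _ in range(4):
    y = rng.normal(0.5, 1.0, size=3)        # one made-up mini-study
    pvals.append(stats.ttest_1samp(y, 0.0).pvalue)

print([round(p, 2) for p in pvals])
```

So a string of large p-values is evidence about the studies’ precision at least as much as about the treatment.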

Justin, your “objections to frequentism” page has some great stuff! I hadn’t seen that before and am looking forward to spending some time with it.

I agree with your remark about reasonable people not interpreting any complex clinical trial based on just one statistic, but I do think there is work to be done to increase the population of reasonable people. Within my own small sphere of influence, I am trying to encourage this “reasonable thinking” by emphasizing that:

“Statistical significance is a result, not a conclusion”. (My attempt at pith.)

In other words, I have no problem with citing the significance or non-significance of something in the Results section of an argument, and then telling me, in the context of a Discussion, why you think those results constitute evidence in favor of this or that hypothesis. But then, by the time you get to Conclusions, I really don’t want to see p-values or this or that determination of significance. I just want to see whether, *having assessed the totality of evidence in context*, you conclude “X” or “not X”, or “not enough information at present”.

I’m still sort of early days with this particular way of framing things, but I think it holds the promise of discouraging, as you say, “[interpretation] based on just one statistic”.

No one should look at just one statistic. Also true: some single summaries are better than others. The p-value answers a question no one actually cares about most of the time, whereas other options, like expected utility under posterior uncertainty, at least are optimal for a decent class of real-world decision problems.

The p-value is also more importantly a function of the test statistic and the data from your well-designed experiments that have hopefully been replicated and with no QRPs.

Given that people apparently care to test nulls (see https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1497540), you’ve provided evidence against your own hypothesis that ‘no one actually cares’. Maybe no one truly believes a difference is exactly some constant C, where C can be anything, but that is beside the point: the method works, and having effect = C (or no effect, C = 0) is a good model for modus tollens logic.

A small set of examples where people do care to test nulls:

https://pdfs.semanticscholar.org/37dd/76dbae63b56ad9ccc50ecc2c6f64ff244738.pdf

https://adamslab.nl/wp-content/uploads/2019/05/Confessions-of-a-p-value-lover-20190327.pdf

https://amzn.to/2MABuTn

http://www.statisticool.com/nobelprize.htm

http://www.statisticool.com/quantumcomputing.htm

Try again.

Justin

I said no one actually cares if the difference is exactly zero. Obviously 99% of the people doing NHST don’t have the slightest clue what they are actually calculating.

It’s true that reasonable people don’t usually care if the difference is exactly zero. But it’s not uncommon at all to want to know whether the difference is less than zero or greater than zero. And a single two-sided hypothesis test of H0: m2 – m1 = 0 vs. HA: m2 – m1 != 0 is operationally equivalent to two one-sided hypothesis tests: H01: m2 – m1 <= 0 vs. HA1: m2 – m1 > 0 and H02: m2 – m1 >= 0 vs. HA2: m2 – m1 < 0 (each performed at alpha / 2 to account for the fact that you are giving yourself two chances to make a mistake). And, because of that equivalence, the directional inference that people do justifiably desire is in fact on offer from point null hypothesis tests. (Though we could certainly find more intelligent ways of talking about what we are doing.)
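That operational equivalence is easy to verify numerically: with scipy’s t-test, the two-sided p-value is exactly twice the smaller of the two one-sided p-values. The data here are arbitrary:

```python
import numpy as np
from scipy import stats

# Deterministic illustrative data; group a is shifted up by 1.
a = np.arange(10, dtype=float) + 1.0
b = np.arange(10, dtype=float)

p_two_sided = stats.ttest_ind(a, b).pvalue
p_greater = stats.ttest_ind(a, b, alternative='greater').pvalue
p_less = stats.ttest_ind(a, b, alternative='less').pvalue

# Rejecting at alpha with the two-sided test is the same as rejecting
# the matching one-sided test at alpha / 2.
print(p_two_sided, 2 * min(p_greater, p_less))
```

So a “significant” two-sided result at alpha already licenses the directional claim at alpha / 2.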

I doubt this. People care about whether a treatment effect is 0.00001% vs -.00001%?

People care about practical differences, which is going to be specific to their problem. They don’t actually care about what the canned stats routine is telling them, which is why they make up BS about what the output means. They literally can not believe how stupid it is.

Well, the thing is, scientific argument generally involves weaving together multiple lines of evidence that aren’t necessarily commensurable (at least, this has been my experience). For example, if you have one line of evidence (based on in vitro enzymology, let’s say) that suggests that a treatment should (if anything) improve your cognition, and then you have another line of evidence (based on a clinical trial, let’s say) that suggests that the treatment actually decreases your cognition, that disconnect is probably very important.

Those two lines of evidence probably aren’t commensurable, unless you have some really great translational model to convert the magnitudes that you see on one scale to the magnitudes that you see on another scale. But it’s enough — for certain purposes — to realize that those two lines of evidence are directionally inconsistent with each other. That’s enough to tell you to go back to the drawing board and come up with some new hypotheses about what is going on.

Or, to say it another way:

“[The] sign of an effect is often crucially relevant for theory testing, so the possibility of Type S errors [i.e. errors in directional inference] should be particularly troubling to basic researchers interested in development and evaluation of scientific theories.”

Which excerpt I have taken from this very nice paper: http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf

What important/useful theories were tested in this way? Usually there is some magnitude expected according to the theory, not a negligibly small effect.

Most of the examples I know have to do with pharmaceutical drug development. A lot of the examples fit the pattern that I sketched above. Just to pick one of the more notable examples, consider what happened with Pfizer’s development of torcetrapib. They knew that it boosted HDL cholesterol. Based on that and everything else they knew, that drug should have improved cardiovascular outcomes and decreased all-cause mortality. Then the data came back from one of their interim analyses that the drug was in fact causing worse cardiovascular outcomes and increasing all-cause mortality. All they needed to know was that they were moving in the wrong direction; the drug was making worse something that it was supposed to make better. They didn’t need to know how far they were moving in the wrong direction. Just knowing the direction was “the wrong way” was enough to cancel the entire program overnight. The fact that those “in the wrong direction” results didn’t add up prompted further research, and from that we learned more about the aldosterone pathways (turns out torcetrapib was boosting aldosterone, a fact that no one had anticipated or been able to detect previously).

Gelman and Carlin, the authors of the paper that I was quoting there, presumably have other examples.

> Based on that and everything else they knew, that drug should have improved cardiovascular outcomes and decreased all-cause mortality.

So they devoted millions of dollars to a drug they thought might improve cardiovascular outcomes an arbitrarily small amount?

Sorry, but I don’t believe that. I think they expected some substantial, clinically meaningful improvement.

Typo correction (or maybe my “less than” sign got gobbled up as some sort of html tag?)

What I meant was (this time re-writing in quasi-latex):

“… operationally equivalent to two one-sided tests: H01: m1 – m2 leq 0 vs. HA1: m1 – m2 gt 0 and H02: m1 – m2 geq 0 vs. HA2: m1 – m2 lt 0 … “

Justin

Your examples listed above show a confusion of issues. I realize some people are totally against p-values: I am not, although I am against the use of any particular p value as the threshold for decision making. One of your examples considers the difference between a p value of 0.049 and 0.051, saying that while the difference is immaterial, some standard must be used (as with differentiating between those who pass and those who fail a competency exam). But the real issue is whether that 0.05 standard should be applied to all decisions that must be made on the basis of available evidence. When a dichotomous decision must be made, I am not against looking at the p value, but surely the threshold that is chosen should be based on a careful consideration of costs and benefits associated with each decision. The use of 0.05 for all decisions seems ludicrous to me.

Most of the other references are tied to the fact that many researchers, including famous and prized ones, use NHST. That is hardly a compelling justification – it is an excuse for things changing very slowly. Why waste energy fighting against an entrenched methodology? My answer: because we can do better. I am not advocating abandoning the traditional evidence as doing better (though I do think NHST and p values are better left behind) – confidence intervals, along with a careful description and critique of methodology and careful decision analysis, comprise what I think would be doing better. The problem, as I see it, with NHST is that it easily slips into a rigid threshold for the p value. If NHST properly accounts for the costs and benefits associated with alternative decisions, then the null hypothesis and p value add nothing – and actually provide less – than the confidence interval. I realize that confidence intervals and p values are two sides of the same coin – but only when misused to make rigid decisions. If we abandon the idea that the statistical evidence should result in a declaration that an effect is real or that there is insufficient evidence to conclude so, then I think confidence intervals are quite useful.