I think the biggest issue statistics faces is that most people using and interpreting statistics are not themselves statisticians.

I am not a statistician, though I do have an interest in the field (obviously, since I’m reading a statistics blog), but I use statistics in my field almost every day, either running the analyses myself or interpreting someone else’s. I can’t really think of another field like this (English departments maybe?). There are far more scientists who use statistics than graduates of statistics programs at universities, and most science teams I’m familiar with do not have a single MS- or PhD-holding statistician in their midst.

Another non-statistician here.

So, would a simple substitution of “didn’t” for “doesn’t” help at all here?

“The treatment doesn’t work” suggests a conclusion with some degree of finality, and implies it will not work in the future.

“The treatment didn’t work” only speaks to treatment effect in that specific clinical trial, which (assuming sufficient power and competence of design/implementation of the clinical trial) does not necessarily speak to future trials.

Justin

Your examples listed above show a confusion of issues. I realize some people are totally against p-values: I am not, although I am against the use of any particular p value as the threshold for decision making. One of your examples considers the difference between a p value of 0.049 and 0.051, saying that while the difference is immaterial, some standard must be used (as with differentiating between those who pass and those who fail a competency exam). But the real issue is whether that 0.05 standard should be applied to all decisions that must be made on the basis of available evidence. When a dichotomous decision must be made, I am not against looking at the p value, but surely the threshold that is chosen should be based on a careful consideration of costs and benefits associated with each decision. The use of 0.05 for all decisions seems ludicrous to me.

Most of the other references are tied to the fact that many researchers, including famous and prized ones, use NHST. That is hardly a compelling justification – it is an excuse for things changing very slowly. Why waste energy fighting against an entrenched methodology? My answer: because we can do better. I am not advocating abandoning the traditional evidence as doing better (though I do think NHST and p values are better left behind) – confidence intervals, along with a careful description and critique of methodology and careful decision analysis, comprise what I think would be doing better. The problem, as I see it, with NHST is that it easily slips into a rigid threshold for the p value. If NHST properly accounts for the costs and benefits associated with alternative decisions, then the null hypothesis and p value add nothing to – and actually provide less than – the confidence interval. I realize that confidence intervals and p values are two sides of the same coin – but only when misused to make rigid decisions. If we abandon the idea that the statistical evidence should result in a declaration that an effect is real or that there is insufficient evidence to conclude so, then I think confidence intervals are quite useful.
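A quick numerical illustration of the “two sides of the same coin” point (my own sketch, assuming normal data and a one-sample t-test; not from the comment itself): a 95% confidence interval excludes zero exactly when the two-sided p-value falls below 0.05, but the interval additionally carries the magnitude information argued for above.

```python
# Sketch, under assumed normal data: the duality between a 95% confidence
# interval and a two-sided test of H0: mu = 0 at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.4, scale=1.0, size=30)  # hypothetical sample, modest true effect

n = len(x)
mean = x.mean()
se = x.std(ddof=1) / np.sqrt(n)

# Two-sided one-sample t-test of H0: mu = 0
t_stat = mean / se
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# 95% confidence interval for mu
half_width = stats.t.ppf(0.975, df=n - 1) * se
ci = (mean - half_width, mean + half_width)

# Duality: the interval excludes 0 exactly when p < 0.05 -- yet the interval
# also shows the range of effect sizes compatible with the data.
assert (p_two_sided < 0.05) == (ci[0] > 0 or ci[1] < 0)
```

The assertion holds for any sample, not just this simulated one; the interval is simply the richer summary.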

> Based on that and everything else they knew, that drug should have improved cardiovascular outcomes and decreased all-cause mortality.

So they devoted millions of dollars to a drug they thought might improve cardiovascular outcomes an arbitrarily small amount?

Sorry, but I don’t believe that. I think they expected some substantial, clinically meaningful improvement.

Most of the examples I know have to do with pharmaceutical drug development. A lot of the examples fit the pattern that I sketched above. Just to pick one of the more notable examples, consider what happened with Pfizer’s development of torcetrapib. They knew that it boosted HDL cholesterol. Based on that and everything else they knew, that drug should have improved cardiovascular outcomes and decreased all-cause mortality. Then the data came back from one of their interim analyses that the drug was in fact causing worse cardiovascular outcomes and increasing all-cause mortality. All they needed to know was that they were moving in the wrong direction; the drug was making worse something that it was supposed to make better. They didn’t need to know how far they were moving in the wrong direction. Just knowing the direction was “the wrong way” was enough to cancel the entire program overnight. The fact that those “in the wrong direction” results didn’t add up prompted further research, and from that we learned more about the aldosterone pathways (turns out torcetrapib was boosting aldosterone, a fact that no one had anticipated or been able to detect previously).

Gelman and Carlin, the authors of the paper that I was quoting there, presumably have other examples.

> [The] sign of an effect is often crucially relevant for theory testing

What important/useful theories were tested in this way? Usually there is some magnitude expected according to the theory, not a negligibly small effect.

No one should look at just one statistic. Also true: some single summaries are better than others. The p-value answers a question no one actually cares about most of the time, whereas other options, like expected utility under posterior uncertainty, are at least optimal for a decent class of real-world decision problems.

Typo correction (or maybe my “less than” sign got gobbled up as some sort of html tag?)

What I meant was (this time re-writing in quasi-latex):

“… operationally equivalent to two one-sided tests: H01: m1 – m2 leq 0 vs. HA1: m1 – m2 gt 0 and H02: m1 – m2 geq 0 vs. HA2: m1 – m2 lt 0 … “

Or, to say it another way:

“[The] sign of an effect is often crucially relevant for theory testing, so the possibility of Type S errors [i.e. errors in directional inference] should be particularly troubling to basic researchers interested in development and evaluation of scientific theories.”

Which excerpt I have taken from this very nice paper: http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf

Well, the thing is, scientific argument generally involves weaving together multiple lines of evidence that aren’t necessarily commensurable (at least, this has been my experience). For example, if you have one line of evidence (based on in vitro enzymology, let’s say) that suggests that a treatment should (if anything) improve your cognition, and then you have another line of evidence (based on a clinical trial, let’s say) that suggests that the treatment actually decreases your cognition, that disconnect is probably very important.

Those two lines of evidence probably aren’t commensurable, unless you have some really great translational model to convert the magnitudes that you see on one scale to the magnitudes that you see on another scale. But it’s enough — for certain purposes — to realize that those two lines of evidence are directionally inconsistent with each other. That’s enough to tell you to go back to the drawing board and come up with some new hypotheses about what is going on.

Justin, your “objections to frequentism” page has some great stuff! I hadn’t seen that before and am looking forward to spending some time with it.

I agree with your remark about reasonable people not interpreting any complex clinical trial based on just one statistic, but I do think there is work to be done to increase the population of reasonable people. Within my own small sphere of influence, I am trying to encourage this “reasonable thinking” by emphasizing that:

“Statistical significance is a result, not a conclusion”. (My attempt at pith.)

In other words, I have no problem with citing the significance or non-significance of something in the Results section of an argument, and then telling me, in the context of a Discussion, why you think those results constitute evidence in favor of this or that hypothesis. But then, by the time you get to Conclusions, I really don’t want to see p-values or this or that determination of significance. I just want to see whether, *having assessed the totality of evidence in context*, you conclude “X” or “not X”, or “not enough information at present”.

I’m still sort of early days with this particular way of framing things, but I think it holds the promise of discouraging, as you say, “[interpretation] based on just one statistic”.

> But it’s not uncommon at all to want to know whether the difference is less than zero or greater than zero.

I doubt this. People care about whether a treatment effect is 0.00001% vs -.00001%?

People care about practical differences, which are going to be specific to their problem. They don’t actually care about what the canned stats routine is telling them, which is why they make up BS about what the output means. They literally cannot believe how stupid it is.

It’s true that reasonable people don’t usually care if the difference is exactly zero. But it’s not uncommon at all to want to know whether the difference is less than zero or greater than zero. And a single two-sided hypothesis test of H0: m2 – m1 == 0 vs. HA: m2 – m1 != 0 is operationally equivalent to two one-sided hypothesis tests: H01: m2 – m1 <= 0 vs. HA1: m2 – m1 > 0 and H02: m2 – m1 >= 0 vs. HA2: m2 – m1 < 0 (each performed at alpha / 2 to account for the fact that you are giving yourself two chances to make a mistake). Because of that equivalence, the directional inference that people do justifiably desire is in fact on offer from point null hypothesis tests. (Though we could certainly find more intelligent ways of talking about what we are doing.)
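The operational equivalence described above can be checked numerically (my own sketch with simulated data; scipy’s `alternative` argument supplies the one-sided versions):

```python
# Sketch: a two-sided t-test at level alpha behaves like two one-sided tests
# each run at alpha / 2. Data are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, 40)
g2 = rng.normal(0.5, 1.0, 40)

p_two = stats.ttest_ind(g2, g1, alternative="two-sided").pvalue
p_greater = stats.ttest_ind(g2, g1, alternative="greater").pvalue  # HA1: m2 - m1 > 0
p_less = stats.ttest_ind(g2, g1, alternative="less").pvalue        # HA2: m2 - m1 < 0

# The two-sided p-value is twice the smaller one-sided p-value...
assert np.isclose(p_two, 2 * min(p_greater, p_less))

# ...so rejecting at alpha (two-sided) is the same as rejecting one of the
# one-sided tests at alpha / 2 -- and the rejected side gives the direction.
alpha = 0.05
assert (p_two < alpha) == (p_greater < alpha / 2 or p_less < alpha / 2)
```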

I said no one actually cares if the difference is exactly zero. Obviously 99% of the people doing NHST don’t have the slightest clue what they are actually calculating.

This is the problem I have had all along with severe testing. As a general concept (to cover measurement issues, model assumptions, statistical analyses, etc.) I think it is great. There are many ways a claim can be said to have failed to undergo “severe testing.” However, when confined to a specific analysis resulting in a particular p-value, I have been less enthusiastic. The examples in Mayo’s book seem like the latter to me. Once many of the myriad issues are assumed away, the remaining use of the p-value seems to comport with its standard (meaning generally accepted by those that use p values) meaning, and severe testing just seems like a reiteration of the textbook meaning of p values (and I don’t have a problem using p values though I reject the use of any particular significance cutoff level). But the whole appeal of severe testing to me is the idea that the concept can cover the spectrum of issues associated with collection of data and its analysis. What I found wanting in Mayo’s book was the lack of any specific guidance that covers those general issues.

So, if a p-value is 0.6, I’d like to ask if it has been severely tested. If any of the issues we think about (as in Andrew’s comment just above) are relevant, then I am reluctant to infer much (if anything) from the p-value. I think those that would say the p value of 0.6 tells us something, must be making a number of assumptions about the study that resulted in that value. And it is precisely those assumptions that seem to concern much of our attention.

Given that people apparently care to test nulls (see https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1497540), you’ve provided evidence against your own hypothesis that ‘no one actually cares’. That maybe no one truly believes a difference is exactly a constant C, where C can be anything, is beside the point: the method works, and having effect = C (with no effect being C = 0) is a good model for modus tollens logic.

A small set of examples where people do care to test nulls:

https://pdfs.semanticscholar.org/37dd/76dbae63b56ad9ccc50ecc2c6f64ff244738.pdf

https://adamslab.nl/wp-content/uploads/2019/05/Confessions-of-a-p-value-lover-20190327.pdf

https://amzn.to/2MABuTn

http://www.statisticool.com/nobelprize.htm

http://www.statisticool.com/quantumcomputing.htm

Try again.

Justin

There are some, for sure. High energy physics comes to mind:

http://www.pp.rhul.ac.uk/~cowan/stat/cowan_munich16.pdf

As does some work on quantum computing and work by Nobel prize winners:

http://www.statisticool.com/quantumcomputing.htm

http://www.statisticool.com/nobelprize.htm

Justin

> But to toss a modified question back at Harrell or anyone else, what if say four experiments were sound, no QRPs, and we obtained the resulting p-values from them: .63, .23, .4, and .54? Then we may be justified in saying there is no difference in the treatments in the population.

I don’t feel like you put much thought into this. The p-value is a function of sample size and the precision of your measurements. E.g., those p-values could have come from four n=3 Mechanical Turk surveys.

But anyway, no one actually cares whether there is exactly zero difference to begin with.
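A small simulation (mine, not the commenter’s, under assumed normal data) makes the sample-size point concrete: with a real but modest effect, n=3 studies routinely return large p-values, so a string of large p-values from underpowered studies says little about the population difference.

```python
# Sketch: p-values depend heavily on sample size. A real effect of half a
# standard deviation usually gives p > 0.2 at n = 3 per arm, almost never at n = 200.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect = 0.5  # half an SD: a real, non-trivial difference

def one_pvalue(n):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_effect, 1.0, n)
    return stats.ttest_ind(b, a).pvalue

small = [one_pvalue(3) for _ in range(1000)]    # like the n=3 surveys above, repeated
large = [one_pvalue(200) for _ in range(1000)]

frac_small = np.mean([p > 0.2 for p in small])
frac_large = np.mean([p > 0.2 for p in large])
assert frac_small > 0.5   # most tiny studies "find nothing" despite the effect
assert frac_large < 0.05  # adequately powered studies rarely do
```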

I’d bet the Nobel prize winners using p-values would disagree with you.

http://www.statisticool.com/nobelprize.htm

Justin

Far from being specific to p-values, such a silly question could also be asked about any single statistic such as a Bayes factor, posterior probability, or anything else. “What is the conclusion of a clinical trial where BF=2.99?”

Who interprets anything, especially a complex clinical trial, based on just one statistic?

But to toss a modified question back at Harrell or anyone else, what if say four experiments were sound, no QRPs, and we obtained the resulting p-values from them: .63, .23, .4, and .54? Then we may be justified in saying there is no difference in the treatments in the population.

Justin

http://www.statisticool.com/objectionstofrequentism.htm

Nice!

That reminds me of a scene in Heinlein’s “Stranger in a Strange Land”. One character is what is called a “Fair Witness” – a highly trained observer whose statements about what he/she witnessed can be taken as fact in a court of law.

She is asked “What color is that house over there?”. The answer – “It’s white on this side.”

Just stating what the data show, no more.

By “genuine” do you mean “free from p-hacking”?

Worth noting, perhaps, that Pr(H_0) is Bayesian, because it says that a hypothesis has a probability. Also, Pr(Data|H_0) is not right either, because outcomes that were not observed are included as well, as a rule. And, as has been mentioned, H_0 is not the only hypothesis in the givens.

Michal:

I don’t see why I can’t talk that way in normal conversation. For example, when discussing “power pose” on this blog, I’ve never said that it “doesn’t work” or that it has “no effect.” I’ve consistently said that I expect any effects it has to be highly variable across people and situations, and that various claimed effects in the literature were not supported by the data.

Are climate scientists hard scientists? See http://www.realclimate.org/index.php/archives/2020/03/why-are-so-many-solar-climate-papers-flawed/

And conversely, p<0.05 means something like: "There is a barely discernible signal in the noise, but we don't know how big it is; could it just be a small measurement bias?"

You can state it more concretely as "the data are too sparse or too noisy to say anything" or "the effect is too small to detect given the level of noise in the data".
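The flip side of that statement can be sketched the same way (my own illustration, assuming a tiny systematic offset in otherwise null data): with enough observations, even a negligible bias pushes p below 0.05.

```python
# Sketch: "significant" need not mean "big". A 0.01-SD offset, e.g. a small
# measurement bias, becomes statistically significant once n is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 1_000_000
bias = 0.01  # a hundredth of a standard deviation: practically nothing
x = rng.normal(bias, 1.0, n)

res = stats.ttest_1samp(x, popmean=0.0)
assert res.pvalue < 0.05     # a clearly "discernible signal in the noise"...
assert abs(x.mean()) < 0.05  # ...from an effect too small to matter in practice
```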

But isn’t that true for any effect? You can only make probabilistic statements about a drug working or not working, whatever the definition of “working” might be, based on trial data. But in any normal conversation it’s impossible to talk that way.

> If this is a genuine high P-value, there’s poor evidence for a genuine effect.

A “genuine high p-value”? Every p-value is genuine assuming the assumptions the calculation was derived from are all true, and not genuine if any are false. Almost all uses fall in the second category.

John:

Official guides and best practice manuals are increasingly getting basic definitions wrong.

https://errorstatistics.com/2019/09/30/national-academies-of-science-please-correct-your-definitions-of-p-values/

To their credit, the National Academies of Science did fix the most glaring cases of erroneously defined P-values, after I wrote to them.

To move from “we cannot tell from the results in this study that such and such works”

to

“it doesn’t work”

is a glaring fallacy. For one thing, the study may have had no capability of uncovering that it works even if it does.

Tom:

If this is a genuine high P-value, there’s poor evidence for a genuine effect. It may warrant inferring an upper bound for the population effect (with a severity = the corresponding confidence level)

Agreed. Even many stats professors teach statistics to students in other sciences as a set of software features. In your terms, the statistics in a paper are not the numbers – they’re the paper.

Right. I also just realized that the main post referred to biostat PhD’s, not biology PhD’s, so your observation definitely applies!

I've been trying to develop my ideas around this in the following blog posts (apologies if this self-promotion is poor form, but I'm assuming that linking over is better than trying to re-state my arguments here).

https://news.metrumrg.com/blog/a-significant-tilling-of-the-statistical-soil-the-asa-statement-on-statistical-significance-and-p-values-1

https://news.metrumrg.com/blog/statistical-significance-is-a-result-not-a-conclusion

https://news.metrumrg.com/blog/significance-and-directional-inference

Tom:

I’m happy working “severe testing” into any discussion of p-values. When discussing a statistical method, it’s important to think about not just its mathematical foundations and not just about what it does in practice, but also about the real-world problem it’s intended to address. In the case of p-values, severe testing is part of the goal.

That’s true, though it might be easier to do this for the books published by non-statisticians. I would hope these are where more of the errors are and that it will be easier to convince statisticians to clean up the language. I think the issue is that we are so accustomed to short-hand jargon and its longer meaning that we have sometimes forgotten that the correct description is not obvious. In any case, if this problem extends even to biostatistics, then it is worse than I had thought.

In any case, practically all introductory statistics courses are geared toward getting students running tests and drawing conclusions by the most direct way possible. In this way, we reinforce the idea that statistics is for making decisions rather than modeling uncertainty. It seems to me that a properly taught statistics course has as much in common with a composition course as with a mathematics course. Precision and correctness of language is no less essential than correct calculation procedures. In this regard, another thing working against us is that evaluating the former takes a great deal more time than the latter which can easily be automated.

The premise that the p-value is sufficient information is already so wrong. It isn’t even relevant information.

I’ve been totally ignoring all p-values for years and seem to have a better eye for interesting data than the people publishing it…

True, but I do know that one of the universities had that on the answer key for the final exam in intro stats, and the text at that university introduced the p-value with a correct definition that then morphed into, essentially, “the p-value is the probability the null hypothesis is true.”

I once had a two-day fight with my boss over part of a study, one that went so far that I wondered if I would resign. It turned out that his terminology (economist) and mine (psychologist) were totally different but meant the same thing.

Language is important.

To clarify my point above about adding the caveat “the data are consistent with a large effect also”: I am assuming we have the parameter estimate and confidence intervals, to be able to say something about the range of effect sizes. I understand that a p-value alone tells us nothing about effect sizes.

Or, as Mayo would say, the hypothesis has not been severely tested, so it has received only weak support.

So far we have a statement misinterpreting the p-value, “The treatment doesn’t work,” and a correct, though brief, interpretation, “No evidence of a treatment effect.” In my experience, most interpretations of significance in the literature are written as “There is no treatment effect.” Hence, are we saying that if we add “evidence of a” in between the “no” and “treatment” then we’re in the clear? Is the distinction between “good” and “bad” interpretations that subtle? Based on my reading of this blog and papers mentioned herein, I had thought it was critical to add the caveat that Mathijs mentions: “The data are consistent with a large effect also” or, at least, researchers should keep the caveat in mind when writing their results and discussion sections.

As is mine. My comment referred to the life sciences (and, I think, social sciences) mentioned in Raghuveer’s post.

“I’m much more interested in how to do statistics well than in how to do it poorly.”

Yes, that is true; the above was facetious.

“I’ve also spent lots of time over the year responding to misconceptions in blog comments. That might be a waste of time”

Definitely not a waste of time. I’ve learned a lot here. I like the debate. Many others too, whose contributions I look forward to reading.

Specify what you mean by “it works”, which should be something like “the overall benefit as measured by U(x) is positive”. Then specify the protocol to be tested, and the statistical assumptions behind the data analysis model in terms of a mechanism, a measurement protocol, a prior for the unknown parameters in the mechanism and measurement protocol, and a probability of observing some data given the parameters, mechanism, and measurement protocol… Then compute the posterior distribution of the parameters, simulate the expected outcome in future uses (the posterior predictive) and show a variety of things like the expected U(x) and the posterior probability that U(x) is positive, as well as the distribution of U(x).

At this point, if Expected(U(x)) is positive, declare “as far as we know, it’s worth using”; if p(U(x) > 0) is large (say, 95%), declare “the evidence shows that it probably works”.
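A toy version of that workflow (my own sketch, under deliberately simple assumptions: normal outcomes with known unit variance, a Normal(0, 1) prior on the mean benefit, simulated data, and the naive utility U(x) = x) might look like:

```python
# Minimal sketch of the posterior / posterior-predictive / utility workflow
# described above, with conjugate normal updating standing in for a real model.
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical trial data: observed net-benefit measurements for the protocol.
data = rng.normal(0.5, 1.0, 50)

# Prior mu ~ Normal(0, 1); likelihood x | mu ~ Normal(mu, 1).
prior_mean, prior_var, like_var = 0.0, 1.0, 1.0
n = len(data)
post_var = 1.0 / (1.0 / prior_var + n / like_var)
post_mean = post_var * (prior_mean / prior_var + data.sum() / like_var)

# Posterior draws of mu, then posterior-predictive draws of future outcomes x.
mu = rng.normal(post_mean, np.sqrt(post_var), 100_000)
x_future = rng.normal(mu, 1.0)  # one predictive draw per posterior draw

U = x_future  # toy utility: the net benefit itself
print("E[U(x)]     =", U.mean())        # "worth using" if positive
print("p(U(x) > 0) =", (U > 0).mean())  # "probably works" if large, e.g. > 0.95
```

The mechanism, prior, and utility here are placeholders; the point is only the shape of the computation, not any particular model.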

So taking it beyond the question of deriving conclusions from a specific set of data…

What soup-to-nuts protocol would be required in order to do a “trial” demonstrating convincingly that a certain ointment does not prevent pimples?

Or rather I mean to ask, is there any conceivable experiment that this group feels can convincingly answer a question of that form with an answer of “it does not work”?

Jim:

Just to be clear, I’m much more interested in how to do statistics well than in how to do it poorly. But it can be useful to study how it is done poorly, in order to understand how to do it well. I’ve learned a lot from all these discussions and bad examples over the years. Even in this thread, I learned something; see here.

I’ve also spent lots of time over the years responding to misconceptions in blog comments. That might be a waste of time, but I think of these discussions as practice for more formal explications.

Biologist: writes many papers, obtains fame for discovering new endangered species

Statistician: inspired to start a blog about how poorly scientists understand statistics, recognized as an expert, obtains some notoriety

Philosopher: dies in a homeless camp under a freeway, in abject poverty