Blake McShane and Valentin Amrhein point us to an announcement (see page 7 of this newsletter) from Karen Kafadar, president of the American Statistical Association, which states:

Task Force on Statistical Significance and Replicability Created

At the November 2019 ASA Board meeting, members of the board approved the following motion:

An ASA Task Force on Statistical Significance and Reproducibility will be created, with a charge to develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors. The task force will be appointed by the ASA President with advice and participation from the ASA BOD. The task force will report to the ASA BOD by November 2020. . . .

Based on the initial meeting, these members decided “replicability” was more in line with the critical issues than “reproducibility” (cf. National Academy of Sciences report, bit.ly/35YBLbu), hence the title of the task force is ASA Task Force on Statistical Significance and Replicability. . . .

Blake and Valentin and I are a little bit concerned that (a) this might become an official “ASA statement on Statistical Significance and Replicability” and could thus have an enormous influence, and (b) the listed committee seems like a bunch of reasonable people, no bomb-throwers like us or Nicole Lazar or John Carlin or Sander Greenland or various others to represent the voice of radical reform. We’re all reasonable people too, but we’re reasonable people who start from the perspective that, whatever its successes in engineering and industrial applications, null hypothesis significance testing has been a disaster in areas like social, psychological, environmental, and medical research—not the perspective that it’s basically a good idea that just needs a little bit of tinkering to apply to such inexact sciences.

I respect the perspectives of the status-quo people, the “centrists,” as it were—they represent a large group of the statistics community and should be part of any position taken by the American Statistical Association—but I think our perspective is important too.

I also don’t think that concerns about null hypothesis significance testing should be placed into a Bayesian/frequentist debate, with a framing that the Bayesians are foolish idealists and the frequentists are the practical people . . . that might have been the case 50 years ago, but it’s not the case now. As we have repeatedly written, the problem with thresholds is that they are used to finesse real uncertainty, and that’s an issue whether the threshold is based on p-values or posterior probabilities or Bayes factors or whatever. Again, we recognize and respect opposing views on this; our concern here is that the ASA discussion represents our perspective too, a perspective we believe is well supported on theoretical grounds and is also highly relevant to the recent replication crises in many areas of science.

This post is to stimulate some publicly visible discussion before the task force reports to the ASA board and in particular before the ASA board comes to a decision. The above-linked statement informs us that the leaders of this effort welcome input and are working on a mechanism for receiving comments from the community.

So go for it! As usual, feel free in the comments to disagree with me.

The way I’ve always seen it is that the underlying problem is that publishing a result is a binary decision. While we might hope that any well-done, well-conducted study is published and read by everyone, the reality (at least nowadays) is that being published, being read, and being reported on are outcomes that are constrained in how often they may occur. Therefore, the publication decision becomes inevitably influenced by estimated effect size, sample size, degree of accuracy, and all those other bits and pieces that go into significance testing, and a threshold becomes semi-implied even if it’s not explicitly stated.

Trying to avoid an explicit threshold opens up a second sort of error: the temptation for interested actors to dismiss results they dislike as “uncertain” while loosening the standards for results they want to be true. Maybe we can call this sort of approach “cynical Bayes”…

But even if one defines an “explicit threshold” for claiming “strong” evidence of an effect, no editor, reviewer, or reader is obliged to agree that it is a reasonable threshold for that purpose.

When I see a result I may have interest in the actual p-value and many other things, but whether the author labels the result “significant” or “nonsignificant” will have no bearing on how I interpret it, and I hope the same would hold true for editors and all other readers.

For smart and ethical scientists, they are their own best “gatekeepers.” And for everyone else, only good editors and reviewers (extremely rare commodities?) can do that job well.

Good editors and reviewers, perhaps, but I think readers in general put a lot of credence in what an author of an article writes – especially in Abstracts and Discussion sections. Even in terms of peer review, I have had papers rejected for failing to have a ‘clear message’. It’s a lot of work to interpret for yourself!

Quite possibly “readers in general” do. But given the well-documented and massive misuse of even simple statistical procedures across the sciences, I suggest giving zero credence to Abstracts and Discussions UNTIL one has scrutinized the Methods and Results sections. And these days, given editors’ dislike of long and detailed Methods sections, judging the quality of an article also requires scrutiny of Supplemental Information files. We skeptics create a lot of work for ourselves!

A related issue on which your comment would be welcome: verbal and communication skills are often not the forte of people who go into mathematics and statistics. The problem is likely more severe (in North America) for those whose first language is not English, and that would include a ton of Chinese folks, both foreign nationals and first-generation Chinese immigrants. I believe very few graduate programs in statistics require that applicants take the Verbal SAT, though they likely require the TOEFL test for foreign nationals.

This factor may not matter too much for mathematical or theoretical statisticians who communicate mostly with each other, and often via symbolic language. But it may matter a great deal for applied statisticians who are advising and writing for less numerate folks in other disciplines. What do you think?

When serving as editor of symposia volumes with contributors from China and many other non-English-speaking countries, I was happy to invest a lot of time, often rewriting paragraphs or whole sections for the authors, but most editors don’t feel they have that obligation.

I usually advised authors whose first language was not English to have their manuscript checked by a native English speaker before submitting. Of course there are plenty of such people who are both “native” and mediocre writers!

Easiest partial solution: require that all applicants to statistics graduate programs take the Verbal SAT.

I can’t really speak to that much, unfortunately; I don’t think my personal situation is terribly representative. My sense is that finding the combination of native speaker *and* specialist knowledge *and* availability can be quite difficult. Having a scientific manuscript checked by someone without scientific training is a very dangerous enterprise….

From: Ron S. Kenett

Sent: 21 February, 2020 4:54 PM

Subject: ASA task force

Dear Linda and Xuming,

I am writing to you as co-chairs of the ASA task force on Statistical Significance and Replicability, putting Yoav on the CC since I had several exchanges with him regarding this topic.

First of all, I am glad ASA decided to launch a task force and, given past experience, I have some thoughts on this worth sharing.

1. ASA statements

A precursor to the p-value ASA statement was a statement on value-added models (VAM) used in education. In my book on Information Quality we evaluated this statement and found it to be of poor information quality (see attached). I believe the ASA should have done some retrospective on it before moving to the p-value statement, labeled ASA I by Deborah Mayo. The role and modus operandi of professional organizations in such matters is worth discussing per se. My opinion on this is mentioned in response to question 8 in https://www.linkedin.com/pulse/ten-questions-statistics-data-science-2020-beyond-ron-s-kenett/

2. Reproducibility and repeatability

I am glad the term used for the task force is replicability. The terms reproducibility and repeatability have specific meanings in industrial statistics, and using coherent terminology is obviously of some importance. The attached note we published in Nature Methods was designed to clarify this. The National Academy of Sciences report got the terminology wrong. They also got the definition of a p-value wrong; Mayo was able to get this fixed, I was not. What they also did, which is important, is discuss the concept of generalizability (or generalization). This is one of the dimensions of information quality, and I have written extensively about it.

3. Generalization of findings

Research claims are almost always expressed verbally. I have suggested including a section in research papers dedicated to the generalization of findings using alternative representations. The approach is presented in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070 with examples from clinical and translational research. Also attached is a talk prepared for the ASA SSI 2017 conference in Bethesda where I presented this.

I wish you a productive and effective task force effort and look forward to its outcomes.

Bst

ron

Professor Ron S. Kenett

Senior Research Fellow, Samuel Neaman Institute for National Policy Research, Technion, Israel

Editor of Wiley’s StatsRef and StatisticsViews

Past President, European Network for Business and Industrial Statistics (ENBIS)

Past President, Israel Statistical Association (ISA)

Chairman, The KPA Group, Israel

http://www.kpa-group.com Email: ron@kpa-group.com Mobile: +972-52-2434491

http://www.amazon.com/author/rkenett

Regarding ’embracing uncertainty’ and the implication that frequentism doesn’t because it uses thresholds: I once read a paper with a prior on something being in a tight range, like between 1 and 3. Then the posterior was derived and a CI was constructed. The paper discussed how the Bayesian CI was smaller than the frequentist CI, which didn’t use such a great scientific prior (implying the frequentist estimate/method is worse). Well, smaller width, yes: the prior artificially made the posterior CI smaller. So much for ’embracing uncertainty’. The Bayesian approach here made things more certain, judging by CI width, in this case at least. So I am not convinced a Bayesian approach necessarily embraces more uncertainty than a frequentist approach. Adding to that, “uncertainty” is not the same thing as “probability” IMO. I’d rather embrace probability only and not all types of uncertainty. I’d also rather have large data (n/N close to 1) from well-designed experiments and have likelihoods swamp any priors, when possible.
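To make the width comparison concrete, here is a minimal sketch (all numbers invented; the model in the paper described above is not reproduced here) of how a prior confined to a tight range mechanically shrinks the interval:

```python
import math

# Hypothetical setup: one observation y ~ Normal(theta, sigma = 1),
# with a prior on theta uniform over [1, 3] (the "tight range").
y, sigma = 2.0, 1.0

# Frequentist 95% interval ignores the prior: y +/- 1.96 * sigma
freq_width = 2 * 1.96 * sigma

# Posterior is Normal(y, sigma) truncated to [1, 3]; compute its 95%
# central interval numerically on a grid over the prior's support.
grid = [1 + 2 * i / 10000 for i in range(10001)]
dens = [math.exp(-(t - y) ** 2 / (2 * sigma ** 2)) for t in grid]
total = sum(dens)
cdf, cum = [], 0.0
for d in dens:
    cum += d / total
    cdf.append(cum)
lo = next(t for t, c in zip(grid, cdf) if c >= 0.025)
hi = next(t for t, c in zip(grid, cdf) if c >= 0.975)
bayes_width = hi - lo  # necessarily narrower than freq_width here
```

The posterior interval can never be wider than the prior’s 2-unit support, so the comparison of widths says more about the prior than about “embracing uncertainty.”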

Please let’s not limit the success of significance testing to just engineering and industrial applications. It has also been successful in the other areas you mentioned: social, psychological, environmental, and medical research, as well as areas you didn’t explicitly mention, like survey sampling, law, quantum computing research, and various research by Nobel prize winners. That’s not to say there aren’t QRPs, but there are QRPs with other, less popular methods like Bayesian ones as well, so… clearly a Prior Task Force is the next logical and fair step. If significance testing is used to help win Nobel prizes, I doubt that any “disaster” can be blamed solely on significance testing and thresholds.

For example, Fisher developed highly original mathematics for a test of significance and confidence interval (circle/cone) for observations on a sphere in his “Dispersion on a Sphere.” Fisher solved the statistical questions posed by Runcorn, whose ideas moved us away from “the static, elastic Earth of Jeffreys (a Bayesian – Justin) to a dynamic, convecting planet”.

Some more:

-books “The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century” by Salsburg, and “Creating Modern Probability: Its Mathematics, Physics and Philosophy in Historical Perspective” by von Plato

-“Use of significance test logic by scientists in a novel reasoning task”, by Morey and Hoekstra

-“In defense of P values” by Murtaugh

-“Will the ASA’s Efforts to Improve Statistical Practice be Successful? Some Evidence to the Contrary”, by Hubbard

-“In Praise of the Null Hypothesis Statistical Test”, by Hagen

-“Confessions of a p-value lover”, by Adams, an epidemiologist

-“The case for frequentism in clinical trials” by Whitehead

-“There is still a place for significance testing in clinical trials” by Cook et al

-the quality declined in BASP after these things were banned (see Ricker et al. in “Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban”, and “So you banned p-values, how’s that working out for you?” by Lakens)

-in “A Systematic Review of Bayesian Articles in Psychology: The Last 25 Years” by van de Schoot et al, the popularity of Bayesian analysis has increased since 1990 in psychology articles. However, quantity is not necessarily quality, and they write:

“…31.1% of the articles did not even discuss the priors implemented”

…

“Another 24% of the articles discussed the prior superficially, but did not provide enough information to reproduce the prior settings…”

…

“The discussion about the level of informativeness of the prior varied article-by-article and was only reported in 56.4% of the articles. It appears that definitions categorizing “informative,” “mildly/weakly informative,” and “noninformative” priors is not a settled issue.”

…

“Some level of informative priors was used in 26.7% of the empirical articles. For these articles we feel it is important to report on the source of where the prior information came from. Therefore, it is striking that 34.1% of these articles did not report any information about the source of the prior.”

…

“Based on the wording used by the original authors of the articles, as reported above 30 empirical regression-based articles used an informative prior. Of those, 12 (40%) reported a sensitivity analysis; only three of these articles fully described the sensitivity analysis in their articles (see, e.g., Gajewski et al., 2012; Matzke et al., 2015). Out of the 64 articles that used uninformative priors, 12 (18.8%) articles reported a sensitivity analysis. Of the 73 articles that did not specify the informativeness of their priors, three (4.1%) articles reported that they performed a sensitivity analysis, although none fully described it.”

Speaking of thresholds, at what point do you consider a coin to be not fair in say 100 flips? 99 heads? 90 heads? 80 heads? 67 heads? At least so many heads in so many replications of well-designed experiments? Whatever the sponsor says?
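For reference, the tail probabilities behind those head counts under a fair coin can be computed exactly; a quick stdlib-only sketch:

```python
from math import comb

def binom_upper_p(k, n, p=0.5):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# How surprising would each head count be if the coin were fair?
for heads in (67, 80, 90, 99):
    print(f"{heads} heads in 100 flips: P(X >= {heads}) = "
          f"{binom_upper_p(heads, 100):.2e}")
```

Any cutoff among those counts is still a judgment call; the computation only says how surprising each count would be under fairness, not where “not fair” begins.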

Justin

I am not sure what you are getting at given “I also don’t think that concerns about null hypothesis significance testing should be placed into a Bayesian/frequentist debate, with a framing that the Bayesians are foolish idealists and the frequentists are the practical people.”

Are you disagreeing or agreeing with that?

You’ve mentioned Nobel prize winners before.

Is the tacit assumption that we should emulate their practice? Just by virtue of having a Nobel prize?

“You’ve mentioned Nobel prize winners before.”

I have, because others have mentioned before that significance tests are “bad for science.” For example, implying that such tests are only good for industrial applications, as Gelman has above. (“…whatever its successes in engineering and industrial applications, null hypothesis significance testing has been a disaster in… “)

“Is the tacit assumption that we should emulate their practice? Just by virtue of having a Nobel prize?”

Nobel Prize scientists found significance tests useful in their scientific work and were recognized and rewarded by the scientific community for it. It is not right to say or imply significance tests are “bad for science” when there is plenty of evidence to the contrary.

A small example: Fisher’s significance test helped, for example, move us away from “the static, elastic Earth of Jeffreys to a dynamic, convecting planet”. ;)

http://www.statisticool.com/nobelprize.htm

Cheers,

Justin

Justin, this is just nonsense as others have brought up many times on multiple blog posts. No one ever argued that a significance test was never useful, and it literally has no relevance whether a Nobel prize winner used them or not. I hear there were some racists, rapists, and potentially fraudulent Nobel researchers too. I guess we should emulate them as well. After all, someone with a fancy (arbitrary, politically determined) prize did it. I strongly suspect that you’re amply intelligent to recognize the obvious logical flaws in this argument, so it leaves me to suspect that you’re arguing in bad faith (again), or just trying to promote your website.

This is just circular reasoning.

Person X used method Y, and got a Nobel prize, therefore method Y is good and should be used.

If an approach is found not to be optimal and can be improved upon, it doesn’t matter one bit whether someone at some point used the method and succeeded. Also, getting a Nobel prize shouldn’t be taken as an indication that the method used was great. Maybe a great approach at the time, but not necessarily for all future…

Fans of counterfactual logic would wonder how much better their work could be if they had never heard of p-values….

I’m with Keith, I’m not entirely sure what exactly you’re trying to say here, aside from filling up the comment section with your perennial need to bring up ‘bayes vs. frequentism’ debates. Most readers of this blog, I think, recognize that all methods have their flaws, and are only as good as their assumptions. The real question is which assumptions you’re more comfortable with. Personally, an assumption of infinite repeated sampling doesn’t really sit all that well with me, at least not when I can avoid it or when I know it makes little sense for the problem I’m analyzing. But I publish using Bayes and frequentist methods alike, and feel absolutely no need to be dogmatic about it.

As for your examples about what you see as poorly reported Bayesian articles: I’d suggest you reflect carefully on your own words from a comment further below, where you said: “If people get ‘statistical significance’ and ‘practical significance’ mixed up, then that is an educational/communication/experience issue, not any problem with the statistical methods.” Do you not see how that same statement applies to your own comment? Perhaps there are educational issues, perhaps there are misunderstandings among reviewers, or other issues at hand (e.g., I’ve been explicitly told by editors to REMOVE discussions of priors, ELPDs & LOO, etc., and so I often end up forced to put them in appendices).

I’m willing to say that Bayes and frequentist approaches can both be sensible. In both cases, what we’d like to see in science is a careful treatment of data (no blind, canned procedures), a clear discussion of assumptions, and an awareness that ‘significance’ (whether intuited via p-values, CIs, BFs, ROPE, etc.) of a finding does not mark the end of research or the proof of a hypothesis, but rather only the beginning of the scientific endeavor. Are you willing to join the discussion here on how to meaningfully improve statistics reporting, or will you keep doing the easy thing by propping up a Bayes vs. frequentist debate that I think most of us wish would go away?

“…section with your perennial need to bring up ‘bayes vs. frequentism’ debates.”

I didn’t bring it up. The author of the blog post did, explicitly as well as implicitly, by leaving out other successes of frequentism and NHST, IMO. It should keep being mentioned because almost none of the detailed scrutiny given to NHST is applied to the many proposed alternatives, IMO.

Justin

More trolling, nice Justin :) Plenty of detailed scrutiny has been given to alternatives, you just need to take a little effort to find it. Sure, absolutely more scrutiny is needed as always (esp things like preregistration, Bayes Factors, power analysis), and I don’t think anyone here is arguing against that (except maybe your own strawman of this blog). I know you’ve got all the answers over at statisticool.com though, so I’ll be on the lookout for your comprehensive, groundbreaking work on solving the inferential issues in the NHST paradigm.

I feel like you would save yourself some time if you actually read the post before responding to it.

From the post above:

I also don’t think that concerns about null hypothesis significance testing should be placed into a Bayesian/frequentist debate, with a framing that the Bayesians are foolish idealists and the frequentists are the practical people . . . that might have been the case 50 years ago, but it’s not the case now. As we have repeatedly written, the problem with thresholds is that they are used to finesse real uncertainty, and that’s an issue whether the threshold is based on p-values or posterior probabilities or Bayes factors or whatever.

The way I’ve always seen it is that the impetus for binary decisions and bright-line rules is indicative of more fundamental problems with academia, science, and society. Academics are generally frustrated by the fact that they (and their analyses) are not really in charge, so they reflexively cover up the uncertainty so they can declare positions, such as “X is the right policy” or “Y improves outcomes.” Funders, sponsors, and journalists also like definitive declarations, and will choose whichever positions they wish to promote. This makes it lucrative (financially and reputationally) for all concerned. So there are large forces working to ingrain and enforce binary thinking about evidence.

It would be welcome to see the ASA embrace uncertainty as the standard and eschew binary thinking. Whether that will change anything depends on whether the forces I allude to above will change. I am not optimistic. As the amount of “evidence” expands (exponentially), the complexity of analysis grows, and the time available for careful reflection and humble incremental steps evaporates, it is hard to imagine fighting against these tides. But every step at least helps.

I’m not sure anybody is working to enforce binary thinking; it is just that binary thinking is literally how we naturally understand the world. For example:

Consider “An Investigation of the Laws of Thought” by Boole. He wrote that X**2 = X, or X(1-X) = 0, where X is a set and the operations are set operations, is the fundamental “law of thought.” That is, something in some reference class cannot be in both sets X and 1-X at the same time; a red red balloon is obviously a red balloon.

Boole used this law to compute probabilities and statistics. Boole’s work was modernized and made rigorous by Hailperin in “Boole’s Logic and Probability: Critical Exposition from the Standpoint of Contemporary Algebra, Logic and Probability Theory” in a linear programming context. Also check out “The Last Challenge Problem: George Boole’s Theory of Probability” by Miller.

Basically, there is nothing wrong per se with using categories (thresholds/alphas/etc.); we all do, because it is how we understand the world. Just communicate all assumptions and don’t engage in QRPs.

Cheers,

Justin

I agree that binary thinking is natural – but that is System 1 thinking. It has worked well for eons, when threats were lions and tigers. When we talk about science, however, we are referring to System 2 thinking. Applying binary (System 1) thinking to complex decisions (System 2) is a mistake, however natural it may be. There is something wrong with doing that.

+1

Justin, I think the claim ‘binary thinking is literally how we naturally understand the world” is way too strong. It’s literally one of the ways we naturally understand the world, I’ll give you that! But even simple decisions like, I dunno, where you and your friends choose to go to dinner, usually involve many parameters, and often include implicit or explicit consideration of uncertainties: how long is the line likely to be at place X; given traffic, how long would it take to get to place Y; place Z is new and has good reviews but my friend visited and said it was only fair; etc. etc. If your point is that eventually you have to boil this down to something concrete — meet at place Q at 7:15 — then sure, you need a discrete choice (not binary, in this example). But the decision analysis takes many factors into account, many of them not binary or discrete. And this is true of even everyday decisions.

There is no “natural tendency” to binary thinking in humans or in nature in general.

“How far away is the wildebeest?” This is not a binary problem. The range to the wildebeest is continuous, and any human trying to kill the wildebeest must use a continuous scale to assess the distance to the beast and to gauge the thrust of the throw. That’s after the human has chosen which beast to attempt to slay, after assessing the many animals in the herd on many continuous scales: size? speed? weight? accessibility? Other predators use the same skills.

“what kind of plant is that?” This is not a binary problem either. Likely the average person, today or in pre-history, who knew anything about plants had a multi-level, multi-element classification scheme for them.

The main kind of binary problem solving is the action/no action problem or decision, which comes to bear after all the more complex assessments have been made: the final decision to throw the spear; to eat the plant.

Obviously decision making is binary, but it’s the nature of the problem, not the nature of the mind solving it. Humans solve the problem at hand with the kind of problem solving that’s appropriate for the problem.

It’s cool that frequentist stats work for certain problems. That’s normal in math and nature, certain functions and techniques apply to certain situations, consistent with the assumptions / constraints required and available to solve those problems. It’s also obvious that they don’t work for many many many problems. We should not use them for those problems.

+1 All decision problems are binary or can be rephrased as a binary choice, by definition. A decision problem is just a problem of the nature of “choose __ from __”. And, of course, there is not a decision procedure for all decision problems. But, the fact that we want a decision procedure when given a decision problem isn’t a sign of some pathological thinking. It is exactly what we should do. The problem is that someone wants a general decision procedure for a range of decision problems that are not likely to yield to the same procedure. The editors of a journal should not want a decision procedure that is the same for every submission sent to them no matter the subject area. I wouldn’t frame the problem with null-hypothesis testing as forcing a binary outcome because a decision is a binary outcome, so one is going to have to force a binary outcome no matter what. The problem is that the null-hypothesis test is too general. Why should it work for every field of science no matter the subject matter? Some areas of study have many sources of uncertainty that cannot be easily quantified. It would be better to have many different decision procedures tailored to the subject matter.

“The problem is that someone wants a general decision procedure for a range of decision problems that are not likely to yield to the same procedure. “

+1

+2

“‘How far away is the wildebeest?’ This is not a binary problem. The range to the wildebeest is continuous, and any human trying to kill the wildebeest must use a continuous scale to assess the distance to the beast and to gauge the thrust of the throw. That’s after the human has chosen which beast to attempt to slay, after assessing the many animals in the herd on many continuous scales: size? speed? weight? accessibility? Other predators use the same skills.”

Just saying “far away” implies it is not “near”. So it sets up two categories “things that are near” and “things that are far”. Saying “wildebeest” sets up categories “things that are wildebeests” and “things that are not wildebeests”.

Assessing combinations of all these categories that your brain establishes helps lead to your solution, whether or not what you are measuring (if you are measuring anything at all) is continuous.

For example, you need new shoes. Foot length is continuous, but shoe sizes are not. If you need to buy a shoe, you may measure your foot or you may not measure it at all; you just go in and try on shoes, or try on things in the category “things that are shoes”.

Justin

It’s a great idea that the data for every decision falls neatly into two groups or clusters. But if you test it, it turns out to be wrong.

Choosing shoes clearly isn’t a binary decision! :) You don’t choose your shoe by binary subdivision until you reach the proper size! :) You measure your foot and select the size. You make the buy-or-no-buy decision when you see the price tag and compare it to a) the dollars you have in your account (continuous); b) the fit of the shoes (continuous, multidimensional); c) the comfort of the shoes (continuous, multidimensional); d) the quality of workmanship (continuous, multidimensional); e) the sex appeal of the logo on the side of the shoe….

Suppose I observe a sequence of quarterly earnings from firm A. I want to buy a stake in the company for one year. How do I boil down my stake in the company to a binary decision? If Pr(profitable over the next 10 years) > 0.5, then I should buy 100% of the company, and otherwise 0%? Does a decision like that make sense?

Even when a decision is binary, that doesn’t mean it’s rational to use a binary threshold in probability space either. Suppose someone offers me a $500 put option on a firm expiring in one year. I can either take it or leave it. I construct some binary statistical test using data on firms taken from the same generating distribution: if Pr(make money) > 0.5 I take the put, and otherwise I don’t. But put option returns are unbounded from above. I can have Pr(make money) = 0.75 and Pr(lose a million dollars) = 0.1, and it would still pass my simple threshold. It makes much more sense to construct some kind of logarithmic utility function à la the Kelly criterion and choose to buy or not based on its expected value. Can you think of a good binary threshold to use and explain why it’s similar to the utility-maximizing approach?
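A quick sketch of that contrast (all payoff numbers invented for illustration, not market data): a bet can clear a naive Pr(make money) > 0.5 threshold while a log-utility investor should refuse it:

```python
import math

# Hypothetical payoff distribution for the bet: (probability, profit in $).
# These numbers are made up for illustration only.
outcomes = [(0.75, 100.0), (0.15, -50.0), (0.10, -900.0)]

# Naive binary criterion: probability of ending up ahead
p_make_money = sum(p for p, profit in outcomes if profit > 0)

# Kelly-style criterion: expected log growth of a $1,000 bankroll
bankroll = 1_000.0
expected_log_growth = sum(
    p * math.log((bankroll + profit) / bankroll) for p, profit in outcomes
)

print(p_make_money > 0.5)        # True: passes the probability threshold...
print(expected_log_growth < 0)   # True: ...yet log utility says decline
```

The heavy 10% tail loss dominates the expected log growth even though it barely dents the probability of profit, which is exactly why a threshold on Pr(make money) alone can mislead.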

“Even when a decision is binary, that doesn’t mean it’s rational to use a binary threshold in probability space either.”

+1

I don’t believe this is how we “naturally understand the world”. In conversation we may talk about “tall trees” and “short trees” without implying we or others gain anything by trying to fix a “critical height” separating the two categories.

If any person or institution does try to specify a “critical height” for some purpose, no one should feel obligated to accept or agree with their definition of “short” and “tall”, especially if a decision of short vs tall is going to be used for some later purpose or action.

I think there are two issues that need to be distinguished:

-When to use the word “significant”. It has at least two meanings, and many people get them mixed up. “The results are significant” sure sounds like the effect is big, repeatable, important… So I think that word should simply be banned from science. It is ambiguous and has been misused too much.

-When to make a decision, or reach a conclusion, based on a p-value being smaller than a threshold. This is useful in some contexts, not at all useful in others, and has been overused and misused and has led to bad science and confusion. Guidelines might be helpful. Scientists should know there is more to data analysis than p values. Journals shouldn’t use P<alpha as an acceptance criterion.

If people get “statistical significance” and “practical significance” mixed up, then that is an educational/communication/experience issue, not any problem with the statistical methods.

There is already, say, equivalence testing, which very clearly and explicitly defines practical significance at the outset by having the user set their smallest effect size of interest (SESOI).

Justin
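As a sketch of how equivalence testing operationalizes a SESOI (using a normal approximation and invented estimates; real analyses would typically use a t-based TOST):

```python
from statistics import NormalDist

def tost_equivalence(estimate, se, sesoi, alpha=0.05):
    """Two one-sided tests (TOST) against a smallest effect size of
    interest (SESOI), normal approximation.  Equivalence is declared
    only if the effect is significantly above -SESOI *and*
    significantly below +SESOI."""
    z = NormalDist()
    p_above_lower = 1 - z.cdf((estimate + sesoi) / se)  # H0: effect <= -SESOI
    p_below_upper = 1 - z.cdf((sesoi - estimate) / se)  # H0: effect >= +SESOI
    return max(p_above_lower, p_below_upper) < alpha

# A small estimated effect with a tight standard error, SESOI = 0.5:
print(tost_equivalence(estimate=0.05, se=0.1, sesoi=0.5))  # True
# Same estimate, but a noisy study cannot rule out effects beyond the SESOI:
print(tost_equivalence(estimate=0.05, se=0.4, sesoi=0.5))  # False
```

The point of the SESOI is visible here: the verdict depends on whether the data rule out *practically meaningful* effects, not on whether zero is rejected.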

Justin, education on this issue and other related confusions about statistical significance has been tried for decades to no avail. What makes you think it will be different this time? Do you have a curriculum and empirical evidence that teaching with this curriculum avoids these pervasive confusions as discussed in McShane and Gal (2017) and Gigerenzer (2018)? If you do not, I will wait for you to acquire it before taking your educational angle seriously as the evidence stands starkly against it.

+1

The human brain craves certainty. “Significant” seems to offer it. It is a loaded word, that is often (maybe usually) misunderstood. Of course education is important. But I’d go further. It is quite possible to make a decision based on p<alpha without using the word "significant", and I think science would be much better if that word were never used.

Why do people love to gamble? Why do people love to watch sports contests? The brain craves certainty? Not so much.

Good point.

I’d also add re Harvey’s comment “The human brain craves certainty”: Maybe in its most naive state. But it can be trained to be skeptical of certainty. Mine seems to have gone in that direction (or is it just old age? — as in, “Nothing’s certain except death and taxes.”).

Justin, I think that the duality of “significant” that Harvey refers to is not the one you have in mind. Instead, it is the Fisherian “significant” which means worthy of further investigation and the Neyman-Pearsonian “significant” which is license to “discard the null hypothesis”. You can read more about it in many places, but I will self-servingly point to my recent extensive paper explaining the place of P-values in scientific inference: https://arxiv.org/abs/1910.02042

If this is the same Karen Kafadar who wrote the below columns it seems like these concerns are valid and she chose committee members to validate her point of view:

https://magazine.amstat.org/blog/2019/04/01/pvalues19/

https://magazine.amstat.org/blog/2019/06/01/unintended-consequences/

https://magazine.amstat.org/blog/2019/12/01/kk_dec2019/

I remember reading these at the time and thinking that she was not being fair at all. Fortunately this blog has a very wide audience and ideas can be expressed openly and transparently here!

BenC (why are so many commenters fearful of using their real, full names? A bad sign in and of itself.)

It would indeed be useful to know Task Force opinion on one central issue, at least — the rationale and desirability for verbally dichotomizing the P scale into “significant” and “nonsignificant” based on a fixed alpha or critical P value.

Maybe Karen Kafadar will provide that.

On her Dec 2019 editorial (https://magazine.amstat.org/blog/2019/12/01/kk_dec2019/), I just posted the following comment, and then sent the same to her by email:

This editorial, like much other recent literature and blogging, does not clearly distinguish between:

1) the very LARGE number of statisticians and scientists who would disallow the phrase “statistically significant” but would otherwise allow the full panoply of frequentist (and other) statistical methods that yield P values, as well as allow confidence intervals, power tests, etc.; and

2) the very SMALL number who would ban the calculation and publication of P values, and terms such as “significance test” (or, better, “neoFisherian significance assessment”) or “statistical significance” or “significance level” (as sometimes, albeit inappropriately, used as a synonym for “P value.”)

Hopefully, the first category has “significant” representation on the new Task Force. This will slow down its deliberations markedly, and perhaps quite discombobulate matters. But it is essential if the Task Force’s final report is to be credible and cogent.

You should poll your Task Force members now!

A way to do that with greatest clarity might be to simply ask, “Do you agree or disagree with the recommendations of Hurlbert, S.H., R. Levine and J. Utts. 2019. Coup de grace for a tough, old bull: ‘statistically significant’ expires. The American Statistician 73(sup 1):352-357?”

Maybe you should make the anonymized results of that poll public now…

**************************

Her noncommittal response of a few minutes ago: “Thank you for taking time to comment on my President’s Column. I wasn’t sure if anyone was even reading them! I appreciate your thoughts, positive and negative.”

Stuart Hurlbert said: “BenC (why are so many commenters fearful of using their real, full names? A bad sign in and of itself.)”

I doubt if it’s a matter of being fearful. I think it’s mostly a matter of being informal or lazy. I signed in as just “Martha” until another person signed in as just “Martha”, after which I decided to use “Martha (Smith)” to distinguish myself from other possible Marthas, while keeping some of the informality of using just first names. (My last name really is Smith; I’m not using Smith to indicate anonymity.)

It can simply be a matter of a legal or employment responsibility not to comment publicly.

While the task force has a lot of distinguished members, I am struck by the lack of fresh perspectives in the group. One (certainly imperfect) measure is the date when the members finished their graduate training. While I didn’t carefully check all of the members, I don’t think any finished after 2010, and in a quick look I couldn’t find any who finished after 2000. Most finished before 2000, and many well before 2000. Of course it’s possible to change perspective over a career (for example, the owner of this blog), but it is odd to have so few (even somewhat) recent graduates in the group.

Apropos of nothing: Andrew, twenty years ago I met Kafadar at a conference, and ran into her at the airport on the way home. I told her about our work on statistical artifacts in parameter maps (http://www.stat.columbia.edu/~gelman/research/published/allmaps.pdf ) and she was encouraging of more work in the subject, which is part of what led us — you and me (and Chia-yu Lin) — to write “A method for quantifying artefacts in mapping methods illustrated by application to headbanging” (http://www.stat.columbia.edu/~gelman/research/published/headbanging.pdf ). Indeed, I see we cited Kafadar in that paper.

One dodge which comes up in software engineering is to sometimes sidestep deep top-down dilemmas and focus on what’s “real” bottom-up.

One way to do this is to focus on education. What exactly should beginning and early students of statistics know about all this? Seriously. I have many statistics textbooks on my shelf, and I find them all wanting. On the other hand, whenever I’ve had a conversation with a statistics teacher or colleague, I come away from the conversation knowing more than before.

Why don’t I get that same effect from reading statistics pedagogy?

For one thing, I think pragmatics gets lost in the textbooks, but is in full bloom in conversations.

“whatever its successes in engineering and industrial applications, null hypothesis significance testing has been a disaster in areas like social, psychological, environmental, and medical research—not the perspective that it’s basically a good idea that just needs a little bit of tinkering to apply to such inexact sciences.”

I’ve been teaching a statistical modeling class for engineers and this has been something I’ve run up against as I try to move away from the NHST paradigm. Especially in the case of engineering and industrial applications, we are often faced with what is ultimately a binary decision to make–we either include the variable in the model or not, we either institute the treatment or not, etc. Thresholding our model estimates in order to transform the results into discrete actionable decisions is a necessary thing to do.

Now I understand why this is certainly not a satisfactory procedure for discovering the answers to theoretical questions in the fields you mention, but many of these theoretical questions are of interest to us because we wish to implement them in practice– we need to map our data into discrete action spaces as in engineering and industry–either you get the chemo or not, either we tax carbon emissions or not, etc. How can we effectively accomplish this without recourse to some kind of thresholding?

“… but many of these theoretical questions are of interest to us because we wish to implement them in practice– we need to map our data into discrete action spaces as in engineering and industry–either you get the chemo or not, either we tax carbon emissions or not, etc. How can we effectively accomplish this without recourse to some kind of thresholding?”

The thresholding needs to be done carefully, with input from various stakeholders, and often with caveats on the decision-makers’ priorities (which might be different for different stakeholders). Transparency is really important, and “conclusions” need to be clearly stated as being dependent on the priorities and assumptions that go into the decision. This kind of transparency seems to be rare.

Really “either you get the chemo or not” and “either you tax carbon emissions or not”???

**BOTH** of those are dose-dependent and can be restated as:

give dose X of chemo

tax carbon emissions at rate $x/megagram

where x = 0 is one of a continuum of possibilities.

In theory, yes; in practice, this just doesn’t happen. We don’t have a simple exponential transform to replace tax brackets. I don’t think oncologists dose chemo on a continuous scale — at least not until after a minimal effective dose has been given, i.e., you’ve passed a discrete threshold.

I’m currently working on a system that recommends carbohydrate treatments for people with diabetes. How well do you think recommending a 0.2 g/kg bodyweight carbohydrate treatment would go? Can you figure out a way to get someone to eat 17.236 g of carbs?

Maybe the radical bomb throwers should organize their own committee. Try to reach consensus on an alternative set of guidelines. That would put pressure on the ASA group and give journal editors and others a second set of standards they could adopt.

.. and next, they’re going to put together a Task Force on another useful tool, the hammer, citing its utility in building houses but not in swatting mosquitoes. As Andrew (I think) implies, we should be looking for the best way to study X, and not focusing on a when and where to use a specific tool.

+1

Maybe the radical bomb-throwers should form their own panel and come up with their own recommendations. That would put some pressure on the ASA team and give journal editors a second choice of standards to implement.

(I tried to post this earlier from my phone, sorry if it’s a repeat.)

Man, if only the social sciences had some sort of knowledge base for making complex decisions based on diverse viewpoints and evidence-based conclusions, we wouldn’t have to resort to antiquated, “gatekeeper” institutions like task forces created by boards….

Here’s an idea: get 5 scholars who represent 5 distinct points of view across the spectrum, give them the resources to assemble the best empirical and rational arguments for their respective positions, publish their papers, iterate the process a couple of times. Have the 5 scholars or their proxies collaborate on a single paper establishing core principles, including a minority report. Create a manual. Revise the manual every four or five years. Boom.

you left out the dueling with weapons of their choosing

Also, given that a key contributor to the replication crisis is journal practices like favoring findings that are novel and surprising and shelving results with large p-values, is a publisher of ~8 journals really the best organization to be organizing this?

ALSO also, speaking of open science, the NAS report they reference can be obtained electronically for the low, low price of $55!

Michael:

I really don’t trust the National Academy of Sciences. I’m sure they do some good things, but when it comes to the replication crisis, it seems that their main role is a sort of incumbency protection, trying to advance the interests of the already-powerful people in the scientific establishment.

Andrew, I shared this accounting analogy with you many years ago, but here goes again.

Accountants have debated two different ways of accounting for a firm’s assets. The traditional way is to record the asset at its original cost when purchased and then reduce its value over time using systematic but fairly arbitrary methods. This is like NHST in that it has no sound theoretical basis — it is a mix of theories that are rather incompatible, just as NHST relies on some weird combination of Neyman-Pearson + Fisher. But everyone understands how it works, and while it leaves reporters some degrees of freedom, everyone understands the limits of the number. It hides a lot of uncertainty, not of the binary sort, but of the ‘this is how we calculate the number, so let’s just figure out how to use what everyone understands’ sort.

The alternative is fair value accounting, where every period you update the asset’s recorded value to what it would be sold for in a well-functioning market. This is theoretically sound, like Bayesian methods, but it involves a lot more work and judgment, no one agrees on exactly the right way to deal with it, and frankly, revealing the uncertainty is viewed as more a bug than a feature.

Fair value methods are being incorporated into accounting in a fairly slow but steady way–it gets used where people see the old methods as being weakest, and fair value the most manageable. But as people see fair value is used in these places, they start getting more comfortable with seeing them elsewhere.

So here’s my question: can NHST be replaced with Bayesian methods in a handful of places that will lay the groundwork for broader acceptance elsewhere?

It’s worth thinking more about who/what “the ASA” is in contexts like this.

When should “principles and practices” agreed upon by a small group (even with input from others that is harder to document) be officially endorsed by the organization or labeled as an “ASA statement”? There may be potential benefits to “the ASA” claiming ownership, but there are also dangers — particularly if statisticians who are members of the ASA do not agree, or at the very least feel uncomfortable with wording or recommendations. I realize there is a precedent with the Statement on P-values and I believe it comes from good intentions– but it is worth pausing to consider potential implications.

“agreed upon by a small group”

This sure looks like a slow-motion train wreck, just beginning. Picking the participants from the top is a mistake given the circumstances; they should have been nominated by their peers to represent different schools of thought.

Another classic mistake is the vague remit. After endless meetings that go around and around on the same topics, the final result will be a top-down epistemological summary with no practical value (consider alternatives! establish confidence intervals!), followed by endless bickering about definitions of terms and other minutiae. That is not intended to say anything about statisticians; the result of a vague committee remit would be the same in any field of academia.

I have noticed that statisticians tend to all repeat a certain phrase: “I don’t believe in canned methods.” NHST is a canned method, which is exactly why it became so widespread. While it makes sense that statisticians are wary of having the same thing play out with a different method, the alternative is so bad that it is not really an alternative. That would be to just say that there is no substitute for a deep understanding of statistical theory, because only then will you be able to use the best method under all circumstances.

It is incumbent upon the field of statistics to define research methods that work well with specific types of research questions. While some here have tried, I have seen no convincing arguments that there are so many methods that they cannot be summarized. If there really is no way to isolate a single treatment and its effect on human behavior, then that should be clearly stated. How would a statistician study himmicanes? If there is no agreement on the question, the future looks dark.

Perhaps a generic, largely mechanical (simulation-based) method to assess the performance of almost any proposed method of statistical analysis might reduce the risk of investigators using canned methods to acceptable levels (or at least minimize the laundering of true uncertainties).

Sure use a canned method, but undertake a bespoke assessment of its performance informed by the investigator’s sense of benefits and costs.

Work is ongoing in Bayesian workflow that can be adapted to frequentist methods and _should_ enable this fairly widely.

Some day ;-)
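That kind of bespoke performance check can be largely mechanized today. A minimal sketch (distribution, sample size, and replication count all invented for illustration): use a canned 95% t-interval for a mean, but simulate from a plausible non-normal generating process to see what coverage it actually delivers:

```python
import math
import random
import statistics

random.seed(1)

# True mean of a lognormal(0, 1) distribution is exp(0.5).
true_mean = math.exp(0.5)
n, reps, t_crit = 10, 2000, 2.262   # t_crit is t_{0.975} for df = 9

covered = 0
for _ in range(reps):
    x = [random.lognormvariate(0, 1) for _ in range(n)]
    m, s = statistics.mean(x), statistics.stdev(x)
    half = t_crit * s / math.sqrt(n)
    if m - half <= true_mean <= m + half:
        covered += 1

# For skewed data at this sample size, the canned interval's actual
# coverage typically falls noticeably short of the nominal 0.95.
print(covered / reps)
```

The same template works for any canned method: simulate from generating processes the investigator considers realistic, then measure the operating characteristics that matter for their costs and benefits.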

We kind of need binary decisions in medicine – should we use this treatment or not…

But you should never make those decisions on the basis of the answer to the question “would it be rare for a call to rnorm to produce my data?”, which is usually the meaning of a small p value.
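That “rare for rnorm” reading of a p-value can be made literal by simulation; a minimal sketch with invented observations, asking how often standard-normal draws produce a sample mean as far from zero as the one observed:

```python
import random
import statistics

random.seed(0)

data = [0.8, 1.1, 0.3, 0.9, 1.4, 0.6]   # invented observations
observed = statistics.mean(data)
n = len(data)

# "Would it be rare for draws from a standard normal to produce a
# sample mean this extreme?"  Estimate the tail probability directly.
reps = 20_000
extreme = sum(
    abs(statistics.mean([random.gauss(0, 1) for _ in range(n)])) >= abs(observed)
    for _ in range(reps)
)
print(extreme / reps)   # small -> "rare under rnorm", i.e. a small p-value
```

Spelled out this way, it is clear the number answers a question about a hypothetical random number generator, not the question "which treatment should I use?"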

Indeed, and in my mind the whole thing about “having to make decisions” is misconstrued. Yes, we have to make decisions, and sometimes those are binary decisions — so let us make those decisions! Outsourcing this to a p-value is THE OPPOSITE of making a decision: making a decision implies that you are doing something, weighing evidence, utilities, and what not. Doing what a random number — which is what a p-value essentially is — tells you is the exact opposite of decision making and nothing but flipism* with extra steps.

https://en.wikipedia.org/wiki/Flipism

I have the choice between treatment A and treatment B. A properly conducted RCT with the null hypothesis of no difference between the 2 treatments returns a small p value, suggesting that the difference between the 2 treatments is not zero. Would it not be reasonable then to choose the treatment with the better outcome? A binary decision based on a p-value.

Nick:

It depends on what is known about treatments A and B. If you have no knowledge distinguishing them (as for example would arise if the labels A and B were themselves assigned randomly), and you have to choose just one, then the right decision is to choose the one that did better in the trial, period. Any statistical calculations arise when trying to decide whether to decide now or to gather more data, or when there is additional information on efficacy or cost of the treatments, or if one of the treatments is the incumbent and there is a cost to changing, or if you have general prior information that new treatments are probably worse, or probably better, or if you have data on risks as well as efficacy, or . . .
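One way to make the switching-cost point concrete (all numbers, including the posterior summary, population size, and cost, invented for illustration): compare an expected-net-gain rule with a z-threshold rule applied to the same information.

```python
from statistics import NormalDist

# Invented posterior for the new treatment's per-patient benefit over
# the incumbent (units are arbitrary; think of them as QALYs, say).
post = NormalDist(mu=0.3, sigma=0.25)
n_patients = 1000          # hypothetical population affected
switch_cost = 150.0        # one-time cost of changing, same units

expected_net_gain = post.mean * n_patients - switch_cost
decision_by_utility = expected_net_gain > 0

# A p-value-style rule instead asks whether zero is "ruled out":
decision_by_threshold = post.mean / post.stdev > 1.96   # z = 1.2 here

print(decision_by_utility)    # True: switching is worth it in expectation
print(decision_by_threshold)  # False: "not significant", so stay put
```

With these numbers the two rules disagree: the expected gain comfortably exceeds the switching cost, yet the threshold rule sticks with the incumbent because the benefit is not “significant.”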

All true.

Good trials are carried out when there is equipoise – we are really not sure enough which treatment is better. Good trials have endpoints that are relevant and include benefits and harms.

Good treatment decisions include costs and the principle of social justice.

My point is, that within this sort of rational, inclusive framework there is nothing wrong with a decision based on a p-value threshold.

But what do you need the p-value for in this case? If you have absolutely no other information to go on your best bet would be the treatment that worked better in the trial, regardless of the p-value.

Hans:

Yes, exactly.

No. If the p-value is large then we would say we are still uncertain whether one treatment is better than the other. Either further research is needed or, if the treatment comparison has been sufficiently tested, then we conclude the treatments are practically equivalent.

The idea is to not be led down the garden path. As Andrew mentions, there is a cost associated with replacing an existing treatment with a new one. We don’t want to adopt a new treatment without sufficient evidence that it is indeed better. Without an evidential threshold we would be like children running through a maze – “Let’s go this way.” “No! Let’s go that way!”

Actually, even in this case it is more complicated. Virtually all treatments have side effects. So, at a minimum you will need to choose between treatments that differ in terms of effectiveness (may differ, given the evidence) and in terms of side effects (may differ, given the evidence). Either way, there is still uncertainty that must be considered. But I agree that the p value is of little use in making the decision.

+1 to Dale’s comment

I think a better example is when you have two studies on a treatment: one study shows A is better than B, the other shows B is better than A, but while the first study has p = 0.01, the other has p = 0.2. Obviously there are perhaps better ways to do the meta-analysis, but if those are the summary statistics available to you, it seems reasonable to claim that A looks better than B on the totality of the evidence.
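Under the simplifying assumption that the two studies deserve equal weight, Stouffer’s method gives one way to combine exactly these summaries (the sign convention here is invented: positive favors A):

```python
from math import sqrt
from statistics import NormalDist

def signed_z(p_two_sided, favors_A):
    """Two-sided p-value plus a direction -> signed z (positive favors A)."""
    mag = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return mag if favors_A else -mag

# Study 1: A beats B with p = 0.01.  Study 2: B beats A with p = 0.2.
z1 = signed_z(0.01, favors_A=True)    # about +2.58
z2 = signed_z(0.20, favors_A=False)   # about -1.28

# Stouffer's method with equal weights:
z_combined = (z1 + z2) / sqrt(2)
print(round(z_combined, 2))   # about 0.92
```

The combined z is about 0.9: the pooled evidence leans toward A but is far from decisive, which matches the comment’s reading of the two studies and shows why neither study’s binary verdict settles the question on its own.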

> If you have no knowledge distinguishing them (as for example would arise if the labels A and B were themselves assigned randomly), and you have to choose just one, then the right decision is to choose the one that did better in the trial, period.

As long as you mean “did better” to mean “provided the best utility”.

Treatment A for a skin cancer is to cut it off with a knife.

Treatment B for a skin cancer is to ablate it with a $1B laser.

Treatment B results in a lower recurrence rate. Obviously we should spend $5,000B to get 5,000 of these lasers, one for each dermatologist on a big list, right?

Daniel:

I said “If you have no knowledge distinguishing them…”. You gave an example in which you do have knowledge distinguishing them. You know that treatment A is free (or, in the US context, a few thousand dollars to pay for the doctor’s time and the paperwork), while treatment B costs a billion dollars. That’s relevant knowledge!

Sure, but when it comes to making decisions about real-world problems there are LOTS of ways for people to frame the decision as being somehow in a vacuum or a particular context or whatever so that they can then force the decision to go the way they want.

The truth is we *always* have background information which is relevant, and the posterior probability that rawmeasureA(ThingX) > rawmeasureA(ThingY) is almost never the truly relevant decision criterion.

Daniel:

Yes, I agree. My point in that comment was that even in an idealized, unrealistic, situation of no prior information, it would not make sense to make the decision using the p-value. When prior info is available, it’s even more clear that the p-value is not a good decision tool.

Ah, I didn’t understand. Yes, you’re right: if literally *all* you have is the average outcome in two groups, just use the one with the better average, regardless of the p value. In fact the p value is just biasing you towards the “incumbent” treatment, a fact that is not lost on “incumbent” makers of billions of dollars of profit.

Nick, In any RCT there are MULTIPLE endpoints relating to both efficacy and potential negative side effects, with different estimated effect sizes for each. No decision about treatment of patients or about marketing decisions by, e.g., pharmaceutical companies will be made on the basis of a single P-value. And the ultimate decision should be the same regardless of whether the individual P-values are labeled (e.g. significant, non-significant) or not.

We can pretend to dream up “objective” recipes, but in fact what action is taken by a decision-maker (e.g. a doctor or VP for marketing or FDA reviewer) must be based on a SUBJECTIVE weighing of all the evidence from many tests.

+1

I am curious what fraction of this task force would endorse each of the six principles from the 2016 ASA Statement on Statistical Significance and P-Values, and in particular what that fraction would be for Principle 3 (Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold). I am guessing each of these would be far from unanimous.

The ASA would do great harm to itself and the profession as a whole if in a just few short years they reverse or pull back on their own six modest principles.

It’s worth reading Overtreated by Shannon Brownlee, Code Blue by Mike Magee, and An American Sickness by Elisabeth Rosenthal. Riveting accounts.

Suppose study A reports a significant effect of X on Y after controlling for Z, while study B reports that the effect of X on Y is not significant after controlling for Z. Would any semi-serious consumer of research interpret this as a logical contradiction? They might describe these as contradictory findings, but surely they would realize this is not a logical contradiction. But if hypothesis testing appears to transform uncertainty into certainty, wouldn’t inconsistent, apparent certainties make for an apparent logical contradiction? The fact that nobody really thinks this suggests that these research consumers understand, at some level, that results of hypothesis tests are themselves uncertain, and that the transformation involved is merely from the quantitative to the qualitative, and not from the uncertain to the certain.

I think the most common response to your scenario, from both lay people and experts, is to reject either study A or study B as being wrong or poorly done. That is, I think the desire for conforming certainty is so strong, that ambiguous conclusions are rejected.

Well, if the quality of the studies differs, it’s reasonable to weight the evidence they present differently. If the claim is that when the quality of the studies doesn’t actually differ, people will pretend that it does, that seems like an empirical claim, and one whose prevalence would vary across research contexts (variation I suppose we should “embrace”). And in any case, if the desire for certainty is so strong that people will delude themselves about what’s actually going on in the studies they’re reading about, I guess I don’t see how moving off of hypothesis testing to any other approach is going to fix this. Nor do I understand what causal role the use of hypothesis testing is supposed to play in generating these outcomes. It would seem if we just presented interval or distribution estimates people would find some other arbitrary way of collapsing the wavefunction, so to speak.

Dale said, “I think the desire for conforming certainty is so strong, that ambiguous conclusions are rejected.”

Ram said, “And in any case, if the desire for certainty is so strong that people will delude themselves about what’s actually going on in the studies they’re reading about, I guess I don’t see how moving off of hypothesis testing to any other approach is going to fix this.”

Both points seem reasonable to me. My conclusion: We really, really need to focus on teaching about and preaching about the importance of accepting uncertainty as part of real life. I think this needs to start early. Musing on why I am more than averagely accepting of uncertainty, four items in my high school learning come to mind:

Learning about limits and asymptotes in calculus: You can talk about the limit, it’s a real concept — but that doesn’t mean you’ll necessarily ever “get” there.

A course in World Literature: it helped give me become aware of how culture can affect how someone looks at something. (This was furthered by a freshman college course in anthropology.)

Learning the basics of probability — a little in Advanced Algebra, and more in an NSF-funded summer program.

A third semester of physics my senior year in high school, including an introduction to quantum mechanics and relativity. It used a textbook of the physics-for-liberal-arts-majors variety, but it had an impact.

My conclusion: We (as a society) need to start teaching students to accept uncertainty at least as early as high school — maybe even earlier.

Apologies for poor proofreading before clicking:

” it helped give me become aware of ” –> it helped me become aware of

I wish that the ASA had simply come out clearly denying that Wasserstein, Schirm, and Lazar (2019) was a continuation of the 2016 policy document on P-values, and, more specifically, that it does not endorse it, rather than creating a “task force”. This route creates a tension that might have been avoided. ASA, insofar as it takes up these issues, should be a forum for discussion on different views especially when, as Wasserstein et al., 2019 admit, there is disagreement. Else, people are pressured to go along to get along, encouraging the very attitude that leads to uncritical statistics. (Wasserstein suggested, in a recent talk, that he hopes this committee will just adopt his recommendations, which is bound to put pressure on them.) I say that there’s so much that is unclear regarding key issues–notably, whether to have a “threshold” or not–that taking a hard (binary) line (e.g., “no”), forcing the FDA to abandon its analysis of certain drug trials, is a bad idea. To me, unless it’s stipulated in advance that some results will not be allowed to be construed as evidence for a given claim, then you don’t have a test of the claim. That doesn’t mean we use them in an unthinking, recipe-like manner.

In my “P values thresholds” editorial, I say

https://onlinelibrary.wiley.com/doi/10.1111/eci.13170

It might be assumed I would agree to “retire significance” since I often claim “the crude dichotomy of ‘pass/fail’ or ‘significant or not’ will scarcely do” and because I reformulate tests so as to “determine the magnitudes (and directions) of any statistical discrepancies warranted, and the limits to any substantive claims you may be entitled to infer from the statistical ones.”… We should not confuse prespecifying minimal thresholds in each test, which I would uphold, with fixing a value to habitually use (which I would not). N‐P tests call for the practitioner to balance error probabilities according to context, not rigidly fix a value like .05. Nor does having a minimal P‐value threshold mean we do not report the attained P‐value: we should, and N‐P agreed!

I am very glad you raised this issue. Can you point to any evidence of confusion that the Wasserstein, Schirm, and Lazar (2019) editorial letter introducing the TAS special issue was anything more than an editorial letter, whether an official ASA statement or, as you nicely put it, “a continuation of the 2016 policy document on P-values”?

I know you have asserted it is, calling it “ASA II” in some of your blog posts. However, I am unaware of anyone — Wasserstein, Schirm, Lazar or anyone else for that matter — claiming any kind of official status for it or demonstrating confusion about it.

If there is evidence of widespread confusion, I agree a clarification should be issued but I am at present unaware of such evidence.

I would respond to Mayo by asking these questions:

1. Which specific ASA leaders would have the right to speak for all of the ASA and not endorse the personal statement of Wasserstein et al. (2019)? That would in effect constitute those leaders saying the ASA rejects, inter alia, the proposal to halt verbal dichotomization of the P scale.

2. Don’t you greatly overestimate the degree to which scientists, even young ones, will yield to political pressure and go along to get along? On the other hand, the majority of statisticians, both young and old, don’t know the historical literature in enough depth to assess many key issues critically.

3. The indefensibility of “taking a hard binary line” has been understood for half a century by anyone highly attentive to the literature, practice by the masses of statisticians and scientists notwithstanding.

4. Disallowing the term “statistically significant” (as recommended by many) would not force the FDA to do anything other than modify its language.

5. As a philosopher, and not a scientist who has to write up experiments involving many different response variables and statistical procedures, aren’t you, by focusing only on simple examples, greatly underestimating how much elaborate statistical baggage your proposal for “reformulating tests” would create in manuscripts? I think those “reformulated tests” would generally be asking questions not relevant to the objectives of the study or of much interest to the researchers (or editors!).

6. You say you favor “prespecifying minimal thresholds in each test” but not “fixing a value to habitually use.” So if you have a study where a dozen different statistical tests (or significance assessments) are conducted, maybe the threshold for some might be set at .001, for others at .02, and for the rest at .05. Have you seriously considered the consequences of doing that? First, I believe you are a fan of the fashion of adjusting for multiplicities the critical threshold for claiming significance, but all current methods for doing so involve obtaining a single adjusted threshold that is used for all individual comparisons or tests. (That would be a good reason for you to oppose all those methodologies; we should put you on the new task force!)

Second, by using a variable threshold, the meaning of “statistically significant” (a term you still regard as useful and necessary) will vary from one paragraph or section of your ms to another.

Third, this could not help but lead to very confusing and unclear writing. A philosopher sitting on the sidelines might only be bemused, but the editors and reviewers (if the ms were even sent out for review) would not be.
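To make point 6 concrete, here is a minimal sketch (with invented p-values, not from any study in this thread) of why current multiplicity corrections sit awkwardly with per-test thresholds: a standard correction such as Bonferroni yields one adjusted threshold shared by every test in the family.

```python
# Minimal sketch with made-up p-values: a Bonferroni-style multiplicity
# adjustment produces a SINGLE family-wide threshold, alpha / m, applied
# identically to every test -- not a different prespecified threshold
# for each one.
p_values = [0.001, 0.02, 0.04, 0.30]  # invented results of four tests
alpha = 0.05

m = len(p_values)
bonferroni_threshold = alpha / m  # 0.05 / 4 = 0.0125, shared by all tests

flagged = [p for p in p_values if p < bonferroni_threshold]
print(bonferroni_threshold, flagged)  # 0.0125 [0.001]
```

Note that 0.02 and 0.04 would pass an unadjusted .05 cutoff but fail the single shared one; nothing in the standard machinery assigns each test its own prespecified threshold.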

I enjoy and value the discussions I have had and expect to continue to have with Dr. Mayo about these matters.

However, just for the record, I’m fairly certain that I did NOT say that I hoped the task force would just adopt my recommendations. (Eventually the video of my talk will be placed online and that will settle the matter.) Regardless, the reality is that the task force has a whole range of options available. One of these many options is to endorse one or more of the numerous recommendations of WSL 2019 and the 43 other articles in that special issue of The American Statistician.

The idea that task force members are pressured by a statement that I may not even have made, at a small conference they did not attend, is unreasonable.

To close the loop on my comment above: The video of my talk is posted. I am relieved to report that I said that the task force was going to consider what it thought the ASA might be ready to endorse, and that “it could be the work that we did in that editorial, and it may very well not be.” I did not say I hoped they would adopt them. (See about 34 minutes into the recording at https://www.youtube.com/watch?v=QucLsumQ3n0&list=PL-mariB2b6NugvvjAFeAjK-_-Y6wXCkvM&index=24&t=0s.)

I’m glad that occasionally I actually express what I intended to convey. :-)

Anon:

To limit myself to a single link (else this comment will get stalled), this Dec. post should do:

https://errorstatistics.com/2019/12/13/les-stats-cest-moi-we-take-that-step-here-adopt-our-fav-word-or-phil-stat/

This is 9 months after Wasserstein et al., 2019 was published.

President Kafadar:

“Many of you have written of instances in which authors and journal editors—and even some ASA members—have mistakenly assumed this editorial represented ASA policy. The mistake is understandable: The editorial was coauthored by an official of the ASA.”(Kafadar)

In addition, it’s how it was written:

“The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of ‘statistical significance’ be abandoned.”[2] “We take that step here.”

See also Note 4, and other notes.

I distinguish the 2019 statement from the “P value Project”. All uses of ASA II on my blog will be updated appropriately.

Thank you so much for your kind response. I appreciate your taking the time to engage and for pointing me to your blog post.

If I am understanding you correctly, then, your empirical basis for the claim of widespread confusion that the Wasserstein, Schirm, and Lazar (2019) editorial letter is an official ASA statement amounts to (i) your naming it “ASA II” on your blog and (ii) Karen Kafadar claiming to have received some written correspondence reporting confusion. Forgive me if I retain the null of “no confusion” until more data come in.

For what it’s worth, I view your claim that “Wasserstein et al. 2019 describes itself as a continuation of the ASA 2016 Statement on P-values” based on the two consecutive sentences you quote above and in your blogpost as a tortured reading. When I read the editorial back in April, it read to me like

“They [ASA] said that there. We are saying this here.”

not as trying to link the “we” and the “they” as a “continuation.”

I suppose reasonable people might disagree about these two sentences, but here I think context helps. First, two of the three authors of the editorial have no formal ASA status (beyond perhaps being members or fellows, but not officials like the third). Further, see an extended quotation from the editorial letter below that seems to me to make clear that the editorial is not “ASA II” as you call it, even noting “the ideas in this editorial are likewise open to debate.” I honestly do not see how a reasonable person would come to see the editorial letter as an official ASA document unless it explicitly claimed to be so.

#####

Wasserstein, Schirm, and Lazar (2019)

“The papers in this issue propose many new ideas, ideas that in our determination as editors merited publication to enable broader consideration and debate. The ideas in this editorial are likewise open to debate. They are our own attempt to distill the wisdom of the many voices in this issue into an essence of good statistical practice as we currently see it: some do’s for teaching, doing research, and informing decisions.

…

The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.”

Sheesh, even ASA I created confusion! Another statistician told me that ASA I meant that “ASA recommended not using significance levels at all for scientific experiments.”

Justin

Justin, like others above in this thread and elsewhere on this blog, I really do wish you would read before commenting.

In this subthread, Professor Mayo has been kind enough to engage in a discussion about whether or not there is evidence of widespread confusion about the status of the WSL 2019 editorial letter, that is, whether it is a mere editorial letter as it bills itself or an official ASA statement. She puts forth her own labeling of it as “ASA II,” as well as hearsay, as evidence of confusion that it is an official statement. She has also excerpted two consecutive sentences in WSL 2019 as a potential explanation for this confusion.

In response, I have suggested this does not constitute strong evidence and that the two sentences imply no such thing when read alone, especially when read in the context of the entire letter and the fact that two of the three authors have no ASA affiliation.

Regardless, we here are discussing the status rather than the content of WSL 2019.

You raise an issue regarding the content, rather than the status, of a different document. Specifically, you suggest a single person is confused over the content of the 2016 ASA Statement, which you label ASA I.

Again, this discussion here is about the status of WSL 2019 and I submit no reasonable reader of this document could come to the conclusion that it represents an official ASA policy statement.

As to any issues around the content of the 2016 ASA statement that you raise, that is an entirely different matter not germane to this discussion. That said, I think both Professor Mayo and I, and all reasonable people, would unambiguously conclude the 2016 ASA statement is indeed an official ASA policy statement, as it labels itself as such.

Yes, ASA 2016 was an official report, or policy statement, put together by a large ASA committee. It deals mostly with factual matters and has some unclear passages (e.g., the ambiguous claim that a P of .05 is “weak evidence” of an effect), but it does not presume to dictate to ms authors.

Watch how all of a sudden everyone will figure out that rejecting the null hypothesis does not mean accepting the research hypothesis. It was surprisingly found that active smokers are ~5-100x less likely to be diagnosed with the coronavirus:

https://onlinelibrary.wiley.com/doi/pdf/10.1111/all.14238

Turns out the same thing was true for SARS. They just kept “adjusting” until it was no longer significant:

https://www.researchgate.net/publication/254000504_Smoking_and_Severe_Acute_Respiratory_Syndrome

Statistical significance and misinterpreting the coefficients of arbitrary statistical models is literally killing people and shutting down entire countries.

Really? That’s how you interpret the paper?

Of the non-smokers in the dataset, 50% of them were hospital workers. Meanwhile 10% of the smokers were hospital workers. You would absolutely adjust for this confounding variable.

The problem is you can “adjust” in millions of different ways, each giving a different result. It is an arbitrary number that depends on what data you had available and what model you choose to use.

You can adjust in a million different ways sure, but some ways are better than others. It’s blindingly obvious that to be infected by SARS, you must first be in the proximity of the virus. And it’s well documented that SARS spread rapidly amongst healthcare workers in hospitals where SARS patients were being treated. Healthcare workers smoke a lot less than the general population. How can you propose to ignore this?

I don’t propose to ignore it. The best you can say is that the data are too confounded to conclude anything.

But since there are at least two papers reporting very low numbers of smokers for the current virus, same as reported for SARS, it seems their “adjustments” lead to the wrong conclusion (they also misinterpret lack of statistical significance on top).

https://onlinelibrary.wiley.com/doi/pdf/10.1111/all.14238

https://www.thelancet.com/journals/lanres/article/PIIS2213-2600(20)30079-5/fulltext

Stand back and look at the situation. Approximately, you have two groups of people being tested – one is non-smoking hospital workers treating patients with SARS, the other is smokers who came in off the streets worried about their cough. It turns out the former group has a lot more chance of coming up positive.

Put all the statistics aside. Can you reasonably believe that this is good evidence smoking protects you from SARS in some mysterious and unexplained way, or is it fairly obvious that something else is going on?

No, not good evidence alone. But the study was run based on reports that very few patients were smokers, and it supports those reports. Since then it has been discovered that smokers have altered ACE2 expression in their respiratory tract, which could affect susceptibility to the virus. So I don’t think there is anything magical going on.

My other post has links for all that so is held up in the spam filter.

“Hey look these reports suggest the same thing” is not a rigorous meta-analysis.

Here’s a study directly examining risk factors in MERS-Cov that found smoking to increase infection risk:

https://wwwnc.cdc.gov/eid/article/22/1/15-1340_article

Here’s a meta-analysis that identifies two studies connecting smoking and increased coronavirus mortality.

https://link.springer.com/article/10.1186/s12889-018-5484-8

Smokers are 30% of the population but 2% of the patients, smoking is said to alter expression of the receptor for the virus in the respiratory tract, and a deficit in smokers was also observed for a very similar virus 2 decades ago.

I’d conclude they should get more data and see what’s going on here. Seems likely to me it is protective.

Makes sense since MERS isn’t supposed to depend on ACE2:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6357155/

So now we have the further evidence that this phenomenon seems limited to viruses that enter the cell via ACE2, as predicted.

There are other papers pointing out that smoking upregulates ACE2 and therefore is likely to increase susceptibility to these viruses. In any case, a methodology that compares hospital self-reported smoking prevalence among a subset of the infected to national survey aggregate averages, absent any sort of control, is fundamentally unsound. If you think your methodology is that compelling, why don’t you write a paper on it and see if you can get it through review. I’m tired of continuing with this.

If you were being honest you would give the link. In fact I already cited one like that, but actually they reported that while it was higher overall it was also expressed in different cell types: https://www.medrxiv.org/content/10.1101/2020.02.05.20020107v2

The truth is that many “health experts” would rather people get severe pneumonia and possibly die from the virus than even have this possibility that smoking has a health benefit be investigated properly.

That is why this evidence is all the more compelling, though of course not definitive. It is surprising it got out there at all.

My other post didn’t appear yet, but I wanted to add:

Also, tobacco smoke seems to affect the expression of the protein the virus uses to gain entry to the cell (ACE2).

https://www.medrxiv.org/content/10.1101/2020.02.05.20020107v2

https://www.ncbi.nlm.nih.gov/pubmed/30088946

Here is a good example where they managed to get 600 million different coefficients:

https://statmodeling.stat.columbia.edu/2019/08/01/the-garden-of-forking-paths/

So what’s your personal decision rule?

Let’s say you get migraines a few times a year. The only thing that helps is a pill which lessens but doesn’t eliminate the symptoms and it doesn’t always work. It costs $10 and has no major side effects.

Now there’s a new pill. No major side effects, costs $20. There’s been a clinical trial which indicates it works more often and improves the symptoms to a greater extent.

If the clinical trial involved three people and the pill worked better for two of them, I don’t think any of us would be convinced to switch on that evidence.

If the new pill were on the market a year and it completely cured the migraines for 999 out of 1,000 people who tried it I think we would all try the new pill.

But the question is, what exactly would you consider convincing, numerical, statistical evidence that would make you pay a little extra for more pain relief?

That’s even too complicated. :) No one has yet even responded to my simpler question:

“Speaking of thresholds, at what point do you consider a coin to be not fair in say 100 flips? 99 heads? 90 heads? 80 heads? 67 heads? At least so many heads in so many replications of well-designed experiments? Whatever the sponsor says?”

Justin

I’ll respond to your question. I am not willing to answer it without more context. There may be circumstances where I would need 90 heads, others where 95 are needed, and yet others where 60 will do. Without some idea of the costs and benefits of my decisions (as well as factors such as additional information that may become available, (ir)reversibility of decisions, etc.), I cannot give a single answer to this question. And that is the point, I believe. Embracing uncertainty means recognizing that no bright-line threshold is universally useful. It does not mean that thresholds aren’t useful or needed in particular circumstances. But without those circumstances, a threshold is counterproductive.

+1

Just to note that this is a version of the hoary sorites paradox, one example of which is how many hairs make a beard?

“how many hairs make a beard?”

42

I think that’s pretty well established.

“No one has yet even responded to my simpler question:”

What number do you ensure with absolute certainty is a fraudulent number of heads?

I believe the correct answer is 101, right?

As the number of heads increases, the likelihood that the coin has been tampered with also rises. But with only 100 flips it can never be assured. The answer to your question becomes a context-dependent value judgement rather than a simple probability problem: at what point does the benefit of prosecuting for tampering outweigh the risk of a false conviction?
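As a small illustration of that point (a sketch added here, not part of the thread’s exchange): under a fair coin, every head count has a computable tail probability, and those probabilities never reach zero for any achievable count, so the cutoff has to come from the context, not from the arithmetic.

```python
import math

def tail_prob(n, k):
    """P(at least k heads in n flips of a fair coin), computed exactly."""
    return sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n

# More heads -> smaller tail probability, but never zero for k <= 100.
for k in (60, 67, 80, 90, 99):
    print(k, tail_prob(100, k))

# Only an impossible count (e.g., 101 heads in 100 flips) has probability 0.
```

Even 60 heads already has a tail probability below .05 under fairness; whether that (or 67, or 90) is “enough” is exactly the value judgment the comment describes.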

A philosophical note on binary thinking. Yes, binary thinking is natural. Children at a young age learn to say, No! But when we get down to serious thinking, binary thinking is problematic. Many things are questions of degree. Between black and white are shades of gray. Binary thinking has difficulty with change, as it cannot explain how A becomes not-A. Ancient Chinese philosophy had the binaries of Yin and Yang, but they were not static: old Yin became Yang and old Yang became Yin. We are used to the idea that a proposition is either true or false, and that if it is false that P is false, then P is true. This type of logic may be traced back to the ancient Greek Stoics. We may call such logic Aristotelian, but it is not. Aristotle, for instance, thought that no statements about the future were true: neither “It will rain tomorrow” nor “It will not rain tomorrow” was true, according to him. Indian philosophers regarded some propositions as neither true nor not true. Today we have multivalued logics, including fuzzy logic, and Bayesian probability belongs here as well, since it assigns probabilities to propositions.

Some on this thread may be unaware of one cause of the recent flurry of conversations on these topics: our direct proselytizing, last April and May, of 247 (non-statistical) editorial boards across the natural, behavioral, and social sciences with a very simple suggestion, as described below:

*********************

A Novel Petition to Editorial Boards: Disallow “Statistically Significant”

Stuart H. Hurlbert, Richard Levine and Jessica Utts, April 2019

Editorial boards need to step up to the plate

We proposed in our 2019 paper, Coup de grace for a tough old bull: “statistically significant” expires (attached), that disallowance of the phrase “statistically significant” in the reporting of statistical analyses was a simple, well-understood, and long-needed reform, and that its implementation did not have to wait for professional statisticians and textbook authors to get up to speed but rather could be initiated by alert editorial boards of journals in all disciplines. So, following publication of Coup, we sent to the editorial boards of 247 journals (list given below) 1) a copy of Coup, 2) a cover letter to 2-30 editors for each journal, similar to that below for the Journal of Agricultural Science, and 3) our “modest proposal” (below), wherein we briefly summarize the strong support for the reform. We then left everyone in peace.

_________________________

Date: Wed, May 22, 2019

Subject: Will the Journal of Agricultural Science disallow “statistically significant”?

To: BilsborrowPaul , WisemanJulian

Dear Paul and Julian,

This letter follows up on our earlier correspondence on the verbal dichotomization of the P-scale.

Please find attached a two-page document recommending that editors and editorial boards disallow use of the phrase “statistically significant” in scientific writing. This complements the recent commentary in Nature titled “Retire Statistical Significance” in a very practical way.

Our proposal is an old idea that finally has gained great support from those scientists and statisticians most knowledgeable about the historical literature on null hypothesis significance testing. Big improvements in the quality of the interpretation and reporting of statistical results are needed throughout the sciences. But we don’t have to wait for all scientists to become smarter about statistics, if editors are willing to formalize a minor commonsensical guideline for their authors.

Please do pass this message on to your full editorial board and invite them into the discussion.

Even if the Journal of Agricultural Science declines to implement a formal guideline disallowing “statistically significant,” this short two-pager and its linked articles will at least allow individual editors an understanding of the longstanding and unrefuted arguments in favor of doing so.

With best regards, Stuart H. Hurlbert

***************

A modest proposal: disallow “statistically significant”

Dear Editors,

We write to present a modest proposal for your consideration: that in the instructions to authors for journals under your editorial care it be stated that authors should refrain from use of the phrase “statistically significant.”

If adopted this proposal would have a salutary effect on the reporting and interpretation of statistical analyses. It is time to finally make operational a suggestion made in 1960 by psychologist H.J. Eysenck (Psych. Rev. 67, 269-271):

“It is customary to take arbitrary p values, such as .05 and .01 and use them to dichotomize this continuum into a significant and an insignificant portion. This habit has no obvious advantage, if what is intended is merely a restatement of the probability values these are already given in any case and are far more precise than a simple dichotomous statement. … If the verbal dichotomous scale is not satisfactory – as it clearly is not – the answer surely is to keep to the continuous p scale, rather than subdivide the verbal scale.”

THE PROPOSAL

Our proposal is stated and explained in our short, new, open-access paper, “Coup de Grâce for a Tough Old Bull: ‘Statistically Significant’ Expires” (https://doi.org/10.1080/00031305.2018.1543616), which has just been published in a 401-page special issue of the widely read journal The American Statistician (TAS), titled “Statistical Inference in the 21st Century: A World Beyond p < 0.05” (https://tandfonline.com/toc/utas20/current?nav=tocList):

“We propose that in research articles all use of the phrase ‘statistically significant’ and closely related terms (‘nonsignificant,’ ‘significant at p = 0.xxx,’ ‘marginally significant,’ etc.) be disallowed on the solid grounds long existing in the literature. Just present the p-values without labeling or categorizing them. … For a journal an additional ‘instruction to authors’ could read something like the following: There is now wide agreement among many statisticians who have studied the issue that for reporting of statistical tests yielding p-values it is illogical and inappropriate to dichotomize the p-scale and describe results as ‘significant’ and ‘nonsignificant’. Authors are strongly discouraged from continuing this never-justified practice that originated from confusions in the early history of modern statistics.”

We hope you will regard this proposal worth discussion amongst yourselves and your editorial boards or at least their most numerate members. There is no need for us to be consulted further on the matter though we would be interested to hear of any final decisions for action (or inaction) you may take.

BACKGROUND AND EARLY SUPPORT

A substantive background on this and related issues can be obtained from the 38-page 2009 review article that was the foundation for “Coup” (Hurlbert, S. H., and Lombardi, C. M. (2009), “Final Collapse of the Neyman-Pearson Decision-Theoretic Framework and Rise of the NeoFisherian,” Annales Zoologici Fennici, 46, 311–349, http://www.annzool.net/) and, for more recent literature, from a 2016 14-pager (Greenland, S., Senn, S. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G. (2016), “Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations,” European Journal of Epidemiology, 31, 337–350). True cognoscenti, of course, will want to at least browse the whole 401-page TAS special issue!

There already exists strong support for our proposal among both statisticians and other scientists with close knowledge of the historical literature on null hypothesis significance testing

While “Coup” was under review 48 statisticians and scientists with strong credentials from ten countries endorsed our specific proposal. They are listed in the supplemental materials (Appendix A) for “Coup”.

Three of those statistician endorsers also have just published a somewhat wider ranging, short commentary in Nature that cites and strongly supports our proposal. That commentary has been endorsed by 854 statisticians and other scientists from 52 countries. (Amrhein, V., Greenland, S., and McShane, B. (2019) Scientists rise up against statistical significance. Nature 567: 305-307. https://www.nature.com/articles/d41586-019-00857-9 )

The three editors of the TAS special issue in their introductory editorial are very strong in their support: “The [2016] ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely.” (Wasserstein, R., Lazar, N., and Schirm, A. (2019), “Editorial: Moving to a world beyond p<0.05”, The American Statistician, 73(S1): 1-19. https://tandfonline.com/doi/full/10.1080/00031305.2019.1583913).

Finally, a strong majority of the authors of the other 43 articles in the TAS special issue are clearly on board with our neoFisherian proposal though these authors are mostly focused on suggesting additional ways, new and old, to improve statistical practice and reporting.

In conclusion, our proposal stipulates only a simple matter of logic and language, and does not argue either for or against particular methodologies. The only complaint editors and editorial boards are likely to get if they adopt it is how hard it is, at first, to “walk again” without the deceptive crutch of “statistically significant.”

We wish you well in your deliberations.

Stuart Hurlbert, Department of Biology, San Diego State University, hurlbert@sdsu.edu

Richard Levine, Department of Mathematics and Statistics, San Diego State University, rlevine@sdsu.edu

Jessica Utts, Department of Statistics, University of California, Irvine, jutts@uci.edu

______________________

P.S. Some of the material above has been put online as one of the 8 counter-comments following an April 4 article by J.P.A. Ioannidis in the Journal of the American Medical Association, “The Importance of Predefined Rules and Prespecified Statistical Analyses: Do Not Abandon Significance,” https://jamanetwork.com/journals/jama/fullarticle/2730486?appid=scweb&alert=article

**********************

Journals and societies invited to join the disallow “statistically significant” movement

JOURNALS

American Journal of Epidemiology

American Naturalist

Animal Behaviour

Annals of Epidemiology

Annales Zoologici Fennici

Austral Ecology

Biological Conservation

British Medical Journal

Canadian Journal of Fisheries and Aquatic Sciences

Cell

Conservation Biology

Ecologia Austral

Journal of Agricultural Science

Journal of the American Veterinary Medical Association

Journal of Biopharmaceutical Statistics

Journal of Educational Data Mining

Journal of Learning Analytics

Nature

New England Journal of Medicine

New Zealand Journal of Marine and Freshwater Research

Oecologia

Oikos

Restoration Ecology

Science

The Lancet


SOCIETIES

These societies publish 2+ journals each, as indicated in ( ). (These journals are not listed individually, but our letter went to at least several editors and editorial board members for each individual journal in most cases.)

Alliance of Crop, Soil and Environmental Science Societies (13)

American Anthropological Association (20)

American Association of Pharmaceutical Scientists (2)

American Economic Association (8)

American Meteorological Society (12)

American Ornithological Society (2)

American Physiological Society (14)

American Psychological Association (ca. 90)

American Society for Microbiology (17)

American Sociological Association (8)

Association for the Sciences of Limnology and Oceanography (4)

British Ecological Society (6)

Ecological Society of America (5)

Entomological Society of America (9)

Journals of the American Medical Association Network (13)

Society for Environmental Toxicology and Chemistry (2)

The Wildlife Society (3)

******************

Thanks for the info.

> “one cause of the recent flurry of conversations on these topics:”

Hmm, the loss of intellectual dominance in an area by a majority/minority in the statistical discipline.

I have often thought many in the statistical discipline were overly worried about areas of their dominance that others were moving in on, and not focused enough on others’ areas of dominance they could move into.

“Editorial boards need to step up to the plate”

The implication here is “statisticians have done their due diligence in reaching a consensus on eliminating statistical significance, now you need to do yours.”

But to the recipients, it looks like “you need to stop doing what you have always done and do something different. (And if you don’t do it right, we reserve the right to rip you to shreds.)”

I have seen the future, and it looks like this:

1. Saying “statistical significance” is taboo, so the colleagues of the new graduate student will be the ones who have to tell her that she will still only get published if p < .05.

2. Each journal publishes a couple of papers a month that do not even calculate p. In the vast majority of cases, the peer reviewers did not have the slightest idea what the statistical treatment was doing, but expressing their ignorance was not an option, so they just recommended approval.

3. Certain researchers develop a feel for alternative statistical approaches, and become prominent in their field for it. Once their profile rises above the masses, a real statistician publishes a paper lambasting the alternative approach (what is good isn't new, and what is new isn't good.) Go back to No. 1.

4. After many tumultuous years, the field of endeavor finally settles upon standard methods that no longer get criticized. New graduate students are advised to use the standard methods that have been accepted in their field, just as they are now.

I guess my point is that statisticians could simply address No. 4 now, describing standardized research questions in each field and recommending boilerplate statistical approaches. Then we could skip a lot of the tumult. That would be a heck of a lot of work, and statisticians would have to do it. I am dubious that due diligence had been done!

> describing standardized research questions in each field and recommending boilerplate statistical approaches

Here’s the boilerplate statistical approach, it’s quite simple:

Understand the scientific problem you have, encode the problem into a mathematical model, and sample from that model.

It’s just as simple as the boilerplate for solving a business computing problem:

Understand the business computing problem you have, choose a computer language that expresses those sorts of problems in an expressive manner and has reasonable performance for your problem, and code the solution in that language.

> “Understand the scientific problem you have, encode the problem into a mathematical model, and sample from that model.”

Well said – but I believe only a minority of statisticians can do that well, at least given current grad school training, and almost no non-statistician scientists at all.

However, many might be able to appreciate the simulation-based parts of a Bayesian workflow, which can show them how their understanding of the scientific problem is flawed, along with ways to improve it, and how an analysis model suggested by a statistician for their data would perform on relevant fake data sets they can create or have created for them. Or should I say a generative-model workflow, which would allow something similar for frequentist techniques as well.

Very little math is required beyond understanding that abstract representations called probability models can repeatedly generate collectives of unknown parameters* and fake data, and that this can be used to understand the performance of a proposed analysis on their data set. This allows them to assess adequacy without having to learn how to make adequate analyses on their own: that is, not having to take a statistician’s word that an analysis is adequate, but being able to assess adequacy independently.

* Some may wish to think of the collectives of unknown parameters as plausibilities.
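That kind of fake-data assessment can be sketched in a few lines of simulation code. This is only an illustration under assumptions I have made up (two groups, unit-variance normal noise, a difference-in-means analysis), not any particular author’s workflow:

```python
import numpy as np

rng = np.random.default_rng(1)

def fake_study(true_effect, n):
    """One fake data set from an assumed generative model:
    control and treated groups of size n with unit-variance noise."""
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    return control, treated

def proposed_analysis(control, treated):
    """The analysis being assessed: the difference in sample means."""
    return treated.mean() - control.mean()

# Repeat many times to see how the proposed analysis performs
# on data sets like the one the investigator actually has.
estimates = np.array([proposed_analysis(*fake_study(0.3, 50))
                      for _ in range(2000)])
print(f"mean estimate {estimates.mean():.2f}, spread {estimates.std():.2f}")
```

An investigator who runs something like this can see for themselves whether the analysis recovers the assumed effect and how noisy it is, without taking anyone’s word for it.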

My point was to suggest, tongue in cheek, that it’s just not possible to can statistical procedures in the general case. Sure, if you have to solve a simple problem over and over, you can can that. But this isn’t the case for real research, any more than it’s the case for real business computing problems. Hence the trillions of dollars spent over the last two decades on writing and customizing software that solves all sorts of specific problems for individual businesses.

Matt, I offer some counterpoints:

1. There is no implication that statisticians collectively have done “due diligence” or reached “consensus,” as only a small fraction of them have studied the historical literature in any depth.

2. If someone can’t defend what they or others “have always done,” then perhaps they should stop doing it. Of course there will always be those who will try “to rip you to shreds” even if you can strongly defend your position.

3. “New graduate students” should take with a grain of salt any statistical advice from “their colleagues.”

4. Yes, indeed, the literature reflects that lots of authors, reviewers and editors are not up to the jobs they are tasked with. And lots of good papers have no need of P values. A decade ago a Belgian ecologist editor w/ poor training in statistics rejected a major paper of mine on trends in fish and bird populations that contained a lot of descriptive statistics but no significance assessments. The paper got two very favorable reviews from anonymous reviewers. But the editor demanded we do some “hypothesis tests” and “present some P values,” though he had no suggestions as to what parts of the long paper he wanted these in. We refused, he rejected the paper, and many colleagues, in sympathy and w/ mss already accepted by the editor, withdrew their mss, and we all published together in a special issue of another journal.

5. “Real statistician”? I’m not one; I only have a degree in ecology. But I, along with my not-“real statistician” colleagues and students, have never had any difficulty in finding (and documenting in publications) gross statistical errors in papers authored and/or approved by “real statisticians.”

6. “standard methods.” There’s no question that in this area most statisticians and most other scientists “follow the crowd” in many ways. They’re not likely to be the ones that improve a field.

Matt:

Your “I have seen the future” story does not match my experience.

“Your “I have seen the future” story does not match my experience.”

I am very curious as to the post-NHST world you anticipate. That goes for Stuart as well. Not what should happen, that has been made clear to me by many of the folks here (and I am on board FWIW), but what will happen.

In any event, I have made my prediction, right or wrong, I won’t belabor it any more. I’m just going to grab some popcorn and see what happens.

Matt:

I have no predictions. What I meant is that your statements were not consistent with what I’ve seen in the past.


Because you wrote, “I have seen the future, and it looks like this,” I assumed you were describing your experiences, or those of others you’ve spoken with. What I’m saying is that these were not my experiences:

1. You wrote, “she will still only get published if p < .05.” I’ve published lots of things without any p-values at all.

2. You wrote, “Each journal publishes a couple of papers a month that do not even calculate p. In the vast majority of cases, the peer reviewers did not have the slightest idea what the statistical treatment was doing…”. First, this contradicts what you wrote in item 1. Second, assuming I’ve been one of the authors of the “couple of papers a month that do not even calculate p,” I think the peer reviewers did know what I was doing.

3. I’ve published lots of applied papers without p-values. But nobody came along and lambasted my approach, sending me back to p-values.

4. My field of endeavor (political science) has not “settled upon standard methods that no longer get criticized.” People continue to use and develop new methods. Yes, there’s lots of bad work published in our field, work that uses statistics to come to misleading conclusions—that’s something that can be done with or without p-values!—but our graduate students are encouraged to innovate, not merely to use standard methods.

So, the future that you have seen, is nothing like the past I’ve experienced.

I’m pretty sure no one is going to question your competence to choose statistical methods, though, Andrew. I mean, they may have an opinion that one method or another is better, but they won’t basically say, “this is all crap, where are the p-values? This is all probably just random noise!”

On the other hand, when Marcia Martin, 3rd-year grad student, publishes her n=43 survey of whatever, and her advisor has no specialty in statistics and no one on the paper has a track record of much statistical expertise, but they’ve all been reading your blog and thinking hard about how to do a good job… and they spring some of that newfangled stuff on some 65-year-old reviewers who last had a stats course in 1981, what will the reviewers say to her?

Hi, all,

I’m a bit dismayed that in 2020 an opportunity is again missed, in this case with the “ASA Task Force on Statistical Significance and Replicability”. The missed opportunity: the focus on (or biased target of) statistical significance. Why not something like the “ASA Task Force on Statistical Inference and Replicability”, or the “ASA Task Force on Learning from Data and Replicability”, or the like?

It just seems to me that in trying to be helpful in clarifying what statistical significance is or is not, they are (unintentionally? unwillingly?) going back to reifying statistical significance as something worth reifying.

Just saying…

Jose:

Good point!

+1

Here’s what I want to know after having spent (a lot of) time with my youngest son’s textbooks: why does my state include teaching box and whisker plots as part of its pre-middle school curriculum and why are 7th graders in algebra given (appalling) lessons about statistical significance before they’re taught how to solve quadratic equations by completing the square? Somewhere out there is a group of people hellbent on propagating statistical overconfidence. Anyway, with Martha’s notes in hand I’m trying to impart some knowledge about it all to the little guy. Further dispatches from the front to follow as time permits.

The ASA has long had an interest in the promotion of statistics in middle and high schools but I’m not up to date on current efforts or materials. As with Thanatos, my interest was heightened when my son was in high school. With a small nudge from me he tried to sign up for an intro stats course, only to be told by a counselor not to do so as the course was designed to help the not-so-smart students pass a math requirement.

Ironically, for all students a GOOD intro stats course should be essential and more useful than, say, calculus, as every day the popular media contain news articles based on statistical evidence of risks or benefits from something or other, or of opinion on all sorts of political, economic and social issues. Teachers with a strong background in math but not in statistics are probably too often assigned to teach stats courses. The same is perhaps true in some community or junior colleges.

I was once invited to be a speaker at a ceremony where high school students were being given awards for (possibly Science Fair) projects on statistics. In thirty minutes, with simple figures AND NO MATH, I did a pretty good job, I think, for the entire audience (lots of parents, non-statistician teachers, etc.) of explaining the core concepts of exptl design (experimental unit, randomization, replication, blocking) in the context of an ag expt, the nature of pseudoreplication, and the principles underlying t-tests. The collective student reaction seemed to be, “Well, duh! That’s common sense.” I would argue that any intro stats course in a university should start out the same way. Unfortunately too many mathematically oriented statisticians treat exptl design as an afterthought or ‘special topic’ — ASA needs a new task force just to get the horse back in front of the cart!

> impart some knowledge about it all to the little guy.

Careful, the teacher may fail him for not being able to grasp the “correct” answers ;-)

Generally, I believe it is just the culture of most who teach statistics, in university and elsewhere: display a lot of definiteness and certainty about the material you are teaching. Being forthright about things being more or less sensible (not a binary right/wrong) actually upsets students. For one among many reasons, they know their evaluation will largely be based on their answers to questions taken to have right or wrong answers.

Thinking it would be of value to all interested parties and helpful in assessing beforehand the likely success and cogency of the new task force’s final report, I sent the following suggestion to all 15 members of the task force:

Dear members of the new ASA task force:

There is wide interest in the statistical community in the spectrum of philosophical opinion within your task force on two specific issues. If each of you would respond to the two questions below, I will keep your individual responses confidential but report back to you and others the number of Yes’s and No’s on each question.

FIRST, at the present moment do you generally agree with this statement by Wasserstein, Schirm & Lazar in their 2019 editorial: “The [2016] ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of ‘statistical significance’ be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term ‘statistically significant’ entirely”?

SECOND, at the present moment do you generally agree that when multiple statistical tests or comparisons are conducted in the context of a single study or experiment, the alpha or critical P value used to assess significance in the individual tests must be adjusted to achieve a fixed set-wise or family-wise type I error rate, with the set defined as either all the tests or a subset of them?

*****************

One person voted, and then the task force co-chairs nixed the effort:

Dear Stuart,

Thank you for reaching out to us. Currently our task force is working to provide a statement of consensus that we will submit to the ASA Board. We recognize the importance of the charge to the task force and the diverse opinions in our community regarding the use of P-values. We will keep track of all the comments we receive, but will not be sharing individual opinions until we complete our work. We appreciate your understanding.

Yours, Xuming (on behalf of Linda Young and Xuming He; co-chairs of the task force)

*******************

To this I responded, w/ cc’s to entire task force:

Dear Xuming and Linda,

I appreciate your desire for a little privacy for your deliberations as the task before you is, perhaps, an order of magnitude more difficult than that of the first task force — and your report is due in November!

On the other hand, the view “from the outside” is that task force members are mostly senior people with a lot of teaching and research experience and pretty settled views on even controversial topics. And a task force of 15 professionals is not likely to want to solicit much outside opinion or to want any sort of popularity contest determining the recommendations of your report.

But that means the pre-existing philosophical stances of the task force members on the two questions I posed will, to a large extent, have already predetermined the general nature of your report. This is inevitable. In the interests of transparency and a positive reception of the report, it would seem desirable for the initial general philosophical stance of the task force to be known to outsiders at the outset of the process, not just at the end. I suggested to Karen Kafadar that she supervise a poll like this herself, but she demurred. Participating in it would in no way bind a task force member to stick with his/her initial preference.

But I’m flogging a dead horse I guess.

As someone who has been investigating, teaching, dealing with editors and publishing on these issues decades before any initiatives on the part of ASA, my initial contribution to the task force is to suggest it take into account the following publications by myself and colleagues which have challenged various statistical “traditions” (all attached):

Hurlbert,S.H. and C.M. Lombardi. 2009. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici 46:311-349.

Hurlbert, S.H. 2009. The ancient black art and transdisciplinary extent of pseudoreplication. Journal of Comparative Psychology 123: 434-443.

Lombardi, C.M. and S.H. Hurlbert, 2009. Misprescription and misuse of one-tailed tests. Austral Ecology 34:447-468

Hurlbert, S.H. and C.M. Lombardi. 2012. Lopsided reasoning on lopsided tests and multiple comparisons. Australian and New Zealand Journal of Statistics 54:23-42.

Hurlbert, S.H., R. Levine and J. Utts. 2019. Coup de grace for a tough, old bull: “statistically significant” expires. The American Statistician 73(sup 1):352-357.

*****************

Stay tuned (if it’s a slow day and you have patience!)

> just not possible to can statistical procedure

Fully agree.

My point was mainly about the futility of canned assessment methods for determining the adequacy of claims that statistical analyses will be adequate for the purposes/needs of an investigator. On the other hand, a good statistical workflow will be bespoke to the particular purposes/needs of an investigator. I am hopeful investigators will be better able to learn how to do those, as they are mainly simulation-based and largely mechanical rather than mathematical exercises, at least until a need for improvement in a proposed analysis is recognized. But then they will know it is currently inadequate for their purposes/needs.

It seems to me that all of the contributions to this discussion are missing an important distinction. There’s a big difference between the case when a professional statistician is involved and the far more common case in which an experimenter calculates p values themselves.

It is the latter case where most problems occur and where change in practice is most urgent. And it was to the amateur p-value calculator that I addressed section 2 “How Can Statistical Practice by Users and Journals Be Changed?” in my TAS2019 contribution:

https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529622

There can, surely, be no doubt about the following statement.

“The standard approach in teaching, of stressing the formal definition of a p-value while warning against its misinterpretation, has simply been an abysmal failure.” (Sellke, Bayarri and Berger 2001, p. 71)

In TAS2019, only 7 of the 40+ contributions were classified under the heading ‘Supplementing or Replacing p’. If any of these proposals is to have the slightest chance of gaining wide acceptance, it has to be easy to explain and easy to calculate. I liked Robert Matthews’ AnCred suggestion, but I fear that it’s too complicated for amateurs.

In cases where it’s appropriate to test a point null (a familiar idea to almost all users) then the proposal by Benjamin and Berger (https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1543135 ) is perhaps the simplest.

“When reporting a p-value, p, in a test of the null hypothesis H0 versus an alternative H1, also report that the data-based odds of H1 being true to H0 being true are at most 1/[−ep log p], where log is the natural logarithm and e is its constant base.”

“Determine and report your prior odds of H1 to H0 (i.e., the odds of the hypotheses being true prior to seeing the data), and derive and report the final (posterior) odds of H1 to H0, which are the prior odds multiplied by the data-based odds. Alternatively, report that the final (posterior) odds are at most the prior odds multiplied by 1/[−ep log p].”

This is very simple to calculate, though not so easy to understand for amateurs.
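Indeed, the bound quoted above is a one-liner. A minimal sketch (the function names are mine, not Benjamin and Berger’s):

```python
import math

def max_odds_h1_h0(p):
    """Benjamin & Berger's bound on the data-based odds of H1 to H0:
    at most 1/(-e * p * ln p).  Meaningful for p < 1/e (about 0.37)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound requires 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))

def max_posterior_odds(prior_odds, p):
    """Posterior odds of H1 to H0: at most prior odds times the bound."""
    return prior_odds * max_odds_h1_h0(p)

for p in (0.05, 0.01, 0.005):
    print(f"p = {p}: odds of H1 to H0 at most {max_odds_h1_h0(p):.1f} to 1")
```

For p = 0.05 the bound works out to roughly 2.5 to 1, far short of the 19 to 1 that the 0.05 threshold is often taken to imply.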

Ideally one would like to supplement the p value and CI with an estimate of Pr(H0 | p = p_obs), where p_obs is the observed p value. This is ideal because it is what many, even most, users still seem to think the p value represents. That makes it easy to understand for users. The problem is, of course, that there is no unambiguous way to calculate it.

Even in cases where it’s sensible to test a point null, there is still an infinitude of ways to specify the alternative hypothesis. The recommendation of Benjamin and Berger is, in many ways, similar to mine. I suggest testing the point null against the best supported alternative (a simple alternative in the notation of Held & Ott, 2018). This has the advantage that the Bayes’ factor is simply the likelihood ratio, and this is an entirely frequentist concept, independent of the prior. This makes it much easier to explain to amateurs than the more general Bayes’ factor.

Here is the compromise that I favour at the moment (https://royalsocietypublishing.org/doi/10.1098/rsos.190819#d3e786 ).

“I would be quite happy for people to report, along with the p-value and confidence intervals, the likelihood ratio, L_10, that gives the odds in favour of there being a real effect, relative to there being no true effect. That is a frequentist measure and it measures the evidence that is provided by the experiment.”

“If these odds are expressed as a probability, rather than as odds, we could cite, rather than L_10, the corresponding probability 1/(1 + L_10). I suggest that a sensible notation for this probability is FPR_50, because it can, in Bayesian context, be interpreted as the False Positive Risk (FPR) when you assume a prior probability of 0.5. But because it depends only on the likelihood ratio, there is no necessity to interpret it in that way, and it would save a lot of argument if one did not.”

“I think that the question boils down to a choice —do you prefer an ‘exact’ calculation of something that cannot answer your question (the p-value), or a rough estimate of something that can answer your question (the FPR). I prefer the latter.”
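The conversion from likelihood ratio to FPR_50 described above is trivial to compute. A minimal sketch (the function name is mine):

```python
def fpr50(l10):
    """Convert a likelihood ratio L_10 in favour of a real effect into
    FPR_50 = 1/(1 + L_10), interpretable as the false positive risk
    when you assume prior odds of 1 (prior probability 0.5)."""
    return 1.0 / (1.0 + l10)

# A likelihood ratio of about 3, roughly what p near 0.05 supplies:
print(fpr50(3.0))   # 0.25
# Odds of 19:1, the strength p < 0.05 is often assumed to provide:
print(fpr50(19.0))  # 0.05
```

So an experiment whose evidence amounts to odds of about 3:1 carries a false positive risk of about 25% under even prior odds, consistent with the 0.2 to 0.3 range cited above.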

Although my suggestion and Berger’s are based on different definitions of the alternative hypothesis, the good thing is that their predictions of the false positive risk are sufficiently close that they would lead to much the same conclusions in practice (see Table 1 and Figure 3 in https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529622 ). Valen Johnson’s UMPBT approach, though based on yet another way of defining the problem, also leads to similar conclusions. All these approaches lead one to conclude that if you observe p = 0.05 in a single unbiased, well-powered experiment and conclude that your result is unlikely to be a result of chance alone, the probability that you are wrong, the false positive risk, is between 0.2 and 0.3. And if the prior odds on H1 are lower than one, it could be much higher.

The major reason for the fact that users are confused about what to do about p values is, I think, that the statistical community is perceived as being preoccupied with internal battles and has been unable to coalesce around a simple recommendation about what should be done even in a simple case like comparing the means of two independent samples. No doubt it is too much to hope that they could compromise sufficiently to recommend to amateurs procedures like Berger’s or mine.

“This has the advantage that the Bayes’ factor is simply the likelihood ratio, and this is an entirely frequentist concept, independent of the prior. This makes it much easier to explain to amateurs than the more general Bayes’ factor.”

And for what value is the BF something like ‘statistically significant’? 3? 3.01? 15?

Back to square one.

Justin

@JustinSmith

Good question. It will certainly give addicts of p < 0.05 pause for thought if they realise that when they observe a p value close to 0.05 then the likelihood ratio in favour of there being a real effect is only about 3:1 when they had wrongly assumed that the odds were 19:1.

Nevertheless, amateur calculators of p values (who are responsible for a large majority of them) are not accustomed to thinking in terms of odds. That's why I advocate expressing the likelihood ratio as a probability, 1/(1 + L_10). I suggest that a sensible notation for this probability is FPR_50, because it can, in a Bayesian context, be interpreted as the False Positive Risk, FPR = Pr(H0 | p = p_obs), when you assume a prior probability of 0.5. Admittedly there is no unique way to calculate the FPR, but it does have the huge advantage that it tells you what most amateurs still think the p value tells you, so very little adjustment of thought is required.

You say " for what value is the BF something like ‘statistically significant’?" Like most people here, I think that the term ‘statistically significant’ should never be used. Just give the p value, the CI and the FPR_50, and let the reader decide what to make of them. Of course you could also describe your own assessment in words. For example, if you had observed p = 0.047 in a well-powered experiment, so FPR_50 = 0.26, you might say something like the following. Despite the fact that p < 0.05 the probability that your results are just chance is at least 26%. Therefore the results are no more than suggestive, and more experiments would be needed to confirm whether or not the effect is real or just chance.

I wonder if it’s an interesting thought experiment to consider:

What about writing a journal paper *with no analysis at all*? Just a methods section, and then tables and tables of raw measurements. The reader, we suppose, can then look at the paper, and make whatever decision they wish taking into account all the factors they desire.

Do we consider such a paper to be publishable, desirable even?

I’d argue that the answer is, in the majority of cases, surely not. We expect papers to include not just the data, but some degree of interpretive expertise from the authors. There has to be some degree of distillation of results, some degree of guidance in terms of what a reader *ought* to be taking notice of and what should be considered less important. Statistical significance alone is obviously bad, but p-values and similar summary metrics can be usefully part of that narrative.

I’m very hesitant to believe that it’s enough just to ‘present the uncertainty’ and suppose the reader will embrace it correctly. A smart reader might be okay with that, but a smart reader wouldn’t be tricked into bad decisions by “p < 0.05” either. I think we can’t just ignore the people who are like “I Just Want Nate Silver to Tell Me It’s All Going to Be Fine”…

This sounds not terribly different than a Registered Report — and these work well!

https://www.nature.com/articles/d41586-019-02674-6

It wouldn’t be at all fair on the reader to leave all the analysis to them. And that would be no more desirable for a Registered Report than for any other paper.

I think I was unclear. I was responding to this:

> What about writing a journal paper *with no analysis at all*? Just a methods section

When a registered report is reviewed and accepted, that perfectly describes it. Then the data is collected, the analysis performed, and the results filled into the paper, and it is published. No need for the reader to do the analysis (although usually they can if they wish, because registered reports are often open data/code).

I also agree with Dale below.

Certainly your straw proposal (journal paper with no analysis at all) seems absurd. But what I think would not be absurd is a journal that publishes data, along with a careful description of how it was collected, what the measurements mean, etc. In other words, all methods and data (not summary tables, but the raw data). I wouldn’t prevent people from also doing analysis, but I think getting credit for publishing a well curated data set would go a long way towards rectifying the current imbalance where it is the analysis that gets rewarded and not the creation of the data.

+1