Filling/emptying the half empty/full glass of profitable science: Different views on retiring versus retaining thresholds for statistical significance.

This post is by Keith O’Rourke and, as with all posts and comments on this blog, is just a deliberation on dealing with uncertainties in scientific inquiry and should not be attributed to any entity other than the author. As with any critically-thinking inquirer, the views behind these deliberations are always subject to rethinking and revision at any time.

Unless you are new to this blog, you likely will know what this is about.

Now, by profitable science in the title is meant repeatedly producing logically good explanations which “through subjection to the test of experiment, to lead to the avoidance of all surprise and to the establishment of a habit of positive expectation that shall not be disappointed.” – CS Peirce

It all started with a Nature commentary by Valentin Amrhein, Sander Greenland, and Blake McShane. Then came the discussion, then thinking about it, then an argument that it is sensible and practical, then an example of statistical significance not working, and then a dissenting opinion by Deborah Mayo.

Notice the lack of finally!

However, Valentin Amrhein, Sander Greenland, and Blake McShane have responded with a focused and concise account of why they think retiring statistical significance will fill up the glass of profitable science while maintaining hard default thresholds for declaring statistical significance will continue to empty it: Statistical significance gives bias a free pass. This is their just-published letter to the editor (the editor here being JPA Ioannidis) on TA Hardwicke and JPA Ioannidis’ Petitions in scientific argumentation: Dissecting the request to retire statistical significance, in which Hardwicke and Ioannidis argued (almost) the exact opposite.

“In contrast to Ioannidis, we and others hold that it is using – not retiring – statistical significance as a ‘filtering process’ or ‘gatekeeper’ that ‘gives bias a free pass’.”

A two-sentence excerpt that I liked the most was: “Instead, it [retiring statistical significance] encourages honest description of all results and humility about conclusions, thereby reducing selection and publication biases. The aim of single studies should be to report uncensored information that can later be used to make more general conclusions based on cumulative evidence from multiple studies.”

However, the full letter to the editor is only slightly longer than two pages, so it should be read in full: Statistical significance gives bias a free pass.

I also can’t help but wonder how much of the discussion that ensued from the initial Nature commentary could have been avoided if less strict page limitations had been imposed.

Now it may seem strange that an editor who is also an author on the paper drawing a critical letter to the editor would accept that letter. It happens, but not always. I also submitted a letter to the editor on this same paper, and the same editor rejected it without giving a specific reason. My full letter is below for those who might be interested.

My letter was less focused but had three main points: someone with a strong position on a topic who undertakes to do a survey themselves displaces the opportunity for others without such strong positions to learn more; univariate summaries of responses can be misleading; and (minor) pre-registration violations and the open comments (only given in the appendix) can provide insight into the quality of the design and execution of the survey. For instance, the authors had anticipated analyzing nominal responses with correlation analysis.


Those with a truly scientific attitude should look forward to (or even be thrilled by) opportunities to learn how they were wrong. This survey has provided some. In particular, potential signatories should have been informed about what a signatory was endorsing and perhaps given a list of options to choose from. Before I give my own personal sense of why I chose to be a signatory, I believe that I should first disclose that I did not respond to the survey.

This was primarily due to my lack of confidence, initially in the survey software and then in the survey itself – especially given the lack of any ethics approval. When I initially read through the online survey to get a sense of the full set of responses before I replied, it accepted my “submission” (without a confirmation prompt). I contacted the authors about the problem and they indicated they could remove my response. Later, when I clicked on my link, my unfinished survey surprisingly appeared on my screen, containing some initial text I had entered. I had not been warned about this possibility and might well have shared my link with others.

Perhaps because of these concerns, I assessed the survey itself more thoroughly. I found it poorly designed and, to me, ill thought out. The disclosure in the supplementary information that the pre-registered protocol had anticipated the examination of correlations between nominal responses suggests that I may not have been too wrong in my assessment. I would encourage interested readers to read all the open response comments in the supplementary information to get a better sense of other concerns.
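
To make the nominal-response concern concrete, here is a minimal sketch (with simulated answers and hypothetical question and category labels, not the survey’s actual data) of why correlation analysis is a poor fit for nominal responses: a Pearson correlation changes with the arbitrary numeric coding of the categories, whereas a chi-square test or Cramér’s V depends only on the contingency table.

```python
# Minimal sketch with simulated nominal responses (not the survey's data):
# Pearson correlation is not a meaningful summary for nominal answers,
# while chi-square based measures of association are coding-invariant.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
q1 = rng.choice(["A", "B", "C"], size=200)  # nominal answers to one question
# make the second question weakly depend on the first, so there is real association
probs = {"A": [0.6, 0.2, 0.2], "B": [0.2, 0.6, 0.2], "C": [0.2, 0.2, 0.6]}
q2 = np.array([rng.choice(["yes", "no", "unsure"], p=probs[v]) for v in q1])

# Pearson correlation requires numeric codes, and its value changes with the
# arbitrary coding chosen for the categories.
code_a = {"A": 0, "B": 1, "C": 2}
code_b = {"A": 2, "B": 0, "C": 1}  # a different, equally arbitrary coding
y = np.array([{"yes": 0, "no": 1, "unsure": 2}[v] for v in q2])
r_a = np.corrcoef([code_a[v] for v in q1], y)[0, 1]
r_b = np.corrcoef([code_b[v] for v in q1], y)[0, 1]
print(f"Pearson r under coding a: {r_a:.2f}; under coding b: {r_b:.2f}")

# A chi-square test and Cramer's V use only the contingency table, so they do
# not depend on how the categories happen to be labeled.
table = np.zeros((3, 3), dtype=int)
for a, b in zip(q1, q2):
    table[code_a[a], {"yes": 0, "no": 1, "unsure": 2}[b]] += 1
chi2, p, dof, _ = chi2_contingency(table)
cramers_v = np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))
print(f"chi-square p = {p:.3g}; Cramer's V = {cramers_v:.2f}")
```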

My own personal sense was that, as a signatory, I was endorsing that the commentary was worth serious consideration, along with the understanding that various arguments could be disregarded or set aside if found not compelling – something not to be ignored but rather only possibly disregarded after due consideration. It was comforting to see that in question 8 (which addressed the expected benefit of the “petition”), 83% of the respondents chose “A: I felt it would draw attention to the argument”. But the authors’ text highlights that “almost a third [of respondents] felt it would make the argument more convincing”. Now, 31% did choose “B: I felt that it would make the argument more convincing”, but the accusation that they committed a logical fallacy – the fallacy of “argumentum ad populum” – should not immediately follow.

This is unfair for two reasons. First, the sentence “B: I felt that it would make the argument more convincing” can be interpreted descriptively or normatively, and a normative interpretation is required for the logical fallacy. Many of the respondents may well have been thinking of it descriptively – although readers should not accept the authority of large numbers, many actually will. Unfortunately, as we all know, many readers of methodology papers don’t do what they should but rather what they wish. Second, the response to this question was multivariate (tick all that apply), and the univariate reporting in the main paper is potentially misleading. This is only clear in table S2 of the supplementary information. Only 6% of respondents chose B as their sole response; on the other hand, 83% of those who chose B also chose A. Arguably, the interpretation of the joint response AB is far less supportive of an accusation of committing the fallacy of “argumentum ad populum” than B on its own.
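
To illustrate the marginal-versus-joint reporting issue, here is a minimal sketch with made-up tick-all-that-apply responses (again, not the survey’s actual data): option-by-option percentages can look quite different from the full response patterns, which show how often an option was ticked on its own versus alongside another.

```python
# Minimal sketch with made-up "tick all that apply" responses; each respondent
# may tick any subset of the options A, B, C. Not the survey's actual data.
from collections import Counter

responses = [
    {"A"}, {"A"}, {"A", "B"}, {"A", "B"}, {"A", "B"}, {"A"},
    {"B"}, {"A", "C"}, {"A"}, {"A", "B"},
]
n = len(responses)

# Univariate (marginal) summary: the percentage ticking each option.
for option in ["A", "B", "C"]:
    pct = 100 * sum(option in r for r in responses) / n
    print(f"ticked {option}: {pct:.0f}%")

# Joint summary: the full response patterns, which show, for example, how often
# B was ticked on its own versus together with A.
patterns = Counter(frozenset(r) for r in responses)
for pattern, count in patterns.most_common():
    print(f"{sorted(pattern)}: {count} of {n}")

b_alone = sum(r == {"B"} for r in responses)
b_with_a = sum({"A", "B"} <= r for r in responses)
print(f"B alone: {b_alone} of {n}; B together with A: {b_with_a} of {n}")
```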

More generally, the authors chose to proceed with the survey without any ethics approval. They indicate that some sort of advice was sought on this from the leadership of QUEST, but no information about who they are or what they actually did was provided. Research ethics is not just about protecting participants and preventing conflicts of interest but, among other things, about maximizing the value of research. It is hard to believe a qualified group would have missed the incorrect anticipation of the appropriateness of analyzing nominal responses with correlation – that is, if they had carefully reviewed it. These quality issues complicate any careful interpretation of the survey and strongly suggest selective non-response.

I know that at the start of the survey there were numerous emails and twitter comments among signatories suggesting that the survey should not be responded to. There are also many wary comments from respondents in the survey responses. Given this, I think more would have been learned from the survey if a group without a previous strong position on the signatories had been asked to conduct it. If the authors had asked such a group to do this, they simply would have needed to indicate so. If another group was not available or willing to do the survey, research ethics approval should have been sought rather than presumed unnecessary.

27 thoughts on “Filling/emptying the half empty/full glass of profitable science: Different views on retiring versus retaining thresholds for statistical significance.”

  1. As an editor, I would have held off publishing the exchange back and forth between John Ioannidis & Valentin Amrhein et al at least another month after the availability of the 43 Tandfonline Stat Sig articles b/c, in my view, the exchange overshadowed the potential for an even more robust discussion about the Tandfonline articles, which seemed, at least to a few of us non-experts following statistics controversies, to be left simmering on a back burner.

    Re Survey: I welcome any query [including surveys] that sheds light on the sociology of expertise, for the latter has continued to prevent or hasten the potential for better decisions.

    Lastly, I thought that the survey excluded the email addresses when all was said and done. Sorry, I just didn’t pay that much attention to the chronology of the surveying.

  2. “In contrast to Ioannidis, we and others hold that it is using – not retiring – statistical significance as a ‘filtering process’ or ‘gatekeeper’ that ‘gives bias a free pass’.”

    But it didn’t actually work that way in BASP when it wasn’t allowed. Ricker et al., in “Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban” (https://www.tandfonline.com/doi/abs/10.1080/00031305.2018.1537892), write:

    “In this article, we assess the 31 articles published in Basic and Applied Social Psychology (BASP) in 2016, which is one full year after the BASP editors banned the use of inferential statistics…. We found multiple instances of authors overstating conclusions beyond what the data would support if statistical significance had been considered. Readers would be largely unable to recognize this because the necessary information to do so was not readily available.”

    Also, please read “So you banned p-values, how’s that working out for you?” (http://daniellakens.blogspot.com/2016/02/so-you-banned-p-values-hows-that.html) by Lakens.

    Using the public misunderstanding of the ‘chocolate good for you’ vs. ‘chocolate now bad for you’ type as an argument is silly (although they use results from a frequentist survey to make their point, oddly enough). Frequentist methods of course allow for error, and this ‘flip flopping’ based on stochastic data or different findings/experiments would not be solved by Bayesian methods either; a meta-analysis would be more ‘conclusive’ than individual studies.

    Moreover, “bias” with alpha levels is being used as a boogeyman. P-values were used in recent Nobel prize winning science as well as in analyzing data to determine quantum supremacy. See http://www.statisticool.com/nobelprize.htm and http://www.statisticool.com/quantumcomputing.htm for some examples.

    Justin

    • Yep, if they are still testing a strawman hypothesis it is still NHST, doesn’t matter how you do it.

      And I have no idea why you would have such respect for the Nobel prize, it is a political event.

      And quantum supremacy… No one has used quantum anything for any practical purpose. Just like GR, there are empirical fudges to the classical theory they use supposedly based on the “real” theory, but no one actually uses quantum mechanics for real stuff.

      • E.g., if they were actually using quantum mechanics to make CPUs, these strange behaviors would be predicted beforehand:

        Quantum effects are becoming more pronounced at the most advanced nodes, causing unusual and sometimes unexpected changes in how electronic devices and signals behave.

        Quantum effects typically occur well behind the curtain for most of the chip industry, baked into a set of design rules developed from foundry data that most companies never see.

        […]

        That little bit further away from the interface is a quantum effect, and at 130/90/65nm it became a measurable delta in the behavior of the inversion capacitance. We went off and studied, learned it, and built it into our predictive device models.
        https://semiengineering.com/quantum-effects-at-7-5nm/#comment-238213

        Instead, it is all empirically determined deviations from the classical models.

        • Check the semi-engineering link for an example of the state of the art. It is all post hoc quantum explanations for empirical fudges. It is just too computationally expensive to a priori calculate the consequences of QM for any practical application… so far.

        • This has a component of truth. Generally my impression is that some people use intuition about quantum things to guess that some phenomenon should happen, and then, for the most part, a bunch of experiments are run to tune that thing. No one figures out that the optimum doping rate is 2.3% or whatever by first-principles QM calculations.

          The same thing is still sort of true for CFM: you use computational fluid mechanics to show that adding dimples lowers drag or whatever, but you still validate in a wind tunnel. The CFM is mainly a way to avoid running so many physical tests.

          Lots of molecular dynamics are Newtonian calculations with tweaks to the potential energy that account for effective properties of QM… like multi-body potentials or whatever. That’s what my solid state physics friend said anyway.

        • Yes, essentially what they use in practice could be called “inspired by” QM. But due to the heavy empirical component, afaict knowledge of QM is not necessary in any way to make a modern CPU/etc.

    • And in my world the ONLY thing the courts have picked up on in this debate is that statistical significance is out and thus: “Plaintiffs’ expert submitted an expert report indicating the two corrective disclosure dates at issue here returned p-values of 0.234 and 0.233. These p-values suggest there is a 77 percent chance …the corrective disclosures … negatively impacted EZCORP’s stock price … What they do not suggest is that the misrepresentation ‘did not affect the stock price.'” Nowadays if a man’s savings are to be seized all that’s required is p=.49 or less. https://scholar.google.com/scholar_case?case=17598698939675929167

      Perhaps a b-value, for bias, is in order.

      • According to Susan Haack, the George Mason Law & Economics program has been featuring briefings and conferences about Science & the Law in litigation. It’s time for another robust effort given that several science/law educational efforts were defunded.

      • Judging Science by Foster and Huber goes into this in great detail. It is an analysis of the Daubert Supreme Court Decision. In that case, the argument of p=.49 or less was actually presented – and the subject of considerable expert and legal argument. It makes very interesting reading.

    • The elephants in the room are ‘conflicts of interest’. One of the underlying fears, rarely stated, is that if we were to arrive at accurate outcomes, we might see a substantial reduction in products and services. We have continued to focus on quantification to an unprecedented degree in drawing inferences/outcomes. An entire field may be relegated to the dustbin of history.

      I think John Ioannidis’ ‘How Evidence-Based Medicine Has Been Hijacked’ prompts a much deeper discussion. In some sense, the ‘stat sig and p-value’ subject has become more of a focused distraction. Please, please feel free to suggest I’m off base.

      Actually, in reviewing Susan Haack’s work, it is uncanny that I have made some of the same arguments even before I came across her recent lectures. Both of us acknowledge substantial skepticism over claims and whistles about the ‘scientific method’.

    • Ah, I see Justin is STILL talking about the Nobel & why we should all be using NHST because some Nobel winners might have. Fun. RE: the BASP example, I agree that some of the stuff that authors were doing was pretty crummy in that issue. All that revealed to me was that folks trained weakly in stats were submitting papers to the ‘journal without p-values’ and were using it as a means to continue to publish in spite of weak data and/or an inability to construct arguments without appeals to ‘significance’. Reviewers shouldn’t have accepted that, and yet they did. Is that an example that misguided bans can be problematic? Absolutely. But, to me, what that really demonstrates is that so many folks in disciplines like psychology, poli sci, etc. quite literally cannot think about data without some benchmark of ‘significance’. Their entire studies are designed around obtaining it, rather than answering a substantive question about the world. That is pretty damning of the quality of research and inference. But it is certainly not strong enough evidence to claim that moving away from p-values would be detrimental to science.

      • “Ah, I see Justin is STILL talking about the Nobel & why we should all be using NHST because some Nobel winners might have. Fun.”

        Actually, I don’t care what you do DC, :) , but do check out http://www.statisticool.com/nobelprize.htm

        And don’t forget about using p-values and statistical significance to assess evidence for quantum supremacy: http://www.statisticool.com/quantumcomputing.htm

        Many more examples to come, from vaccinations, fluoride, water safety and other sciencey stuff.

        Cheers,
        Justin

        • Many more examples to come, from vaccinations, fluoride, water safety and other sciencey stuff.

          What a strange list. All of those are instances where governments are forcing medical procedures onto their citizens. That would explain the continued funding of NHST by the government…

  3. I don’t think there should be any rules on how people present their data or test their results.

    Researchers should use whatever method best elucidates/analyzes the relationships in their data at the relevant scale of observation and for the purpose of the study.

    Some approaches that have been or still are in wide use clearly have problems, and the scope of conclusions that can be drawn from them has been dramatically overstated. The simple p < 0.05 method has some utility at a coarse level for certain purposes. But the occurrence of p < 0.05 alone isn't proof of *any* hypothesis.

    But from what I've seen, most research that doesn't replicate "significance" has worse problems than the utilization of p-values as the exclusive arbiter of truth. It also has a poor theoretical basis, and what theoretical basis does exist relies on unsatisfied and frequently unrecognized assumptions. None of this will change simply by changing the p-value rule.

    Also, regardless of how P-values are used, **all** research needs to be tested and replicated multiple times. Again, changing p-value rules won't change that either.

    What people should be demanding is much higher quality experiments, higher quality analytical methods, more open data, ***A LOT*** more general skepticism about results, and ***A LOT*** more replication.

    • > I don’t think there should be any rules on how people present their data or test their results.

      I suggest a “meta rule”. Everyone should make a logical argument about why their conclusion makes sense. Whether you use stats or some other method, it had better follow some logic that isn’t based on verifiably incorrect assumptions.

      The attempt to short-cut this is at the heart of most of the examples of bad science we see on this blog. For example from today: https://statmodeling.stat.columbia.edu/2019/11/05/the-incentives-are-all-wrong-causal-inference-edition/

      • Daniel said:
        “I suggest a “meta rule”. Everyone should make a logical argument about why their conclusion makes sense. Whether you use stats or some other method, it had better follow some logic that isn’t based on verifiably incorrect assumptions.

        The attempt to short-cut this is at the heart of most of the examples of bad science we see on this blog.”

        +1/2 – I’d modify it to “Everyone should make a logical argument about why their assumptions and reasoning to their conclusion make sense.”

  4. > “I also can’t help but wonder how much of the discussion that ensued from the initial Nature commentary could have been avoided if less strict page limitations had been allowed.”

    Well, a lot of the discussion also continued with these two preprints:

    1. https://arxiv.org/abs/1909.08579

    2. https://arxiv.org/abs/1909.08583

    especially part 2, which was also discussed on this blog (https://statmodeling.stat.columbia.edu/2019/09/24/chow-and-greenland-unconditional-interpretations-of-statistics/)

    • Zad:

      I was thinking that if some of the material in those two links had been allowed in the original Nature commentary then there would have been less heated “fireworks”.

      On the other hand, maybe that sort of outward clash is needed to move forward – maybe that’s why the two links are now so informative.

      p.s. I am in the process of making this https://statmodeling.stat.columbia.edu/2019/10/15/the-virtue-of-fake-universes-a-purposeful-and-safe-way-to-explain-empirical-inference/ compatible with those two links.

    • Ah, I see that I missed the better part of the prior discussion on Andrew’s blog. What I was trying to suggest in that last discussion was that the extent of disagreements among some expert circles who also engage in technical explanations makes it necessary for those outside those circles [consumers of expertise & patients] to evaluate the explanations independently b/c conflicts of interest & turf considerations can skew the explanations. I listened to Sander’s February presentation at McMaster University, and I am in agreement with much of what he said. I don’t think Sander needed to appeal to any authority though lol. Well, he can review what he said. I took away that statistics was in a crisis state. That can be of little comfort to most.

      • John Ioannidis would have to respond to your question, Zad. In viewing nearly all of John’s lectures, I took away that both stat sig and p-values were misapplied and overutilized in biomedical journals. So I was surprised at John’s response to Valentin et al.
