Filling/emptying the half empty/full glass of profitable science: Different views on retiring versus retaining thresholds for statistical significance.

This post is by Keith O’Rourke and, as with all posts and comments on this blog, is just a deliberation on dealing with uncertainties in scientific inquiry and should not be attributed to any entity other than the author. As with any critically-thinking inquirer, the views behind these deliberations are always subject to rethinking and revision at any time.

Unless you are new to this blog, you likely will know what this is about.

Now, by profitable science in the title is meant repeatedly producing logically good explanations which “through subjection to the test of experiment, to lead to the avoidance of all surprise and to the establishment of a habit of positive expectation that shall not be disappointed.” – CS Peirce

It all started with a Nature commentary by Valentin Amrhein, Sander Greenland, and Blake McShane. Then came the discussion, then thinking about it, then an argument that it is sensible and practical, then an example of statistical significance not working, and then a dissenting opinion by Deborah Mayo.

Notice the lack of finally!

However, Valentin Amrhein, Sander Greenland, and Blake McShane have responded with a focused and concise account of why they think retiring statistical significance will fill up the glass of profitable science while maintaining hard default thresholds for declaring statistical significance will continue to empty it: Statistical significance gives bias a free pass. This is their just-published letter to the editor (the editor here being JPA Ioannidis) on TA Hardwicke and JPA Ioannidis’ Petitions in scientific argumentation: Dissecting the request to retire statistical significance, in which Hardwicke and Ioannidis argued (almost) the exact opposite.

“In contrast to Ioannidis, we and others hold that it is using – not retiring – statistical significance as a ‘filtering process’ or ‘gatekeeper’ that ‘gives bias a free pass’.”

A two-sentence excerpt that I liked the most was: “Instead, it [retiring statistical significance] encourages honest description of all results and humility about conclusions, thereby reducing selection and publication biases. The aim of single studies should be to report uncensored information that can later be used to make more general conclusions based on cumulative evidence from multiple studies.”

However, the full letter to the editor is only slightly longer than two pages, so it should be read in full: Statistical significance gives bias a free pass.

I also can’t help but wonder how much of the discussion that ensued from the initial Nature commentary could have been avoided if less strict page limitations had been imposed.

Now it may seem strange that an editor who is also an author on the paper drawing a critical letter to the editor would accept that letter. It happens, but not always. I also submitted a letter to the editor on this same paper, and the same editor rejected it without giving a specific reason. My full letter is below for those who might be interested.

My letter was less focused but had three main points: someone with a strong position on a topic who undertakes to do a survey themselves displaces the opportunity for others without such strong positions to learn more; univariate summaries of responses can be misleading; and (minor) pre-registration violations and the open comments (only given in the appendix) can provide insight into the quality of the design and execution of the survey. For instance, the authors had anticipated analyzing nominal responses with correlation analysis.


Those with a truly scientific attitude should look forward to (or even be thrilled by) opportunities to learn how they were wrong. This survey has provided some. In particular, potential signatories should have been informed about what a signatory was endorsing and perhaps given a list of options to choose from. Before I give my own personal sense of why I chose to be a signatory, I believe that I should first disclose that I did not respond to the survey.

This was primarily due to my lack of confidence, initially in the survey software and then in the survey itself – especially given the lack of any ethics approval. When I initially read through the online survey to get a sense of the full set of responses before I replied, it accepted my “submission” (without a confirmation prompt). I contacted the authors about the problem and they indicated they could remove my response. Later, when I clicked on my link, my unfinished survey surprisingly appeared on my screen, containing some initial text I had entered. I had not been warned about this possibility and might well have shared my link with others.

Perhaps because of these concerns, I assessed the survey itself more thoroughly. I found it poorly designed and, to me, ill thought out. The disclosure in the supplementary information that the pre-registered protocol had anticipated the examination of correlations between nominal responses suggests that I may not have been too wrong in my assessment. I would encourage interested readers to read all the open response comments in the supplementary information to get a better sense of other concerns.
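
To make the nominal-response concern concrete, here is a minimal sketch (with simulated answers and hypothetical question and category labels, not the survey’s actual data) of why correlation analysis is a poor fit for nominal responses: a Pearson correlation changes with the arbitrary numeric coding of the categories, whereas a chi-square test or Cramér’s V depends only on the contingency table.

```python
# Minimal sketch with simulated nominal responses (not the survey's data):
# Pearson correlation is not a meaningful summary for nominal answers,
# while chi-square based measures of association are coding-invariant.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
q1 = rng.choice(["A", "B", "C"], size=200)  # nominal answers to one question
# make the second question weakly depend on the first, so there is real association
probs = {"A": [0.6, 0.2, 0.2], "B": [0.2, 0.6, 0.2], "C": [0.2, 0.2, 0.6]}
q2 = np.array([rng.choice(["yes", "no", "unsure"], p=probs[v]) for v in q1])

# Pearson correlation requires numeric codes, and its value changes with the
# arbitrary coding chosen for the categories.
code_a = {"A": 0, "B": 1, "C": 2}
code_b = {"A": 2, "B": 0, "C": 1}  # a different, equally arbitrary coding
y = np.array([{"yes": 0, "no": 1, "unsure": 2}[v] for v in q2])
r_a = np.corrcoef([code_a[v] for v in q1], y)[0, 1]
r_b = np.corrcoef([code_b[v] for v in q1], y)[0, 1]
print(f"Pearson r under coding a: {r_a:.2f}; under coding b: {r_b:.2f}")

# A chi-square test and Cramer's V use only the contingency table, so they do
# not depend on how the categories happen to be labeled.
table = np.zeros((3, 3), dtype=int)
for a, b in zip(q1, q2):
    table[code_a[a], {"yes": 0, "no": 1, "unsure": 2}[b]] += 1
chi2, p, dof, _ = chi2_contingency(table)
cramers_v = np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))
print(f"chi-square p = {p:.3g}; Cramer's V = {cramers_v:.2f}")
```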

My own personal sense was that, as a signatory, I was endorsing that the commentary was worth serious consideration, along with the understanding that various arguments could be disregarded or set aside if found not compelling – something not to be ignored but rather only possibly disregarded after due consideration. It was comforting to see that in question 8 (which addressed the expected benefit of the “petition”), 83% of the respondents chose “A: I felt it would draw attention to the argument”. But the authors’ text highlights that “almost a third [of respondents] felt it would make the argument more convincing”. Now, 31% did choose “B: I felt that it would make the argument more convincing”, but the accusation that they committed a logical fallacy – the fallacy of “argumentum ad populum” – should not immediately follow.

This is unfair for two reasons. First, the sentence “B: I felt that it would make the argument more convincing” can be interpreted descriptively or normatively, and a normative interpretation is required for the logical fallacy. Many of the respondents may well have been thinking of it descriptively – although readers should not accept the authority of large numbers, many actually will. Unfortunately, as we all know, many readers of methodology papers don’t do what they should but rather what they wish. Second, the response to this question was multivariate (tick all that apply), and the univariate reporting in the main paper is potentially misleading. This is only clear in table S2 of the supplementary information. Only 6% of respondents chose B as their sole response; on the other hand, 83% of those who chose B also chose A. Arguably, the interpretation of the joint response AB is far less supportive of an accusation of committing the fallacy of “argumentum ad populum” than B on its own.
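
To illustrate the marginal-versus-joint reporting issue, here is a minimal sketch with made-up tick-all-that-apply responses (again, not the survey’s actual data): option-by-option percentages can look quite different from the full response patterns, which show how often an option was ticked on its own versus alongside another.

```python
# Minimal sketch with made-up "tick all that apply" responses; each respondent
# may tick any subset of the options A, B, C. Not the survey's actual data.
from collections import Counter

responses = [
    {"A"}, {"A"}, {"A", "B"}, {"A", "B"}, {"A", "B"}, {"A"},
    {"B"}, {"A", "C"}, {"A"}, {"A", "B"},
]
n = len(responses)

# Univariate (marginal) summary: the percentage ticking each option.
for option in ["A", "B", "C"]:
    pct = 100 * sum(option in r for r in responses) / n
    print(f"ticked {option}: {pct:.0f}%")

# Joint summary: the full response patterns, which show, for example, how often
# B was ticked on its own versus together with A.
patterns = Counter(frozenset(r) for r in responses)
for pattern, count in patterns.most_common():
    print(f"{sorted(pattern)}: {count} of {n}")

b_alone = sum(r == {"B"} for r in responses)
b_with_a = sum({"A", "B"} <= r for r in responses)
print(f"B alone: {b_alone} of {n}; B together with A: {b_with_a} of {n}")
```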

More generally, the authors chose to proceed with the survey without any ethics approval. They indicate that some sort of advice was sought on this from the leadership of QUEST, but no information about who they are or what they actually did was provided. Research ethics is not just about protecting participants and preventing conflicts of interest but, among other things, about maximizing the value of research. It is hard to believe a qualified group would have missed the incorrect anticipation of the appropriateness of analyzing nominal responses with correlation – that is, if they had carefully reviewed it. These quality issues complicate any careful interpretation of the survey and strongly suggest selective non-response.

I know that at the start of the survey there were numerous emails and twitter comments among signatories suggesting that the survey should not be responded to. There are also many wary comments from respondents in the survey responses. Given this, I think more would have been learned from the survey if a group without a previous strong position on the signatories had been asked to conduct it. If the authors had asked such a group to do this, they simply would have needed to indicate so. If another group was not available or willing to do the survey, research ethics approval should have been sought rather than presumed unnecessary.

27 thoughts on “Filling/emptying the half empty/full glass of profitable science: Different views on retiring versus retaining thresholds for statistical significance.”

  1. As an editor, I would have held off publishing the exchange back and forth between John Ioannidis & Valentin Amrhein et al at least another month after the availability of the 43 Tandfonline Stat Sig articles b/c, in my view, the exchange overshadowed the potential for an even more robust discussion about the Tandfonline articles, which seemed, at least to a few of us non-experts following statistics controversies, to be left simmering on a back burner.

    Re Survey: I welcome any query [including surveys] that sheds light on the sociology of expertise, for the latter has continued to prevent or hasten the potential for better decisions.

    Lastly, I thought that the survey excluded the email addresses when all was said and done. Sorry, I just didn’t pay that much attention to the chronology of the surveying.

  2. “In contrast to Ioannidis, we and others hold that it is using – not retiring – statistical significance as a ‘filtering process’ or ‘gatekeeper’ that ‘gives bias a free pass’.”

    But it didn’t actually work that way in BASP when it wasn’t allowed. Ricker et al., in “Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban” (https://www.tandfonline.com/doi/abs/10.1080/00031305.2018.1537892), write:

    “In this article, we assess the 31 articles published in Basic and Applied Social Psychology (BASP) in 2016, which is one full year after the BASP editors banned the use of inferential statistics…. We found multiple instances of authors overstating conclusions beyond what the data would support if statistical significance had been considered. Readers would be largely unable to recognize this because the necessary information to do so was not readily available.”

    Also, please read “So you banned p-values, how’s that working out for you?” (http://daniellakens.blogspot.com/2016/02/so-you-banned-p-values-hows-that.html) by Lakens.

    Using the public misunderstanding of the ‘chocolate good for you’ vs. ‘chocolate now bad for you’ type as an argument is silly (although they use results from a frequentist survey to make their point, oddly enough). Frequentist methods of course allow for error, and this ‘flip flopping’ based on stochastic data or different findings/experiments would not be solved by Bayesian methods either; a meta-analysis would be more ‘conclusive’ than individual studies.

    Moreover, “bias” with alpha levels is being used as a boogeyman. P-values were used in recent Nobel prize winning science as well as in analyzing data to determine quantum supremacy. See http://www.statisticool.com/nobelprize.htm and http://www.statisticool.com/quantumcomputing.htm for some examples.

    Justin

    • Yep, if they are still testing a strawman hypothesis it is still NHST, doesn’t matter how you do it.

      And I have no idea why you would have such respect for the Nobel prize, it is a political event.

      And quantum supremacy… No one has used quantum anything for any practical purpose. Just like GR, there are empirical fudges to the classical theory they use supposedly based on the “real” theory, but no one actually uses quantum mechanics for real stuff.

      • E.g., if they were actually using quantum mechanics to make CPUs, these strange behaviors would be predicted beforehand:

        Quantum effects are becoming more pronounced at the most advanced nodes, causing unusual and sometimes unexpected changes in how electronic devices and signals behave.

        Quantum effects typically occur well behind the curtain for most of the chip industry, baked into a set of design rules developed from foundry data that most companies never see.

        […]

        That little bit further away from the interface is a quantum effect, and at 130/90/65nm it became a measurable delta in the behavior of the inversion capacitance. We went off and studied, learned it, and built it into our predictive device models.
        https://semiengineering.com/quantum-effects-at-7-5nm/#comment-238213

        Instead, it is all empirically determined deviations from the classical models.

        • Check the semi-engineering link for an example of the state of the art. It is all post hoc quantum explanations for empirical fudges. It is just too computationally expensive to a priori calculate the consequences of QM for any practical application… so far.

        • This has a component of truth. Generally my impression is that some people use intuition about quantum things to guess that some phenomenon should happen, and then, for the most part, a bunch of experiments are run to tune that thing. No one figures out that the optimum doping rate is 2.3% or whatever by first-principles QM calculations.

          The same thing is still sort of true for CFM: you use computational fluid mechanics to show that adding dimples lowers drag or whatever, but you still validate in a wind tunnel. The CFM is mainly a way to avoid running so many physical tests.

          Lots of molecular dynamics are Newtonian calculations with tweaks to the potential energy that account for effective properties of QM… like multi-body potentials or whatever. That’s what my solid state physics friend said anyway.

        • Yes, essentially what they use in practice could be called “inspired by” QM. But due to the heavy empirical component, afaict knowledge of QM is not necessary in any way to make a modern CPU/etc.

    • And in my world the ONLY thing the courts have picked up on in this debate is that statistical significance is out and thus: “Plaintiffs’ expert submitted an expert report indicating the two corrective disclosure dates at issue here returned p-values of 0.234 and 0.233. These p-values suggest there is a 77 percent chance …the corrective disclosures … negatively impacted EZCORP’s stock price … What they do not suggest is that the misrepresentation ‘did not affect the stock price.'” Nowadays if a man’s savings are to be seized all that’s required is p=.49 or less. https://scholar.google.com/scholar_case?case=17598698939675929167

      Perhaps a b-value, for bias, is in order.

      • According to Susan Haack, the George Mason Law & Economics program has been featuring briefings and conferences about Science & the Law in litigation. It’s time for another robust effort given that several science/law educational efforts were defunded.

      • Judging Science by Foster and Huber goes into this in great detail. It is an analysis of the Daubert Supreme Court Decision. In that case, the argument of p=.49 or less was actually presented – and the subject of considerable expert and legal argument. It makes very interesting reading.

    • The elephants in the room are ‘conflicts of interest’. One of the underlying fears, rarely stated, is that if we were to arrive at accurate outcomes, we might see a substantial reduction in products and services. We have continued to focus on quantification to an unprecedented degree in drawing inferences/outcomes. An entire field may be relegated to the dustbin of history.

      I think John Ioannidis’ ‘How Evidence-Based Medicine Has Been Hijacked’ prompts a much deeper discussion. In some sense, the ‘stat sig and p-value’ subject has become more of a focused distraction. Please, please feel free to suggest I’m off base.

      Actually, in reviewing Susan Haack’s work, it is uncanny that I have made some of the same arguments even before I came across her recent lectures. Both of us acknowledge substantial skepticism over claims and whistles about the ‘scientific method’.

    • Ah, I see Justin is STILL talking about the Nobel & why we should all be using NHST because some Nobel winners might have. Fun. RE: the BASP example, I agree that some of the stuff that authors were doing was pretty crummy in that issue. All that revealed to me was that folks trained weakly in stats were submitting papers to the ‘journal without p-values’ and were using it as a means to continue to publish in spite of weak data and/or an inability to construct arguments without appeals to ‘significance’. Reviewers shouldn’t have accepted that, and yet they did. Is that an example that misguided bans can be problematic? Absolutely. But, to me, what that really demonstrates is that so many folks in disciplines like psychology, poli sci, etc. quite literally cannot think about data without some benchmark of ‘significance’. Their entire studies are designed around obtaining it, rather than answering a substantive question about the world. That is pretty damning of the quality of research and inference. But it is certainly not strong enough evidence to claim that moving away from p-values would be detrimental to science.

      • “Ah, I see Justin is STILL talking about the Nobel & why we should all be using NHST because some Nobel winners might have. Fun.”

        Actually, I don’t care what you do DC, :) , but do check out http://www.statisticool.com/nobelprize.htm

        And don’t forget about using p-values and statistical significance to assess evidence for quantum supremacy: http://www.statisticool.com/quantumcomputing.htm

        Many more examples to come, from vaccinations, fluoride, water safety and other sciencey stuff.

        Cheers,
        Justin

        • Many more examples to come, from vaccinations, fluoride, water safety and other sciencey stuff.

          What a strange list. All of those are instances where governments are forcing medical procedures onto their citizens. That would explain the continued funding of NHST by the government…

  3. I don’t think there should be any rules on how people present their data or test their results.

    Researchers should use whatever method best elucidates/analyzes the relationships in their data at the relevant scale of observation and for the purpose of the study.

    Some approaches that have been or still are in wide use clearly have problems, and the scope of conclusions that can be drawn from them has been dramatically overstated. The simple p < 0.05 method has some utility at a coarse level for certain purposes. But the occurrence of p < 0.05 alone isn't proof of *any* hypothesis.

    But from what I've seen, most research that doesn't replicate "significance" has worse problems than the utilization of p-values as the exclusive arbiter of truth. It also has a poor theoretical basis, and what theoretical basis does exist relies on unsatisfied and frequently unrecognized assumptions. None of this will change simply by changing the p-value rule.

    Also, regardless of how P-values are used, **all** research needs to be tested and replicated multiple times. Again, changing p-value rules won't change that either.

    What people should be demanding is much higher quality experiments, higher quality analytical methods, more open data, ***A LOT*** more general skepticism about results, and ***A LOT*** more replication.

    • > I don’t think there should be any rules on how people present their data or test their results.

      I suggest a “meta rule”. Everyone should make a logical argument about why their conclusion makes sense. Whether you use stats or some other method, it had better follow some logic that isn’t based on verifiably incorrect assumptions.

      The attempt to short-cut this is at the heart of most of the examples of bad science we see on this blog. For example from today: https://statmodeling.stat.columbia.edu/2019/11/05/the-incentives-are-all-wrong-causal-inference-edition/

      • Daniel said:
        “I suggest a “meta rule”. Everyone should make a logical argument about why their conclusion makes sense. Whether you use stats or some other method, it had better follow some logic that isn’t based on verifiably incorrect assumptions.

        The attempt to short-cut this is at the heart of most of the examples of bad science we see on this blog.”

        +1/2 – I’d modify it to “Everyone should make a logical argument about why their assumptions and reasoning to their conclusion make sense.”

  4. > “I also can’t help but wonder how much of the discussion that ensued from the initial Nature commentary could have been avoided if less strict page limitations had been allowed.”

    Well, a lot of the discussion also continued with these two preprints:

    1. https://arxiv.org/abs/1909.08579

    2. https://arxiv.org/abs/1909.08583

    especially part 2, which was also discussed on this blog (https://statmodeling.stat.columbia.edu/2019/09/24/chow-and-greenland-unconditional-interpretations-of-statistics/)

    • Zad:

      I was thinking that if some of the material in those two links had been allowed in the original Nature commentary then there would have been less heated “fireworks”.

      On the other hand, maybe that sort of outward clash is needed to move forward – maybe that’s why the two links are now so informative.

      p.s. I am in the process of making this https://statmodeling.stat.columbia.edu/2019/10/15/the-virtue-of-fake-universes-a-purposeful-and-safe-way-to-explain-empirical-inference/ compatible with those two links.

    • Ah, I see that I missed the better part of the prior discussion on Andrew’s blog. What I was trying to suggest in that last discussion was that the extent of disagreements among some expert circles who also engage in technical explanations makes it necessary for those outside those circles [consumers of expertise & patients] to evaluate the explanations independently b/c conflicts of interest & turf considerations can skew the explanations. I listened to Sander’s February presentation at McMaster University, and I am in agreement with much of what he said. I don’t think Sander needed to appeal to any authority though lol. Well, he can review what he said. I took away that statistics was in a crisis state. That can be of little comfort to most.

      • John Ioannidis would have to respond to your question, Zad. In viewing nearly all of John’s lectures, I took away that both stat sig and p-values were misapplied and overutilized in biomedical journals. So I was surprised at John’s response to Valentin et al.
