Thoughts on “The American Statistical Association President’s Task Force Statement on Statistical Significance and Replicability”

Megan Higgs writes:

The statement . . . describes establishment of the task force to “address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of ‘p<0.05’ and ‘statistically significant’ in statistical analysis.)” The authors go on to more specifically identify the purpose of the statement as “two-fold: to clarify that the use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned, and to briefly set out some principles of sound statistical inference that may be useful to the scientific community.”

The task force includes several prominent academic and government statisticians (including lots of people I know personally), and both of its goals—clarifying the ASA’s official position and giving its own recommendations—seem valuable to me.

Following Megan, I’d say the task force succeeded in goal #1 and failed in goal #2.

Goal #1—clarifying the ASA’s official position—was simple but it still had to be done. A few years ago the ASA had a committee that in 2016 released a statement on statistical significance and p-values. I was on this committee, along with other difficult people such as Sander Greenland—I mean “difficult” in a good way here!—and I agreed with much but not all of the statement. My response, “The problems with p-values are not just with p-values,” was published here. Various leading statisticians disagreed more strongly with that committee’s report. I think it’s fair to say that that earlier report is not official ASA policy, and it’s good for this new report to clarify this point.

I can tell a story along these lines. A few years ago I happened to be speaking at a biostatistics conference that was led off by Nicole Lazar, one of the authors of the 2016 report. Lazar gave a strong talk promoting the idea that statistical methods should more clearly convey uncertainty, and she explained how presenting results as a string of p-values doesn’t do that. (It’s the usual story: p-values are noisy data summaries, they’re defined relative to a null hypothesis that is typically of no interest, the difference between p-values of 0.01 and 0.20 can easily be explained from pure chance variation, etc etc., if you need your memory refreshed you can read the above-linked statement from 2016.) It was good stuff, and the audience was alert and interested. I was happy to see this change in the world. But later in that day someone else gave a talk from a very traditional perspective. It wasn’t a terrible talk, but all the reasoning was based on p-values, and I was concerned that the researchers were to some extent chasing noise without realizing it. It was the usual situation, where a story was pieced together using different comparisons that happened to be statistically significant or not. But what really upset me was not the talk itself but that the audience were completely cool with it. It was as if Lazar’s talk had never happened!
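
To see how easily pure chance can separate a p-value of 0.01 from one of 0.20, here is a minimal simulation sketch (toy numbers of my own, not anything from the talks): replications of the same modest-effect study, analyzed the same way, routinely land on opposite sides of the conventional threshold.

# Two "identical" studies of the same true effect can easily return p-values
# as far apart as 0.01 and 0.20 by chance alone. Illustrative numbers only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, sigma, n = 0.3, 1.0, 50   # a modest effect, made-up study size

def one_study():
    y = rng.normal(true_effect, sigma, n)
    return stats.ttest_1samp(y, 0.0).pvalue

pvals = np.array([one_study() for _ in range(10_000)])
print("share of replications with p < 0.05:", (pvals < 0.05).mean())
print("5th and 95th percentiles of p:", np.percentile(pvals, [5, 95]))
# The spread is huge: the same experiment yields p < 0.01 in some runs and
# p > 0.20 in others, so that difference is weak evidence of anything.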

Now, just to be clear, this was just my impression. My point is not that other talk was wrong. It was operating in a paradigm that I don’t trust, but I did not try to track down all the details, and the research might have been just fine. My point only is that (a) it’s far from a consensus that statistics via null hypothesis significance testing is a problem, and (b) Lazar’s talk was well received, but after it was over the statisticians in that room seemed to spring right back to the old mode of thinking. So, yeah, whether or not that 2016 statement can be considered official ASA policy, I don’t think it should be considered as such, given that there is such a wide range of views within the profession.

Goal #2—giving its own recommendations—is another story. For the reasons stated by Megan, I disagree with much of this new statement, and overall I’m unhappy with it. For example, the statement says, “P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results.” Here’s Megan:

(1) Stating “P-values are valid statistical measures” says nothing of when they are or are not valid (or any gray area in between) – instead, it implies they are always valid (especially to those who want that to be the case); (2) I completely agree that they “provide convenient conventions,” but that is not a good reason for using them and works against positive change relative to their use; and (3) I don’t think p-values do a good job “communicating uncertainty” and definitely not The uncertainty inherent in quantitative results as the sentence might imply to some readers. To be fair, I can understand how the authors of the statement could come to feeling okay with the sentence through the individual disclaimers they carry in their own minds, but those disclaimers are invisible to readers. In general, I am worried about how the sentence might be used to justify continuing with poor practices. I envision the sentence being quoted again and again by those who do not want to change their use of p-values in practice and need some official, yet vague, statement of the broad validity of p-values and the value of “convenience.” This is not what we need to improve scientific practice.

Also this:

The last general principle provided is: “In summary, P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data.” Hard to know where to start with this one. It repeats the dangers I have already discussed. It can easily be used as justification for continuing poor practices because the issue is a lack of agreement or understanding of what is “proper” and what counts as “rigor.” As is, I don’t agree with such a general statement as “increase the rigor of the conclusions.” Too broad. Too big. Too little justification for such a statement. Again, I’m not sure what a practicing scientist is to take away from this that will “aid researchers on all areas of science” as Kafadar states in the accompanying editorial. Scientists do not need vague, easily quotable and seemingly ASA-backed statements to defend their use of current practices that might be questionable – or at least science doesn’t need scientists to have them.

I was also unhappy with the report’s statement, “Thresholds are helpful when actions are required.” It depends on the action! If the action is whether a journal should publish a paper, no, I don’t think a numerical threshold is helpful. If the action is whether to make a business decision or to approve a drug, then a threshold can be helpful, but I think the threshold should depend on costs and benefits, not the data alone, and not on a p-value. McShane et al. discuss that here. I think the whole threshold thing is a pseudo-practicality. I’m as practical as anybody and I don’t see the need for thresholds at all.
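
To make the cost-benefit point concrete, here is a toy sketch (all numbers invented, and a flat prior used only for simplicity): the sensible action can be to go ahead even when the p-value misses 0.05, because the decision turns on expected costs and benefits rather than on a threshold.

# Toy decision sketch with made-up numbers: act on expected net benefit under
# the posterior, not on whether p crosses 0.05.
import numpy as np
from scipy import stats

y_bar, se = 0.8, 0.5                    # observed effect estimate and standard error
p_value = 2 * stats.norm.sf(abs(y_bar / se))
print("p-value:", round(p_value, 3))    # about 0.11: fails the conventional threshold

# Posterior for the effect under a flat prior, for simplicity: Normal(y_bar, se).
draws = np.random.default_rng(1).normal(y_bar, se, 100_000)
benefit_per_unit, cost_of_action = 10.0, 3.0    # invented costs and benefits
expected_net = benefit_per_unit * draws.mean() - cost_of_action
print("expected net benefit of acting:", round(expected_net, 2))
# With these (invented) stakes, acting is favored despite p > 0.05; change the
# costs and benefits and the decision can flip with the p-value unchanged.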

This new report avoided including difficult people like Sander and me, so I guess they had no problem forming a consensus. Like Megan, I have my doubts as to whether this sort of consensus is a good thing. I expressed this view last year, and Megan’s post leaves me feeling that I was right to be concerned.

This then raises the question: how is it that a group of experts I like and respect so much could come up with a statement that I find so problematic? I wasn’t privy to the group’s discussions so all I can offer are some guesses:

1. Flexible wording. As Megan puts it, “the authors of the statement could come to feeling okay with the sentence through the individual disclaimers they carry in their own minds, but those disclaimers are invisible to readers.” For example, a statement such as “thresholds are helpful when actions are required” is vague enough that no one could disagree with it—but in practice this sentence can imply endorsement of statistical significance thresholds in scientific work. Similarly, who can disagree with the statement, “Analyzing data and summarizing results are often more complex than is sometimes popularly conveyed”? And a statement such as “P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data,” has that clause “when properly applied and interpreted,” which covers whatever you want it to mean.

2. The goal of consensus. This may be a matter of taste. I agree with Megan when she says, “We need to see the disagreements and discussion more than we need to see a pretty unified statement that doesn’t acknowledge the inherent messiness and nuances that create the problems to begin with.”

3. Committee dynamics. When I was on that ASA committee a few years ago, I felt a lot of social pressure to come to an agreement and then to endorse the final statement. I can be a stubborn person sometimes, but it was hard to keep saying no, and eventually I gave in. This wasn’t a terrible decision—as noted above, I agreed with most of the 2016 report and still think it was a good thing—but it’s awkward to be part of a group authoring a statement I even partly disagree with.

So, yeah, I understand that committees require compromise; I just think compromise works better in politics than in science. To compromise on a scientific report you need to either include statements that people on the committee disagree with, or else write things so vaguely that they can be interpreted in different ways by different people. I think the new committee report fails because I disagree with a lot of those interpretations. I wish the task force had included some more stubborn people, which could’ve resulted in a document that I think would’ve been more useful to practitioners. Indeed, the presence of stubborn disagreement on the committee could’ve freed all the creative people already in the task force to more clearly express their differing views.

The big picture

I hate to end on a note of disagreement, so let me step back and recognize the good reasons for this sort of consensus statement. We’re living in an age of amazing scientific accomplishments (for example, the covid vaccine) and amazing statistical accomplishments of so many sorts. Statistical tools of all sorts have been valuable in so many ways, and I agree with these statements from the committee report:

Capturing the uncertainty associated with statistical summaries is critical.

Dealing with replicability and uncertainty lies at the heart of statistical science. Study results are replicable if they can be verified in further studies with new data.

The theoretical basis of statistical science offers several general strategies for dealing with uncertainty.

These are not just three sentences pulled out of the report; they’re the first three statements in bold type. They’re 3/5 of what the report wants to emphasize, and I agree with all three. More broadly, I agree with an optimistic take on statistics. I have a lot of problems with null hypothesis significance testing, and I think we can do a lot better, but in the meantime scientists have been able to use these tools to do great work. So I understand and respect the motivation for releasing a document that is positive on statistics and focuses on the consensus within the statistics profession. This gets back to Goal #1 listed above. Yes, there are serious foundational controversies within statistics. But, no, statistics is not in a shambles. Yes, people such as the beauty-and-sex-ratio researcher and the ovulation-and-voting researchers and the critical positivity ratio team and the embodied cognition dudes and the pizzagate guy and various political hacks and all the rest have misunderstood statistical tools and derailed various subfields of science, and yes, we should recognize ways in which our teaching has failed, and, yes, as Megan says we should be open about our disagreements—but I also see value in the American Statistical Association releasing a statement emphasizing the consensus point that our field has strong theoretical foundations and practical utility.

There’s room for difficult people like Sander, Megan, and me, and also for more reasonable people who step back and take the long view. Had I been writing the take-the-long-view report, I would’ve said things slightly differently (in particular removing the fourth and fifth bold statements in their document, for reasons discussed above), but I respect their goal of conveying the larger consensus of what statistics is. Indeed, it is partly because I agree with the consensus view of statistics’s theoretical and practical strengths that I think their statement would be even stronger if it did not tie itself to p-value thresholds. I agree with their conclusion:

Analyzing data and summarizing results are often more complex than is sometimes popularly conveyed. Although all scientific methods have limitations, the proper application of statistical methods is essential for interpreting the results of data analyses and enhancing the replicability of scientific results.

OK, I wouldn’t quite go that far. I wouldn’t say the proper application of statistical methods is essential for interpreting the results of data analyses—after all, lots of people interpret their data just fine while making statistical mistakes or using no statistics at all—but that’s just me being picky and literal again. I agree with their message that statistics is useful, and I can see how they were concerned that that 2016 report might have left people with too pessimistic a view.

78 thoughts on “Thoughts on ‘The American Statistical Association President’s Task Force Statement on Statistical Significance and Replicability’”

  1. I agree with you. (A consensus of 2!)

    Does your feeling about committee-driven compromise to create dubious “scientific consensus” extend to the IPCC and the CDC and the FDA? If so, what is the alternative for the non-subject matter expert? Pick a particular expert you trust? Try to become expert enough to make your own judgment? When should you trust one of these consensus opinions, which are entirely social and not at all science?

  2. Isn’t any committee statement on an official position political by nature? Unless maybe it is some sort of meta science, but still, I am not sure how any statement, regardless of content, wouldn’t be political. Which is not to say it can’t be of value.

    • Matt:

      It depends. It would be possible to form a committee and then produce a report expressing disagreement as well as agreement. But I agree that committees are typically political. How about this: committees serve two purposes. One purpose is to get a bunch of people with different perspectives to talk with each other, share their views, and see what happens. Another purpose is to get a bunch of people with different perspectives together to make a difficult decision. A scientific committee report could serve one or both of these purposes. One difficulty is that when the report serves the second purpose, this is not always made clear. An outsider might read the report and take it to actually represent the views of the authors, rather than representing a political statement.

  3. “Thresholds are helpful when actions are required.”
    I don’t see how they can say this. It has no place in the research – decision makers should supply the thresholds, if they are to be used, and preferably after some appropriate cost benefit analysis. Why on earth would the researcher need to supply the threshold? It can be useful for the researcher to provide an analysis of how different thresholds would affect things of interest, but the statement is far too vague and easily suggests that it may be appropriate for the researcher to use thresholds. I think that practice is counterproductive (and contributes to headline-grabbing poor research practice).

    • Andrew and you are both missing an essential piece here. The paragraph that starts with “Thresholds are helpful” contains this: “If thresholds are deemed necessary as a part of decision-making, they should be explicitly defined based on study goals, considering the consequences of incorrect decisions.”

      Seems to me that the criticisms of this part are largely misplaced. My criticism would be that the “actions” and “required” are woefully undefined. I would have included a phrase like ‘in the rare occasions where a definitive and final decision has to be based on a single experimental result’.

      • Michael:

        Thanks for pointing this out; I’d missed it. I still disagree with the statement, “Thresholds are helpful when actions are required,” but I agree that this additional sentence helps.

  4. The blog commentary points to how useful it is to understand the sociology of experts in addition to the quantitative/qualitative research. I’m always puzzled when the best thinking is marginalized in the interest of building ‘consensus’. It’s critical, as you all emphasize, that audiences should be exposed to the back channel discussions. Which is why blogs like Andrew’s are so necessary.

    Maybe Andrew can include some roundtable fora as a feature of his blog. Deborah Mayo features a forum monthly. I’ve learned a great deal from it.

  5. Thanks for providing the “big picture” ending to your post – I agree with it being a good way to wrap up and it provides a perspective that is more productive than I conveyed. The words contained in the statement are important, but so is the “long view.”

  6. This is just a shameless whitewash job. It’s like saying “the fact that someone was willing to assassinate the president of Haiti is not really evidence that there’s a real serious problem in Haiti, all places have problems, we should keep doing what we’ve been doing, maybe try a little harder here and there”

  7. I’m a little confused about how p-values are supposed to be used to express uncertainty. It’s not that I can’t see how they potentially could be. But as I look around I see almost universal use of them to sweep uncertainty away and claim unwarranted certainty.

  8. Interesting.

    I think what’s most interesting is the focus on analytical methods. From what I can see the big problem in social sciences continues to be *research methods*

    • Accidentally posted!

      No change in statistical methods could have saved most of the poor work that’s been discussed here. The problem is that the experiments are bad to begin with. When the experiment is meaningless, does the analytical technique even matter?

  9. I’m not sure of the background here. From what I’ve gathered, the ASA commissioned a committee to draft a report on p-values that was approved by the Executive Committee and published in 2016. In 2019, the Executive Director (but not speaking in that capacity) wrote an editorial that explicitly went beyond the report to call for the end of p-values; it appeared in an issue of an ASA journal along with other articles making similar claims. However, the editorial never claimed it was reporting a change in ASA policy. Still, the outgoing President of the ASA was worried that this editorial was taken as ASA policy, so then commissioned another committee on Statistical Significance that (seemed designed to) counter the 2019 editorial and other articles, and the report (along with an editorial by the outgoing president) was published in a non-ASA journal. As far as I can tell, this statement (unlike the 2016 statement) wasn’t approved by the Executive Committee. Its status is thus a bit unclear to me: is it more like the 2016 statement or more like the 2019 editorial?

    Is this account correct, and does it strike anyone else as a bit strange?

  10. This blog post invites confusion about the purpose of the 2021 Task Force statement appearing in the Annals of Applied Statistics, which was concerned only with the misperception that the 2019 editorial by Wasserstein, Schirm & Lazar at
    https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913
    was an official ASA policy statement. The 2016 ASA Statement on P-values starting at p. 131 here
    https://amstat.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108
    was indeed an official ASA statement and was not even cited by the 2021 Task Force statement or its accompanying editorial by ASA president and Annals of Applied Statistics Editor-in-Chief Karen Kafadar. Yet the present blog post makes it sound as if the 2016 statement was the main concern.

    The post also misses the political intrigues and motivations behind the composition of the 2021 statement and its accompanying editorial. As I wrote to Higgs, the Task Force statement is a document designed to defend statistical significance. Having 15 authors no doubt helped keep it bland, but the accompanying editorial seemed exceptional for its carelessness and lawyering unconstrained by any sense of fairness. Example: It cherry picked from the Opinion of the U.S. Supreme Court in the famous Matrixx case; see the full Syllabus and Opinion at
    https://www.law.cornell.edu/supct/pdf/09-1156P.ZS
    https://www.law.cornell.edu/supct/pdf/09-1156P.ZO
    These documents have many statements which should have been cited if any were, starting with “It held that the District Court erred in requiring an allegation of statistical significance to establish materiality, concluding instead that the complaint adequately alleged information linking Zicam and anosmia that would have been significant to a reasonable investor.” See especially p. 1-3 of the lead Syllabus and then p. 1 and 8 onward of the Opinion of the Court; this was BTW a rare unanimous 9-0 opinion of the Roberts Court; note that everyone from Antonin Scalia to Sonia Sotomayor signed on to those passages. When I read their Syllabus and Opinion I thought “these justices are smarter than many high-ranking statisticians”, as it left no doubt they weren’t taken in by the very high-powered corporate defense lawyers and prestigious statistical experts arguing for the pivotal importance of statistical significance.

    On top of its selective quotation, the Annals editorial claims “The Task Force was intended to span a wide range of expertise, experience, and philosophy” – No it wasn’t: As is clear from the start as well as from its content and focus on law, it was cherry picked to ensure that statistical significance would be vindicated for the sake of the corporate defense firms that have taken a beating from the Matrixx decision. The result is the most biased presentation imaginable given the need to have enough prestigious names onboard to maintain a facade of a “wide range” while carefully excluding critics of statistical significance (e.g. Wasserstein, Schirm and Lazar). For significance critics, the Task Force Statement and accompanying editorial are nothing more than political tracts, and would be more aptly named “The Empire of Statistical Significance Strikes Back”.

    • Sander, You would have made a great prosecuting attorney. Truly awe-inspiring argumentation you pose sometimes.

      I am aligned with your thinking about the blog post. But as a non-expert, I was reluctant. I am more likely to show boldness in the foreign policy arena. I figured you would state your viewpoint so brilliantly.

    • Jack,

      I couldn’t find a means to respond to your post. As Sander pointed out, experts blame their students and the public at large for not understanding statistical concepts. This blame-game is so disingenuous and downright asinine now, in light of the discussions I’ve observed for the last 6 years in particular. I don’t see how such confusion can go on for another 50 years let alone for another 5.

      • Sameera,

        Fully agree. So much of what is accepted as true is just w-r-o-n-g wrong. When I read about the defense of the NHST paradigm or string theorists suggesting falsifiability isn’t necessary or mathematicians discussing the Mochizuki scandal one minute and denigrating automated proof solving the next, it sometimes feels like I’m back in kindergarten being told to touch the sour granny smith apple to the sides of my tongue and the sweet red delicious apple to the tip because, well, *everyone* knows that’s how tastebuds work—despite even the intense power of suggestion, 5 year-old me never could taste the difference.

        And in some ways, the power dynamic is still the same. To the veterans of the ivory tower, I, merely someone with a BS in Stats, am the naïve 5 year-old boy. I am hopeful because of the progress of the last decade in the wake of the ESP paper and Ioannidis’s (and Andrew’s!) outreach, but am similarly pessimistic in the short-to-medium term because I’ve seen the obstinance up close, even at the math & stats departments at a top institution from professors I knew should know better.

        • Thank you so much for responding. I loved your phrasing of your experiences. You don’t come across as someone with merely a BS in Stats. I too am concerned that another 10 to 20 years will pass in unresolvable debate.

          I wonder whether we have just become excessively argumentative.

    • It is correct to point out that the Task Force report concerns Wasserstein (2019), not the 2016 ASA Report. It’s hard to understand Gelman thinking it concerned replacing the 2016 statement, given his concern in this earlier post:
      https://statmodeling.stat.columbia.edu/2020/02/21/whats-the-american-statistical-association-gonna-say-in-their-task-force-on-statistical-significance-and-replicability/

      The ASA charged the Task force to write a report without “leaving the impression that p-values and hypothesis tests—and, perhaps by extension as many have inferred, statistical methods generally—have no role in ‘good statistical practice’.”
      https://magazine.amstat.org/blog/2019/12/01/kk_dec2019/

      Reference to the Matrixx case, which is not in the Report but is cited in Kafadar’s editorial, has indeed been exploited to claim a Supreme Court ruling against statistical significance tests, simply because Sotomayor points out the obvious fact that evidence of harms worth reporting by a company (which hid the existing lawsuits) can be found in anecdotal data outside of formal statistical tests. However, it’s wrong to suppose that this obvious fact, actually irrelevant to the Matrixx ruling, indicts statistical significance tests.
      See “distortions in the court”
      https://errorstatistics.com/2012/02/08/distortions-in-the-court-philstock-feb-8/
      and Appendix B of my “P-values on trial”.
      https://hdsr.mitpress.mit.edu/pub/bd5k4gzf/release/3

      Unfortunately, the Matrixx ruling has been used to defend illicit p-values and data dredging in the face of nonsignificant results as with Harkonen (discussed in P-values on trial).
      (Kafadar, I assume, did not have the full story.)

  11. There are three kinds of statistician. Those for whom statistics is their primary field; those who are experts in statistics as applied in a social science field, often called methodologists or quants; and those who are experts in a social science but consume and use statistics (mostly) heuristically. That last group is by far the largest. So when we talk about a consensus on statistics, we aren’t really talking about a consensus among statisticians, or even among statistical experts. My personal observation has been that the first two groups do not find p-values controversial, only debating the necessity and extent of indulging the expectations of others, such as non-expert stats users, journal editors/reviewers, and funding agencies. That debate is far from trivial: we know whence our scientific credibility, not to mention paychecks, come.

    I can only really speak for the second group, being a quant in psychology and education. I think, as a group, we tend to be self-conscious about our field. We often have to convince non-experts of points we know to be logically and empirically true, but that happen to be inconvenient. Our presence in a project or study sometimes feels like a box being checked in a grant application. We are frequently consulted for advice, but that advice receives about the same weight as “this is how our lab has always done it.”

    I think I can see some of that self-consciousness in Andrew’s post. Is there another empirical field where a statement of this sort, which eschews making factual statements unless they are unfalsifiable, would elicit effusive praise for merely acknowledging that the field is…useful? How likely is a medical doctor to say something like: “I respect [the committee’s] goal of conveying the larger consensus of what medicine is…” or “So I understand and respect the motivation for releasing a document that is positive on medicine and focuses on the consensus within the medical profession.” or “…but I also see value in the American Medical Association releasing a statement emphasizing the consensus point that medicine has strong theoretical foundations and practical utility.” or “Medical tools of all sorts have been valuable in so many ways.” If you prefer, replace “medicine” with “engineering” or any other applied quantitative field.

    Seriously: Statistical tools of all sorts have been valuable in so many ways. Is this really coming from the same guy who wrote the previous post about bad journals? Were I an editor of one of those journals, my response now would be: “You know, Andrew, editorial policies of all sorts have been valuable in so many ways.”

    There are two statistical paradigms in the social sciences right now. One is predominant and the other is correct. At some point in the future, hopefully, a paradigm shift will occur and what is correct will be predominant. This post, even more than the ASA report, indicates that point is not near.

    • Michael Nelson said, “How likely is a medical doctor to say something like: “I respect [the committee’s] goal of conveying the larger consensus of what medicine is…” or “So I understand and respect the motivation for releasing a document that is positive on medicine and focuses on the consensus within the medical profession.” or “…but I also see value in the American Medical Association releasing a statement emphasizing the consensus point that medicine has strong theoretical foundations and practical utility.” or “Medical tools of all sorts have been valuable in so many ways.” ”

      I’ll take this as an entree to bring in what I had been doing just before looking at this blog: namely, a few minutes ago I saw the following on ASA Connect:

      “Acorn AI by Medidata has an opening for a Statistician, Manager to join our growing Synthetic Controls Team. This is an innovative position focused on the creation of external controls for enhancement of pharmaceutical clinical trials for regulatory and non-regulatory applications. Interested candidates are invited to get more information and submit an application using the following link”. Out of a combination of curiosity and skepticism, I clicked on the link. The job description started,

      “Your Mission:

      Provide statistical expertise for the development and support of innovative approaches to optimize the conduct and science of clinical trials, including use of synthetic control arms.”

      Not knowing what “synthetic control arms” means, I looked it up, and found this link: https://www.statnews.com/2019/02/05/synthetic-control-arms-clinical-trials/

      Well, this looked like it might be an even bigger can of worms than p-values. I’d be interested in what people think of it.

      PS: Questions regarding clinical trials have been on my mind lately because my (new) primary care physician is urging me to take medication claimed to improve vitamin D blood levels — but there seem to be lots of questionable aspects to the research on it.

      • Martha,

        I think this is both an interesting philosophy of statistics question, and a potential minefield in terms of actual implementation.

        On the positive side, the idea is both simple and elegant: why does every study need its own “control group”? What if instead we had a large pool of donor “control” observations and researchers could pull from that pool a sample that fits the inclusion criteria of their trial? This seems like an obvious way to save time, money, and experimenting on people (which I generally take to be a negative, even if sometimes a small one – less experimenting on people is (weakly) better for society).

        On the negative side: I have two major concerns. First, unlike recruiting for a double-blind study, the selection-into-Control mechanisms are likely to be much different than selection-into-Treatment. This makes comparing “apples to apples” more difficult, regardless of which selection mechanism leads to one group looking more like the overall population than the other. Second, a whole lot of clinical trials have really narrow enrollment criteria, and unless this synthetic control pool is absolutely huge, you might end up with very few people serving as the comparison for a whole bunch of similar trials. It takes out half of the randomness in sampling, essentially fixing a control group (that is really a sample) and so if that control group is weird in some way (through luck or selection) then you can contaminate a really large bit of literature.

        I do think there are places where having some super-sample that you use repeatedly as a control group might make some sense, but it isn’t clear to me that clinical drug trials are the place that makes the most sense. In general, a whole lot of labor Economics uses the same observations over and over (e.g. the CPS), so it isn’t like we don’t already do something like that, it’s just not formalized in an experimental context. But it is routinely done (but not thought about so much in this way) in observational/quasi-experimental research using large labor market or demographic surveys.
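
        A rough sketch of the “pull from a donor pool by inclusion criteria” idea might look like the following (column names and criteria are invented for illustration); note that filtering on criteria does nothing about the different selection-into-pool mechanism, which is the first concern above.

        # Hypothetical sketch: draw a "synthetic" control arm from a donor pool by
        # applying a trial's inclusion criteria. All names and numbers are made up.
        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(2)
        donor_pool = pd.DataFrame({
            "age": rng.integers(18, 85, 5000),
            "baseline_severity": rng.normal(50, 10, 5000),
            "prior_treatment": rng.integers(0, 2, 5000),
        })

        # Invented inclusion criteria for the hypothetical trial.
        eligible = donor_pool[
            donor_pool["age"].between(40, 75)
            & (donor_pool["baseline_severity"] > 45)
            & (donor_pool["prior_treatment"] == 0)
        ]
        synthetic_control = eligible.sample(n=min(100, len(eligible)), random_state=2)
        print(len(eligible), "eligible donors;", len(synthetic_control), "used as controls")
        # Matching observed criteria cannot fix unobserved differences in how people
        # ended up in the donor pool versus in the trial's treatment arm.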

      • Synthetic controls obviously apportion the entire placebo effect to the drug being tested. Depending on your feelings about the placebo effect, that’s a good idea or a terrible one.
        In general, though, the clearer the markers of effect, the more plausible synthetic controls become. If you have a tumor that has never spontaneously gone into remission, or does so at a known low level, a drug which causes a higher level of remission probably doesn’t need a real control. But for things that relieve symptoms or lower pain, I don’t see how you get around real controls.
        To your PS, my primary care doctor did the same thing — I think it’s a common thing now. And I did some research and, yes, the evidence is equivocal. But the cost seems really, really low relative to the potential benefits, so the p-values don’t matter much. (I would note, the Dr. eventually lowered my Vitamin D supplementation as the levels rose.)

        • Thanks for your reply, Jonathan. In my case, my vitamin D blood levels are in the range (above 20 ng/ml) that a 2011 Institute of Medicine report (see https://www.health.harvard.edu/staying-healthy/how-much-vitamin-d-do-you-need) considers adequate for maintaining healthy bones. Also, the report says, “People who should consider vitamin D testing are those with medical conditions that affect fat absorption (including weight-loss surgery) or people who routinely take anticonvulsant medications, glucocorticoids, or other drugs that interfere with vitamin D activity,” and I don’t fit into any of those categories — I’m just short with BMI at the low end of normal weight.

        • Martha: If you are considering vitamin D supplementation, critically research the literature on recommendations including
          1) Use only daily oral supplements – bolus injections or infrequent massive oral doses have the worst track record. But they have to be taken with fats.
          2) Some recommend D needs balance with K2 supplementation, preferably the MK7 form in the range of 50 mcg K2 per day per 1000iu (25mcg) per day of D3.
          3) Some advise no more than 2000iu/day of D long term because benefits seem to level off after 1000iu and toxicity concerns have arisen with as little as 5000iu/day; yet some sources are pushing 5000 iu/day and claim it’s safe “long term” despite lack of truly long-term (decade+) trials. And if you get a lot of sun without sunscreen or drink a lot of milk you might factor that in (in the U.S. milk is usually vitamin-D fortified).
          4) Use D3 (not D2) – that’s old advice which I throw in for completeness.

        • Martha, Sander

          The D-Minder app is very helpful for sun exposure for Vitamin D. I had my D levels checked. I have been in the mid-50s range by sunbathing for about an hour daily since March. Not sure whether Vitamin D fortified foods are as beneficial as sun acquired Vitamin D. Actually it is a hormone.

        • Yes, I have been looking a lot at the literature, ever since my physician prescribed a vitamin D regimen, and there is a lot that seems iffy. I am in the process of writing something to give to her that says why I am not taking the medication she prescribed (ergocalciferol capsules — D2). There are lots of reasons why I think it would be irrational to take it. The large dosage is one thing that struck me when I picked up the prescription — this seemed iffy to me to begin with, especially since I am on the small side. When I found out that clinical trials for the drug had not had enough over-sixty-five patients to make inferences for them, I shook my head even more. And, to top off the inanity, the capsules have a dye in them (tartrazine) that is a common allergen. The package insert does say do not take if you are allergic to the dye, but that presupposes that you know that you are allergic to it.

        • I would not take those capsules. Even I am cautious as to the claims made about Vitamin D from sun exposure. I find that I get very good sleep when I am outside for several hours each day. It is claimed that sun exposure increases serotonin levels and is implicated in regulating melanin production which aids in sleep.

          It is suggested that individuals with fair skin need only 5 to 15 minutes several times a week. So perhaps you can experiment and then have your D levels checked.

          I personally would not take more than 2000 units of a D supplement.

        • In response to Sameera: I have, for the past couple of months, been spending an hour or more about three times a week outdoors in partial sun/partial shade cutting and bagging dead wood from shrubs that died back during a big freeze we had this past winter. I have been wondering if this would be enough to raise my low vitamin D blood levels. However, I have heard that one’s ability to make vitamin D in response to sunshine declines with age. I am scheduled for another blood test in a couple of months, so hope that it will show vitamin D blood levels higher than the last test.

        • Hi Martha,

          RE: In response to Sameera: I have, for the past couple of months, been spending an hour or more about three times a week outdoors in partial sun/partial shade cutting and bagging dead wood from shrubs that died back during a big freeze we had this past winter.
          —–
          I have gleaned from many articles that one should expose at least 50-60 % of one’s skin to increase levels. I usually expose about 70-80 %.

          Moreover, I have read that one needs longer exposure as one ages. Surprisingly, for someone above 60-65 years, it is suggested at least an hour or more. The more skin you expose the less exposure time is needed.

          I would get a baseline Vitamin D test now. Then re-test at the end of August. Then you also have to consider what your strategy will be in late fall and winter. At least for a year or so, I would get tested every 3 or 4 months. In the winter, a safe D dose is about 2,000 units. Again, older adults may need more for optimal levels.

          Some suggest that you consult with your doctor. But that is helpful only when the doctor knows a lot about Vitamin D. I follow the advice of Dr. Neal Barnard, founder of the Physicians Committee for Responsible Medicine, and Dr. Dean Ornish, a founder of the Lifestyle Medicine Movement.

      • It is at least as big a can of worms as p-values. In addition to almost certainly assuring that comparisons will be made between groups that start out different in material ways, it is obviously easy to do this in a way that specifically manipulates things to produce desired results. This whole thing is being pushed by anti-regulatory ideologues and by the pharma industry. The industry has migrated from its century-ago business model of putting snake-oil in bottles and selling it to the public, to a new business model of using snake-oil analytic methodologies and selling the results to the FDA.

        Yes, standard RCTs are expensive and do delay the release of new drugs. And dislike of being assigned to placebo is, indeed, a barrier to subject recruitment. If most, or even many new drugs were truly of breakthrough quality, that might really matter. But most new drugs, let’s remember, are minor molecular variants of existing drugs, and their major “benefit” is that they can win a patent. Moreover, there are statistically sound alternatives to deal with these problems. Bayesian sequential designs that update the allocation ratio as the trial progresses both reduce total expected sample size and exposure of participants to the inferior treatment.
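
        For concreteness, here is a toy version of that kind of Bayesian adaptive design (a Thompson-sampling sketch with made-up response rates, not a regulatory-grade protocol): the allocation ratio drifts toward the better arm as data accrue, so fewer participants end up on the inferior treatment.

        # Toy Thompson-sampling trial: assignment probabilities update as outcomes
        # arrive, reducing exposure to the inferior arm. Response rates are invented.
        import numpy as np

        rng = np.random.default_rng(3)
        true_rate = {"control": 0.30, "treatment": 0.45}
        successes = {arm: 0 for arm in true_rate}
        failures = {arm: 0 for arm in true_rate}
        assigned = {arm: 0 for arm in true_rate}

        for patient in range(400):
            # Draw from each arm's Beta posterior and assign to whichever draw is larger.
            draws = {arm: rng.beta(1 + successes[arm], 1 + failures[arm]) for arm in true_rate}
            arm = max(draws, key=draws.get)
            assigned[arm] += 1
            if rng.random() < true_rate[arm]:
                successes[arm] += 1
            else:
                failures[arm] += 1

        print("patients per arm:", assigned)   # typically far fewer on the inferior arm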

        If the FDA is sincerely interested in speeding up the approval of major breakthrough drugs, rather than lowering the bar to accept potentially fraudulent study designs, they might consider things like not taking the weekend off when reviewing emergency use authorization for a vaccine whose clinical trials were complete and reported while thousands of Americans were dying of Covid-19 every day. Or they might try to learn how to figure out whether more lives can be saved with a somewhat less accurate test that is widely available and accessible than a “gold standard” test that cannot be sufficiently scaled up. Just sayin’.

        • I just Zoom-attended a talk on synthetic data sponsored by my local ASA chapter. The talk outlined a method for selecting synthetic “control groups” for clinical trials. Not surprisingly (to me at least), the emphasis was on the mathematical models, but with little discussion of what values of parameters, etc., to choose for the models that would reflect what factors are important for the question being studied. I think the slides are going to be posted; I’ll try to give a link to them if I can.

      • The high road would be the use of multiple high-quality (occasionally audited) data sources (such as https://www.ohdsi.org/) to inform the design of RCTs, to supplement them carefully when appropriate, and even to replace concurrent controls where and when they are very problematic (few patients, time delays, or high costs).

        The low road would be an investigator who gets likely inaccurate and incomplete data from a single hospital/institution and does a quick and dirty stepwise regression analysis.

        I believe the high road should not be overlooked.

      • As I well know — I’ve served on a few Ph.D. committees of students in biology and engineering. One biology committee in particular I well remember: The student’s advisor suggested that the student should use a different statistical methodology in one statistical situation. The student, I, and another committee member all agreed that the method the advisor suggested was inappropriate. Wow was her face red!

  12. An experiment is performed. What were we looking for? Did we find it or not?

    [A] If we did not find what we expected to find we’d scarcely be tempted to throw in the towel just then and there, but rather to do the experiment again and again under tighter and tighter control until either we find what we were looking for or we change our minds about what we thought ought to be the case. That is what happened with Michelson-Morley and its lineage. This is the case of surprise.

    [B] What if, on the other hand, we do find what we were looking for? Well, then, we do the experiment again under tighter control, in order to refine our estimate of that very thing we expected to find (and did find). This is the case of confirmation; say we measure the decay time of muons to shore up support for the time-dilation model in relativity.

    What if we don’t actually know what we are looking for? [C] Keep good books on what is seen and what is not seen until it seems worthwhile to set up an experiment, to go looking specifically for that thing (or for its absence).

  13. Statisticians routinely use capital letters when discussing random variables, and lower case letters to indicate particular realizations of those random variables (X versus x), but they inadvertently and routinely make one important exception to this practice: the P-value (or p-value). This article used both conventions. I wonder why. If statisticians would consistently use “P-value” when discussing them in general, it would help drive home the fact that P-values are random variables with their own sampling distributions. Widespread awareness of this fact would be a good thing. Scientists routinely take a p-value (lowercase, the realization of a random variable) and then behave as if it had large sample generality, when in fact it does not. It is a realization of a random variable, and it’s not very robust at that. (Bootstrap the data and see the distribution of p-values if you’re not convinced.) Following the convention of X versus x should not have P-values as an exception. The ASA should note this.
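
    A minimal version of the bootstrap exercise suggested here (simulated data, so purely illustrative) shows how widely the p-value moves around the single realized value:

    # Bootstrap the data and look at the spread of p-values, as suggested above.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    data = rng.normal(0.4, 1.0, 30)        # one simulated sample

    observed_p = stats.ttest_1samp(data, 0.0).pvalue
    boot_p = np.array([
        stats.ttest_1samp(rng.choice(data, size=len(data), replace=True), 0.0).pvalue
        for _ in range(2000)
    ])
    print("observed p:", round(observed_p, 3))
    print("bootstrap p-value quartiles:", np.round(np.percentile(boot_p, [25, 50, 75]), 3))
    # The realized p is one draw from a wide, skewed distribution; treating it as
    # having large-sample generality ignores that variability.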

    • James:

      I prefer lower case for everything. But that’s because, as a Bayesian, I don’t distinguish random variables from variables more generally. In the Bayesian world, if I call something y, then if I know it, it’s a number, and if I don’t know it, it has a probability distribution. I agree with you about p-values being random variables, and I think that comes naturally from the Bayesian perspective.

        • I find the tradition of using uppercase/lowercase useful for teaching because we often need to distinguish precisely between the name or concept of the variable X (e.g., age, height, weight) and a particular yet unspecified value x it can take (e.g., in years, in cm, in kg), e.g., to make sense of distributions and conditioning. Much confusion I saw among students reading stat textbooks and methods papers came from notation and language unclear about which category an expression was referring to (some claimed Fisher’s “fiducial inference” was an example!). For example, the mean function f(x) = E(Y|X=x) is the “regression of height Y on age X,” not “the regression of height Y on x years,” as the shorthand E(Y|x) makes it look.
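
          A small numerical illustration of the distinction, with invented data: the regression is a whole function of the variable age, and E(Y | X = x) is its value at one particular but unspecified age x.

          # The regression is of height Y on the variable age X; E(Y | X = x) is that
          # function evaluated at a particular age x. Data are invented for illustration.
          import numpy as np
          import pandas as pd

          rng = np.random.default_rng(5)
          age = rng.integers(5, 18, 1000)                       # the variable X, in years
          height = 80 + 6 * age + rng.normal(0, 5, 1000)        # the variable Y, in cm

          regression = pd.DataFrame({"age": age, "height": height}).groupby("age")["height"].mean()
          x = 10                                                # one particular value of X
          print("E(Y | X = 10) is about", round(regression.loc[x], 1), "cm")
          # "regression" is the whole conditional-mean function; regression.loc[x] is
          # its value at the particular age x.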

        • E(y|x) as instantiated with E(50 kg|60 years) – a notation guaranteed to confuse those without a degree in a math science or stats. But then, obscurantist notation does help make those with such degrees look like they have powerful insights, much as did the symbolic incantations of ancient priests.

          Regardless, E(y|x, beta) is less general than E(y|x) since the regression concept is nonparametric [yes I know some Bayesians have a very hard time with nonparametric concepts, especially since they show how fully Bayesian parametric approaches can break down spectacularly, e.g. see Ritov, Bickel, Gamst, Kleijn. The Bayesian Analysis of Complex, High-Dimensional Models: Can It Be CODA? Stat Sci 29, 619-639, 2014].

        • Sander:

          1. I’m sorry if my notation is guaranteed to confuse people. My first goal in writing notation, even for textbooks, is to help me work things out. I’m my own primary audience. When it comes to doing Bayesian algebra, I think our notation works much better than notation I’d seen before. To me, the notation p(y|theta) = normal(y | mu, sigma) works very well, and we were able to put it directly into Stan, which is used by lots of people without a degree in a math science or stats. But no notation is perfect; all have their strengths and weaknesses.

          2. Indeed, in my comment above, I was originally going to write E(y|x,theta) and then say that theta represents the vector of all parameters in the model. I wrote E(y|x,beta) because I figured it would be clearer to refer to the simple linear model. But it’s fine with me to just use the more general E(y|x,theta), and theta could even be infinite-dimensional.

        • Fair enough, apologies for being so acerbic…
          I have no doubt you know exactly what you are doing for your own work and the stat students you target. But my concern is with the very nonquant students like those I had to help understand stat books and article sections. For them I see no analogy between E(y) and p(y) and there is a crucial distinction your notation blurs:
          It makes sense to denote the probability density for Y at 50kg by p(50kg), [although advanced probability and stat also considers derived variables like p(Y) where Y is now “weight”, not a particular weight]. In contrast, E(50 kg) is 50 kg regardless of conditioning and is what is intended by use of E(): math expectation by definition refers to averaging over a distribution of weight, which makes E(Y) = 50kg a nontrivial statement about a property of that distribution of weight, not a property of a particular weight (that’s true even if that distribution may be degenerate, e.g., with all mass at 50kg).

          I know you aren’t one of them, but I’ve seen how much some statisticians love to dismiss these fine points, and for that I hold them accountable for much of the confusion I see among stat users (especially when those users try to teach or explain their methodology). Worse, some statisticians “blame the victims” for the confusions, faulting them for not adapting to frankly incoherent notation coupled with misleading terminology. If a company selling a product line was as insensitive to customer limitations and needs they’d go under fast.

          So, how do you notationally express the difference between a variable and a particular but unspecified value?

        • I think you’re conflating two ideas. One is that Y is like a bucket of values out of which you can pick a quantity repeatedly, and then y is the particular value you just picked.

          But the other view is that y is always just one value, but we don’t know what the value is and hence we need a name rather than being able to simply directly replace it with the number itself. And when we have repeated measures, we need **different** variables to name them.

          If I measure your weight 3 days in a row then I have y1,y2,y3 not Y which takes on the values y1,y2,y3

          Then the probability that the first measurement will take on a given value named y given that we know what some other variable x was is

          p1(y|x)

          of course if we include the fact that we actually have already measured y1, we’d get

          p1(y|x,y1)

          which would be a delta function around the value of y1

          what about the second measurement y2?

          p2(y|x) which will sometimes but not always be the same function as p1 (ie. that’d be the “IID sampling” case)

          When we do have iid sampling, then the p1, p2, p3 etc are all the same function and it’s simplest to simply call it p, leading to

          p(y|x)

          What I see is that treating the notion of a “random variable” and “repeated sampling” as primary, and hence the distinction between a capitalized variable Y which represents a “random variable” and a realization “y” which represents a “number”, is the more damaging idea, as it excludes Bayesian interpretation entirely.

          The interpretation from computer science that p(y|x) represents a function p(.|x) where the value we put in the . location is given a name y for purposes of reference in the definition of the function… This makes more sense to me.

        • Sander:

          In answer to your final question: you can see how we do it in our books. The short answer is that from a Bayesian perspective there is no difference between a variable and a particular but unspecified value.

          Here’s what we wrote in BDA chapter 1; these are the words of John Carlin:

          Different distributions in the same equation (or expression) will each be denoted by p(.) . . . Although an abuse of standard mathematical notation, this method is compact and similar to the standard practice of using p(.) for the probability of any discrete event, where the sample space is also suppressed in the notation. . . .

        • In fact, the capital Y notation is really confusing in my opinion because it pretends that a thing is a “variable” when in fact it’s really a function

          y[i] = Y(i) where i is a positive integer. This is the proper way to think about “sampling”, as when we give different values of i we find different values of y. Mathematicians like to obscure this with their whole “omega” notation, and abstract “space of randomness” but I think that’s just a mistake. In any case, the distinction between y and Y is very clear when we see that y[i] is just the value we get back from taking the i’th sample by calculating Y(i)
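
          A toy rendering of this “Y is a function of the index” view, with a seeded generator standing in for the abstract omega (purely for illustration):

          # Sketch of the "y[i] = Y(i)" view: the i-th draw is just the value of a fixed
          # function of the index i (here implemented with a seeded generator).
          import numpy as np

          def Y(i):
              # The "random variable": a deterministic map from the index i to a number.
              return np.random.default_rng(i).normal(0.0, 1.0)

          y = {i: Y(i) for i in (1, 2, 3)}  # the realizations y[1], y[2], y[3]
          print(y)
          print(Y(2) == y[2])               # True: same index, same value, nothing left to chance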

        • Sander said:

          > “blame the victims” for the confusions, faulting them for not adapting to frankly incoherent notation coupled with misleading terminology.

          I would argue this is *the* concise explanation for the failure of the last century of mathematics pedagogy. The pedagogical equivalent of “why don’t the starving peasants just eat cake?” (my hero as a math pedagogy curmudgeon, VI Arnold, lays a lot of the blame for this on Bourbaki at https://www.uni-muenster.de/Physik.TP/~munsteg/arnold.html).

          That mathematics and statistics educators will fiercely argue that the emperor’s nonexistent clothes are beautiful and indeed the cornerstone of the whole empire is bewildering.

        • So, how do you notationally express the difference between a variable and a particular but unspecified value?

          This is getting into some radical territory, but personally I wouldn’t? I think of probabilities as being assigned to sets. A variable x can represent a point in that set. In this view, all variables are just points in a real space or vectors or whatever and no distinction needs to be made. What we’re really interested in is the probability of some transformed subset of the original space, which I can think of as the probability of the preimage of some transformation parameterized by points in another set.

          For example, x can represent some normally distributed real number. X_0 is an arbitrary subset of X, the space x lives in.

          Pr(X_0) = \int_{X_0} std_normal(x’) dx’

          y = f(x; b, c) := b x + c

          Y_0 is an arbitrary subset of Y, the space y lives in

          Pr(Y_0) = \int_{{(y’ - c) / b for y’ in Y_0}} std_normal(x’) dx’

          and for any g: X -> whatever

          E[g] = \int_{X} std_normal(x’) g(x’) dx’

          E[x] can be seen as just shorthand for E[I] where I is the identity function. Since y = f(x; b, c) = b x + c, it’s easy to see that

          E[f] = \int_{X} std_normal(x’) (b x’ + c) dx’ = b E[I] + c since they factor out of an integral that doesn’t depend on them.

          Here, x, y, b and c all always represent particular points; x and y just have an attached probabilistic structure to their ambient spaces. That is to say, I don’t think the key distinction should be between unknown variables with probabilistic structure and known variables. Matrices and vectors and numbers add and subtract and inner product and matmul together exactly the same way whether we know their values or not, after all. The important distinctions are between points, the spaces they live in, subsets of those spaces, and mappings between them. Doing probability this way, I’d rather reserve capitalization for sets than for “random variables” or anything else.
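
          A numerical sketch of the same set-based calculation, with illustrative values for b, c, and Y_0: Pr(Y_0) is just the standard-normal probability of the preimage of Y_0 under f, and E[f] comes out of the same integral against the standard-normal density.

          # Pr(Y_0) as the standard-normal probability of the preimage of Y_0 under
          # y = f(x; b, c) = b*x + c. Values of b, c, and Y_0 are illustrative.
          from scipy import stats, integrate

          b, c = 2.0, 1.0
          Y_0 = (3.0, 5.0)                          # an interval subset of Y

          # Pull Y_0 back to X (for b > 0): x = (y - c) / b, then integrate the density.
          lo, hi = (Y_0[0] - c) / b, (Y_0[1] - c) / b
          print("Pr(Y_0) =", round(stats.norm.cdf(hi) - stats.norm.cdf(lo), 4))

          # E[f] as an integral against the same density; it equals b*E[I] + c = c here.
          E_f, _ = integrate.quad(lambda x: stats.norm.pdf(x) * (b * x + c), -10, 10)
          print("E[f] =", round(E_f, 4))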

        • I like starting with this view because it removes the convenient notational magic that comes with capitalized random variables and their peculiar symbolic operations. It might take a little bit more thinking to start off with, but it forces you to think principally about the fundamentals of what probability is. And naive sets are a very intuitive concept that you can introduce to undergrads and get very far with without ever talking about zfc, sigma algebras, or countable additivity. You can introduce integrals over pdfs as how one computes probabilities of subsets and expectations as an operation on those pdfs that summarize the space, without ever mentioning a radon-nikodym derivative. There’s just no reason for probability to have all this notational magic in it.

        • hear hear!

          Furthermore, if you just accept what was obvious to Newton and such, which is that infinitesimals exist, you can do all of it with algebra. IST is a particularly easy-to-use form of Nonstandard Analysis. Vast tracts of impenetrable measure-theoretic formalism become just addition and such.

        • Daniel Lakeland,

          hear hear back at ya! I’m a strong proponent of infinitesimal thinking & NSA. Standard analysis has its rightful place in history and graduate-level analysis, but I fail to see how it’s beneficial over the infinitesimal approach, especially in European universities where analysis-laden calculus is taught to fresh-eyed first-year students—at least we in the US take analysis after three semesters of calculus. There’s a reason Archimedes nearly invented calculus two millennia ago using pseudo-infinitesimal logic and not Cauchy sequences!

        • I don’t know why I cannot comment beneath Lakeland’s comment in-situ, but it is so striking that I will do it where I can anyway; he wrote:

          “Furthermore, if you just accept what was obvious to Newton and such, which is that infinitesimals exist, you can do all of it with algebra. IST is a particularly easy-to-use form of Nonstandard Analysis. Vast tracts of impenetrable measure-theoretic formalism become just addition and such.”

          “particularly easy” ?! The tracts of impenetrable formalism to be covered before nonstandard models of the reals can be introduced have to be seen to be believed. Measure theory (the bugbear in this thread today) can at least be followed along half-intuitively. But Bishop’s book? Well, even once upon a time when I thought I cared, I couldn’t even pretend to follow it.

          Now there *is* a worthwhile algebra with a nilpotent element (call it epsilon) with which symbolic libraries can deal out differential identities in any number of variables. And *that* is “just” algebra.

        • A PDF of Nelson’s IST (Internal Set Theory) paper is here:

          http://people.math.harvard.edu/~knill/various/nonstandard/nelson.pdf

          There are two great books on nonstandard analysis that everyone should check out before dismissing NSA as well:

          Alain Robert’s book Nonstandard Analysis (published by Dover and available as a paperback) and

          Henle and Kleinberg’s book “Infinitesimal Calculus” (also Dover, available on Kindle)

          Robert uses Nelson’s IST system. Henle and Kleinberg create their own axiomatic system in which they describe “quasi-big” sets.

          The point is, it’s much harder to construct the hyperreals from ZFC as Robinson did, than it is to simply **use** hyperreals productively.

        • “The point is, it’s much harder to construct the hyperreals from ZFC as Robinson did, than it is to simply **use** hyperreals productively.”

          I think it’s easier to skip all the spooky axiomatics and use the “dual number” extension to the reals, A + B*e, where e is nilpotent: e*e = 0. That’s no more of a stretch than introducing the complex numbers. All identities that are true “up to first order” can be captured algebraically. The deep thinking can be left to the logicians.
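
          As a minimal sketch of that (an illustration only, not anyone’s production code): propagate pairs A + B*e with e*e = 0 through ordinary arithmetic, and the e-coefficient carries the first-order derivative.

          from dataclasses import dataclass

          @dataclass
          class Dual:
              a: float  # real part A
              b: float  # coefficient B of the nilpotent e

              def __add__(self, other):
                  other = other if isinstance(other, Dual) else Dual(other, 0.0)
                  return Dual(self.a + other.a, self.b + other.b)

              __radd__ = __add__

              def __mul__(self, other):
                  other = other if isinstance(other, Dual) else Dual(other, 0.0)
                  # (A1 + B1*e)(A2 + B2*e) = A1*A2 + (A1*B2 + A2*B1)*e, because e*e = 0
                  return Dual(self.a * other.a, self.a * other.b + other.a * self.b)

              __rmul__ = __mul__

          def g(x):
              return 3 * x * x + 2 * x + 5  # any polynomial works as-is

          out = g(Dual(1.5, 1.0))  # seed B = 1 to track d/dx
          print(out.a)             # g(1.5)  = 14.75
          print(out.b)             # g'(1.5) = 6*1.5 + 2 = 11.0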

        • The problem is that nilpotent infinitesimals don’t give you the same structure, and it can be super useful to have “different orders” of infinity or infinitesimal. It makes “asymptotic analysis” extremely easy.

          For example, what is a + e(x)*b(x) in the vicinity of a region where b(x) goes to infinity but e is infinitesimal?

        • “For example, what is a + e(x)*b(x) in the vicinity of a region where b(x) goes to infinity but e is infinitesimal?”

          I don’t think that question is well-posed. You have to posit the *rate* of growth of all the terms. In other words, you cannot dispense with power series to discover what dominates what, and therefore whether the algebraic combination looks like O(F(x)), where F(x) is some stipulated comparator.

        • Exactly, but with NSA you can say, for example, that e(x) is like (x-x0) and b(x) is like 1/(x-x0)^2.

          But if x-x0 is infinitesimal then b(x) is 1/0 which is meaningless. Yet it’s 1/e^2 in IST and the whole product is 1/e which is relevant later when we want to pass it into a function which has some other growth properties etc

        • “But if x-x0 is infinitesimal then b(x) is 1/0 which is meaningless. Yet it’s 1/e^2 in IST and the whole product is 1/e which is relevant later when we want to pass it into a function which has some other growth properties”

          Yes, that means keeping track of power-series expansions in positive and negative powers of the dummy variable. I.e. the way they taught me once upon a time to do it.
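
          For what it’s worth, that bookkeeping can also be done symbolically; a small sketch (an illustration only, with sympy standing in for the hand expansion):

          import sympy as sp

          x, x0 = sp.symbols("x x0")
          e = x - x0           # the 'infinitesimal' factor e(x)
          b = 1 / (x - x0)**2  # the factor that blows up like 1/(x - x0)^2
          a = sp.Integer(5)

          prod = sp.cancel(e * b)  # the product reduces to 1/(x - x0), i.e. order 1/e, as above
          print(prod)
          print(a + prod)          # the sum 5 + 1/(x - x0); near x = x0 the 1/e term dominates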

    • Thanks for bringing up the topic of mathematical notation! Quite a lot of interesting comments here.

      Coming from software development, the most interesting thing for me in the Bayesian notation is that p(x) and p(y) are different functions, not the same function with different arguments.

      • “Coming from software development, the most interesting thing for me in the Bayesian notation is that p(x) and p(y) are different functions, not the same function with different arguments.”

        In terms of sets, (X, Y) is a product space, p assigns probabilities in [0, 1] to its subsets, and p(x) is shorthand for p(f^{-1}(x)), where f: (X, Y) -> X is the simple projection. So you can always recover a single canonical function on the original product space.
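
        A small sketch of that “different functions” point (an illustration, with a made-up discrete joint on a product space): p_x and p_y below are genuinely different functions, each obtained by summing the joint over the other coordinate.

        # toy joint distribution on the product space X x Y, with X = {0, 1}, Y = {0, 1, 2}
        joint = {
            (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
            (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.30,
        }

        def p_x(x):
            # probability of the preimage of {x} under the projection (x, y) -> x
            return sum(pr for (xi, yi), pr in joint.items() if xi == x)

        def p_y(y):
            # probability of the preimage of {y} under the projection (x, y) -> y
            return sum(pr for (xi, yi), pr in joint.items() if yi == y)

        print(p_x(0), p_x(1))          # 0.4 and 0.6 (up to float rounding)
        print(p_y(0), p_y(1), p_y(2))  # 0.15, 0.45, 0.4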

  14. It is always interesting to go up the river and try to find the source of something, including the 2016 p-value statement. Seems to me that a precursor to it was the ASA Statement on Using Value-Added Models for Educational Assessment published on April 8th, 2014. In my book with Galit Shmueli on information quality we assessed the information in the VAM statement and found it very poor. In other words, ASA did not produce an informative statement. The p-value statement is similar and the AMSTAT special issue added confusion.

    Taking a few steps back and looking at this from an organizational perspective:
    1. The writing was on the wall. Presumably ASA wanted to be more active and noticeable (my assumption).
    2. As far as I know, ASA did not run a retrospective evaluation of its 2014 VAM statement. For a statistics association, such an evaluation would seem appropriate.
    3. Statisticians should be the first to encourage data-driven lessons learned (i.e., running a survey on the impact of the ASA VAM statement).
    4. The impact of the ASA p-value statement, and of the AMSTAT issue, was, I believe, mostly educated discussion. However, ASA (and any statistics association) needs to find an opportunity to play a constructive role in the current digital transformation.

    As past president of two statistical associations, I can say that we discussed on several occasions motions to support statements regarding particular aspects of the practice of statistics. I guess associations in accounting and law make such statements. In all cases, the society avoided doing so, the feeling being that such activism would be detrimental. ASA proved us right.

    Understanding Andrew’s experience with Nicole’s talk takes us back to the two systems characterized by Kahneman. The theory of applied statistics should also address issues of fast and slow thinking. Developing frameworks that cover things like data flows (or the generation of information quality) is much needed, beyond ASA statements and presidential task force reports…

    • Ron,

      Very insightful commentary. I am very interested in the role of cognitive biases in reasoning. But I have not come across any work that is actually useful in circumventing biases that are deemed to lead to flawed reasoning. Seems another bias is on the horizon to lead us astray on some path.

      • Sameera,

        Addressing bias is at the core of statistics, way beyond what is done in other disciplines. Part of the protection against bias is the ability to point out mistakes, admit mistakes, and be willing to inform people about the mistakes you made. Andrew has done this several times and provided examples.

        A technical example you might find of interest is the application of cross-validation to assess a neural network model when the data is structured, for example coming from a designed experiment. I have met people who did that and stuck to their worthless results, and people who understood the point and informed others that the model they had uploaded was flawed. The integrity required for such a retraction should be part of the statistics arsenal. For more on this see https://www.youtube.com/watch?v=Yi-e4sMK5tA
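
        To make that concrete, here is a small sketch of the issue (an illustration only, not the example from the talk or the book): the features carry a group “fingerprint”, the response is driven by a group effect, and random K-fold cross-validation looks good only because the same groups appear in both the training and test folds; leaving whole groups out tells the honest story.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import KFold, GroupKFold, cross_val_score

        rng = np.random.default_rng(0)
        n_groups, n_per_group = 20, 10
        groups = np.repeat(np.arange(n_groups), n_per_group)

        # each group (say, an experimental run) has its own feature fingerprint and its own effect
        fingerprint = rng.normal(size=(n_groups, 5))
        effect = rng.normal(0.0, 3.0, size=n_groups)
        X = fingerprint[groups] + rng.normal(0.0, 0.1, size=(groups.size, 5))
        y = effect[groups] + rng.normal(0.0, 0.5, size=groups.size)

        model = RandomForestRegressor(n_estimators=200, random_state=0)

        r2_random = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0))
        r2_grouped = cross_val_score(model, X, y, cv=GroupKFold(5), groups=groups)

        print("random K-fold R^2:", r2_random.mean())      # optimistic: test groups were seen in training
        print("group-wise K-fold R^2:", r2_grouped.mean())  # near zero or worse, which is the honest answer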

        What is interesting in the discussions on this blog, including Sander’s inputs, is that Yoav is the champion of selection-bias countermeasures. He showed us how the picking and choosing of tables or images in publications affects the meaning of statistical claims. Sander is right that the picking and choosing of the ASA presidential task force is a nice example of selective inference.

        I assume Karen realised that when she formed the task force. Given that, one might look into the cognitive bias that this task force suffered from, in a sense repeating the apparent bias in the 2019 AMSTAT special issue, but in the other direction…

        • Thank you Ron, I’ll be sure to access the video. In terms of addressing biases, statisticians, as experts, identify biases in research quite well. In part, the evidence-based medicine movement has had an influence on psychology and political science in particular.

          Andrew’s blog is unique in that it allows for discussions of mistakes. But that is not my experience of offline discussions, because many of them are sponsored by special interests. One has to get the endorsement of a private donor in order to be able to provide alternative perspectives.

          Again, my interest is how to circumvent unhelpful cognitive biases. I have followed the conversations [offline] of many different types of thinkers. Underlying nearly every statement and argument is an identifiable cognitive bias. And it constitutes what I would characterize as ‘incomplete theorization’.

          Sander gets at this problem quite well. But it needs to be made more understandable to experts in particular; simply addressing non-experts just doesn’t cut it. The problem lies fundamentally with the expertise.

  15. “In summary, P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data.”

    It seems that this statement would support the idea that p-values and NHST should *always* be used, in every study, as they will increase scientific “rigor”. The statement supports the idea that papers without p-values may now be rejected, or be required to include p-values, since such inclusions would be seen as necessary improvements to the science.

    I understand the concept that p-values aren’t all bad, and *can* be used in scientific contexts; I just can’t support the idea that p-values virtually always *should* be used, regardless of the study and the statistical paradigm employed.

  16. Andrew:
    You wrote:
    “it’s fair to say that that earlier report is not official ASA policy, and it’s good for this new report to clarify this point”
    The 2021 Report of the Task Force on Statistical Significance and Replicability wasn’t meant to clarify that the 2016 report isn’t ASA policy; it is. It was meant to deny that the Wasserstein et al. (2019) editorial is ASA policy. But you go on to make a very wise point:

    “whether or not that 2016 statement can be considered official ASA policy, I don’t think it should be considered as such, given that there is such a wide range of views within the profession”.

    I agree. It would be valuable to regard both the 2016 and the 2019 articles as simply the views of those who wrote them, acknowledging non-trivial disagreement within the profession and throughout science.
