Attempts at providing helpful explanations of statistics must avoid instilling misleading or harmful notions: ‘Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer’

This post is by Keith O’Rourke and as with all posts and comments on this blog, is just a deliberation on dealing with uncertainties in scientific inquiry and should not be attributed to any entity other than the author. As with any critically-thinking inquirer, the views behind these deliberations are always subject to rethinking and revision at any time.

Getting across (scientifically) profitable notions of statistics to non-statisticians (as well as fellow statisticians) ain’t easy.

Statistics is what it is, but explaining it as what it ain’t just so it is easy to understand (and thereby likely to make you more popular) should no longer be tolerated. What folks take away from easy-to-understand but incorrect explanations can be dangerous to them and others. Worse, they can become more gruesome than even vampirical ideas – false notions that can’t be killed by reason.

I recently came across the explanation quoted in the title of this post in a YouTube video a colleague tweeted: How Not to Fall for Bad Statistics – with Jennifer Rogers.

The offending explanation of statistics as the alchemy of converting uncertainty into certainty occurs at around 7 minutes. Again, “Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer.” So if you were uncertain whether something caused cancer, just use statistical significance to determine if it definitely does or definitely doesn’t. Easy peasy. If p > .05, nothing to worry about. On the other hand, if p < .05, do whatever you can to avoid it. Nooooooo!

Now, it would be one thing if just any statistician, or maybe even a highly credentialed statistician, were giving such a talk – but at the time Jennifer Rogers was the Director of Statistical Consultancy Services at the University of Oxford, an associate professor at Oxford and still is vice president for external affairs of the Royal Statistical Society. And has a TEDx talk list on her personal page. How could they have gotten statistical significance so wrong?

OK, at another point in the talk she did give a correct definition of p-values, and at another point she explained a confidence interval as an interval of plausible values. [Note from Keith – a comment from Anonymous led me to realise that I was mistaken here regarding an interval of plausible values being an acceptable explanation. I now see it as totally wrong and likely to lead others to believe the confidence interval is a probability interval. More explanation here.] But then, for a particular confidence interval at around 37 minutes, she claimed “I would expect 95% of them between 38 and 66”, where she seems to be referring to future estimates or maybe even the “truth”. Again getting across (scientifically) profitable notions of statistics to non-statisticians (as well as fellow statisticians) ain’t easy. We all are at risk of accidentally giving incorrect definitions and explanations. Unfortunately those are the ones folks are most likely to take away as they are much easier to make sense of and seemingly more profitable for what they want to do.

So we all need to speak up about them and retract ones we make. This video has had almost 50,000 views!!!

Unfortunately, there is more to complain about in the talk. Most of the discussion about confidence intervals seemed to be just a demonstration of how to determine statistical significance with them. The example made this especially perplexing to me, given that it addressed a survey to determine how many agreed with an advertisement claim – of 52 surveyed, 52% agreed. Now when I first went to university, I wanted to go into advertising (there was even a club for that at the University of Toronto). Things may have changed since then, but back then getting even 10% of people to accept an advertising claim would have had to be considered a success.

But here the uncertainty in the survey results is assessed primarily using a null hypothesis of 50% agreement. What? As if we are really worried that 52 people flipped a random coin to answer the survey. Really? However, with that convenient assumption it is all about whether the confidence interval includes 50% or not. At around 36 minutes, if the confidence interval does not cross 50%, “I say it’s a statistically significant result.” QED.
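To make the mechanics concrete (and to show where the “38 and 66” quoted above comes from), here is a minimal sketch in Python, assuming a normal-approximation (Wald) interval for a proportion; the talk may well have used a different interval method:

import math

# Survey example from the talk: 52 people surveyed, 52% agreed.
n = 52
p_hat = 0.52

# Normal-approximation (Wald) 95% interval -- an assumption on my part;
# the talk may have computed the interval differently.
se = math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"95% CI: {100 * lower:.0f}% to {100 * upper:.0f}%")   # roughly 38% to 66%

# The "statistical significance" step in the talk: does the interval cross 50%?
print("Crosses 50%:", lower < 0.50 < upper)                  # True -> "not significant"

Of course, this only demonstrates the arithmetic; whether 50% agreement is a meaningful null hypothesis for an advertising claim is the real issue.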

Perhaps the bottom line here is that, just as journalists would benefit from statisticians giving advice as to how to avoid being misled by statistics, all statisticians need other statisticians to help them avoid explanations of statistics that may instil misleading notions of what statistics are, can do and especially what one should make of them. So we all need to speak up about them and retract ones we make.

P.S. from Andrew based on discussion comments: Let me just emphasize a couple of things that Keith wrote above:

Getting across (scientifically) profitable notions of statistics to non-statisticians (as well as fellow statisticians) ain’t easy.

We all are at risk [emphasis added] of accidentally giving incorrect definitions and explanations.

All statisticians need other statisticians to help them avoid explanations of statistics that may instill misleading notions of what statistics are, can do and especially what one should make of them. So we all need to speak up about them and retract ones we make.

As Keith notes, we all are at risk. That includes you and me. The point of the above post is not that the particular speaker made uniquely bad errors. The point is that we all make these sorts of errors—even when we are being careful, even when we are trying to explain to others how to avoid errors. Even statistics experts make errors. I make errors all the time. It’s important for us to recognize our errors and correct them when we see them.

P.S. from Keith about the intended tone of the post.

As I wrote privately to a colleague involved with the RSS, “Tried not to be too negative while being firm on concerns.”

Also, my comment “thereby likely to make you more popular” was meant to be descriptive of the effect not the motivation. Though I can see it being interpreted otherwise.

P.S2. from Keith: A way forward?

From Andrew’s comment below: “Given Rogers’s expertise in statistics, I’m sure that she doesn’t really think that statistical significance can tell us whether or not something definitely does or definitely doesn’t cause cancer. But that’s Keith’s point: even experts can make mistakes when writing or speaking, and these mistakes can mislead non-experts, hence the value of corrections.” I should have written something like what he put in the first sentence, argued that these mishaps can cause damage in even the best of presentations, and that regardless they need to be pointed out to the author, who then hopefully will try to correct possible misinterpretations among most of the same audience. Something like: “Some of the wording was unfortunate and was not meant to give the impression statistical significance made anything definite. Additionally, showing how a confidence interval could be used to assess statistical significance was not meant to suggest that is how they should be interpreted.”

 

132 thoughts on “Attempts at providing helpful explanations of statistics must avoid instilling misleading or harmful notions: ‘Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer’”

  1. It is bad enough when Dr. Gelman engages in this drivel. No idea who Kevin O’Rourke is or why Dr. Gelman allows this cretin to pollute a wonderful blog.
    PS How come my comments re: cancer and cell phones and leukemia and cell phone towers is being censored?
    https://www.cnn.com/2018/05/02/health/brain-tumors-cell-phones-study/index.html
    https://scienceblog.cancerresearchuk.org/2016/10/31/sellafield-radiation-and-childhood-cancer-shedding-light-on-cancer-clusters-near-nuclear-sites/
    https://www.cbsnews.com/news/cell-tower-shut-down-some-california-parents-link-to-several-cases-of-childhood-cancer/

    • Keith [sic] O’Rourke is a long-time commenter and contributor to the blog: a quick Google search takes you to their Google Scholar page, https://scholar.google.ca/citations?user=R064zwoAAAAJ&hl=en

      I don’t think your comments were censored; I saw at least one comment of yours containing these links on the previous (“horns and cell-phone towers”) post.

      You seem to be mixing together effects of radioactivity (ionizing radiation, e.g. from nuclear power plants) and several different kinds of non-ionizing radiation (e.g. cell-phone towers vs cell-phone use). You’re also citing a mixture of news stories about *concern* (e.g. “a cancer cluster occurred, some parents are concerned that it’s caused by cell-tower proximity”) and scientific studies.

      The conversation on this site is generally more polite than the average internet comment section (e.g. “drivel” and “cretin” seem unnecessary).

    • There’s a lot that’s been written on these topics so it’s hard to recommend a short list of resources but I think the most accessible stuff has been written in the past few years because it’s recent enough to address many controversies around the topics. I list them in order of difficulty (least difficult to most difficult)

      Denworth, L. (2019), “The Significant Problem of P Values,” Scientific American, Available at http://www.scientificamerican.com/article/the-significant-problem-of-p-values/. https://doi.org/10.1038/scientificamerican1019-62.

      Amrhein, V., Greenland, S., and McShane, B. (2019), “Scientists rise up against statistical significance,” Nature, 567, 305. https://doi.org/10.1038/d41586-019-00857-9.

      (Discussion of the paper above on this blog: https://statmodeling.stat.columbia.edu/2019/03/20/retire-statistical-significance-the-discussion/)

      Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G. (2016), “Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations,” European Journal of Epidemiology, 31, 337–350. https://doi.org/10.1007/s10654-016-0149-3.

      Gelman, A., and Greenland, S. (2019), “Are confidence intervals better termed ‘uncertainty intervals’?,” BMJ, 366. https://doi.org/10.1136/bmj.l5381.

      Chow, Z. R., and Greenland, S. (2019), “Semantic and Cognitive Tools to Aid Statistical Inference: Replace Confidence and Significance by Compatibility and Surprise,” arXiv:1909.08579 [stat.ME]. https://arxiv.org/abs/1909.08579.

      Greenland, S. (2019), “Valid P-values behave exactly as they should: Some misleading criticisms of P-values and their resolution with S-values,” The American Statistician, 73, 106–114. https://doi.org/10.1080/00031305.2018.1529625.

    • You might also try the notes linked from https://web.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html .

      They start with fundamental (but common) misunderstandings involving uncertainty (such as “Expecting too much certainty”, “Terminology-inspired confusions”, and “Mistakes involving causality”), since these basic misunderstandings contribute a lot to misunderstanding statistical inference.

      • Thanks.

        When people are putting together explanatory talks on statistics, they would benefit by checking with your notes and the included links.

        Also, sending them around to statistical colleagues before giving them would help.

        • I see what happened in the Jennifer Rogers tape on P-values. The intended point is the one she makes at 8:00, that two different risks can be associated with the same statistically significant increase, yet correspond to different magnitudes of risk increase. What happens is, feeling the need to “discredit” statistical significance, spokespeople often pick up on the over-the-top or mangled criticisms that abound. For example, in Wasserstein et al, 2019, we hear of “the seductive certainty falsely promised by statistical significance.” This summarizes the point in the McShane article: “the false promise of certainty offered by dichotomous declarations of truth or falsity—binary statements about there being ‘an effect’ or ‘no effect’.” The allegation is baffling to any statistical significance test user, because every inference is qualified by an error statistical assessment. However, if you learn your “statistical significance no nos” from these critical guides, then it’s not surprising that you might echo the claim that a statistically significant result purports that there’s definitely a risk. That’s where we get the kooky claim at 7:15.
          For decades people knew how to state the correct (but not over-the-top) point: a P-value, by itself, doesn’t give the magnitude of the effect; a statistical significance report is often binary: there’s a statistically significant risk or not. In short, they would make the kind of correct claim she makes at 8:00 (the same statistically significant increase may correspond to different magnitudes of risk increase). By feeding people the latest over-the-top versions of well known, people have learned to echo the wildly fallacious, unwarranted claims we hear. After all, that’s what today’s critics aver, just as when Amrhein et al. claim that a non-statistically significant result purports to prove the truth of a point null hypothesis (therefore ban the word significance). That’s one of the big reasons that today’s vilification of statistical significance is doing more harm than good.
          https://errorstatistics.com/2019/11/14/the-asas-p-value-project-why-its-doing-more-harm-than-good-cont-from-11-4-19/

        • My comment leaves out “criticisms” in the sentence:
          “By feeding people the latest over-the-top versions of well known criticisms, people have learned to echo the wildly fallacious, unwarranted claims we hear.”

        • Mayo, why do you think frequentist statistics is so difficult to teach right?

          To put this in perspective consider the following timeline:

          (1) A century before Fisher, Laplace was routinely doing great science that held up well, using bayesian tests indicating when phenomena rose above the measurement noise.

          (2) All through the 1900s excellent science was being done without Fisher’s, Neyman’s, or Pearson’s help. On a per capita, or per dollar, basis, they were likely doing more good science than in our lifetime.

          (3) Then p-values make an appearance, the key mathematical step of which is a tail area integration trivial to anyone who has taken freshman calculus. As a definition, it easily falls within the bottom half of ideas STEM majors regularly digest.

          (4) An 80-year monopoly on the teaching of statistics by all the great champions of Frequentist statistics, their students, and fellow lovers of Frequentism. For a good 50 years they had a rock solid, impenetrable monopoly on it. For the average scientist today this monopoly hasn’t cracked at all.

          Given 1, 2, 3, 4, why are significance tests so stubbornly difficult to teach right in your opinion? Or is this just one of those “no true Scotsman” fallacies, like saying “communism didn’t fail, it’s just never *really* been tried”?

        • There seems to be something “seductive” about p-values, that prompts people (who may constitute the majority of people) who crave “THE answer” or “THE method” (or perhaps who crave certainty in general) to see (or interpret?) p-values as THE answer to what they crave.

        • Scientists don’t get what p values mean, but they’re pretty happy to interpret them as “the probability that I’m wrong”… so they see “the probability that I’m wrong is 0.00023” and think “this is awesome!”

        • Anonymous, your timeline is quite off. P-values did not appear after Laplace; they were already being calculated before he was even born. In fact, Laplace also calculated them! Both he and Arbuthnot calculated them when looking at births.

          See Stigler (1986) “The Measurement of Uncertainty Before 1900”

        • I own the book. I meant p-values in their modern incarnation since the 1930s, as part of the main system of significance testing. You can’t put every caveat imaginable in a blog comment. For example, some of those teaching frequentist methods were actually Bayesians who were required by their department to teach frequentist methods.

  2. I feel much better. I thought this was like a blog-specific rejection of the basis of all scientific research. but clearly i was wrong. thank you to Zad and apologies to Keith. I cannot really believe that the editor of psychological review has stopped caring about p-values. or the editor of science or nature or cognitive neuroscience. seems like a weird little faux-controversy. will read the scientific american article as soon as i score some ketamine.

  3. Thanks for posting this, Keith. I’m really astonished that anyone in statistics could claim that a single number or decision-making framework could definitively tell them something about a phenomenon. I guess Aschwanden nailed it with her article on P-values: even those who have thought and written about them for a long time have difficulty explaining them.

    https://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/

    • From the article:

      “The p-value won’t tell you whether the coin is fair, but it will tell you the probability that you’d get at least as many heads as you did if the coin was fair. That’s it — nothing more. ”

      This is a great example. The simple experiment – a coin flip – can’t possibly go wrong.

      Now translate that “coin flip” into the context of the power pose: “the probability that you – a student posing as a job candidate – would get a fake job, offered by another student posing as a person offering a job – after doing a power pose in the mirror”. Then remember that the coin flip is an actual event, but your power pose experiment is a simulation of an actual event, roughly comparable to a five-year-old simulating cooking dinner with cardboard kitchen appliances.

      P is perfectly legitimate for simple experiments, and as a cut-off value in assessing some minimum level of efficacy. If people would just use it for those experiments, it would be fine.
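      To make the quoted coin-flip definition concrete, here is a minimal sketch in Python (the 60-heads-in-100-flips numbers are made up for illustration, not from the article): the p-value is just a binomial tail probability.

      from math import comb

      def coin_flip_p_value(n_flips, n_heads):
          """P(at least n_heads heads in n_flips flips of a fair coin) --
          the one-sided p-value in the sense of the quote above."""
          return sum(comb(n_flips, k) for k in range(n_heads, n_flips + 1)) / 2 ** n_flips

      # Made-up example: 60 heads in 100 flips of a supposedly fair coin.
      print(coin_flip_p_value(100, 60))  # about 0.028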

    • Definitely ain’t easy as in the last sentence of your link “You can get it right, or you can make it intuitive, but it’s all but impossible to do both.”

      That’s why I did not try and do that in this post. For one, it would make the post too long, and I was counting on some suggested references that would likely do better.

      By the way, I noticed Stephen Goodman’s quote to this effect earlier and did this post to try and deal with the “what to make of it” concept in science. https://statmodeling.stat.columbia.edu/2017/08/31/make-reported-statistical-analysis-summaries-hear-no-distinction-see-no-ensembles-speak-no-non-random-error/

    • “Not even scientists can easily explain p-values”

      If that is true, then they have no hope for the intricacies of priors and MCMC settings.

      Consider “Use of significance test logic by scientists in a novel reasoning task”, by Morey and Hoekstra (https://psyarxiv.com/sxj3r/) and find their experiment and interactive app of results here (https://richarddmorey.github.io/Morey_Hoekstra_StatCognition/index.html) and here (https://richarddmorey.shinyapps.io/explore/). In the article abstract, they say

      “Although statistical significance testing is one of the most widely-used techniques across science, previous research has suggested that scientists have a poor understanding of how it works. If scientists misunderstand one of their primary inferential tools the implications are dramatic: potentially unchecked, unjustified conclusions and wasted resources. Scientists’ apparent difficulties with significance testing have led to calls for its abandonment or increased reliance on alternative tools, which would represent a substantial, untested, shift in scientific practice. However, if scientists’ understanding of significance testing is truly as poor as thought, one could argue such drastic action is required. We show using a novel experimental method that scientists do, in fact, understand the logic of significance testing and can use it effectively. This suggests that scientists may not be as statistically-challenged as often believed, and that reforms should take this into account.”

      Justin

    • Zad: In wishing to express that stat sig tests do NOT report a posterior probability that X causes cancer, she wrote that they report whether X definitely (as opposed to probably) causes cancer. This, of course, is the conception of tests promoted in Wasserstein et al 2019, who wrongly speak of the “seductive certainty falsely promised by statistical significance”. They promise no such thing. Yet you don’t see people rising up to point out that they are distorting tests.

      • Deborah G. Mayo: I am puzzled by some of your comments. You seem to be saying that the idea that significance tests give false certainty comes from critics of NHST – have I understood that correctly?
        That seems at odds with my experience (to put it mildly).

      • Deborah:

        Wasserstein et al. do not promote the view that “Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer.”

        Wasserstein et al. promote the view that statistical significance is wrongly believed by researchers to tell them whether or not etc.

  4. I’d like to morph your first line slightly:

    “Science is what it is, but explaining it as what it ain’t just so it is easy to understand…should no longer be tolerated. “

  5. Nice, I liked the video @Zad

    So who has a 30 second elevator pitch explaining a p-value?

    I’ll start (fire away!)

    What is a p-value? The probability of sampling from a null distribution and seeing a test statistic as large or larger than the one you found.

    What is a null distribution? The distribution of test statistics that would occur if we were to compute a test statistic for every possible permutation of treatment assignment.

    What is a null distribution? The distribution of test statistics that would occur if we were to replace the set of responses with random numbers generated by a specific random number generator. (my interpretation of Daniel Lakeland)

    Okay, almost no one in any elevator I’ve been in would understand these.
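    And for anyone who wants to see the second null-distribution definition above in working form rather than in an elevator, here is a minimal sketch (made-up data, a difference-in-means test statistic, and a Monte Carlo sample of permutations rather than the full enumeration):

    import random

    def permutation_p_value(treated, control, n_perm=10_000, seed=1):
        """Two-sample permutation test: the p-value is the fraction of random
        re-assignments of the treatment labels whose |difference in means| is
        at least as large as the observed one."""
        random.seed(seed)
        observed = abs(sum(treated) / len(treated) - sum(control) / len(control))
        pooled = treated + control
        n_t = len(treated)
        count = 0
        for _ in range(n_perm):
            random.shuffle(pooled)
            diff = abs(sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / (len(pooled) - n_t))
            if diff >= observed:
                count += 1
        return count / n_perm

    # Made-up data purely for illustration
    treated = [5.1, 4.9, 6.2, 5.8, 6.0]
    control = [4.2, 4.8, 5.0, 4.5, 4.7]
    print(permutation_p_value(treated, control))

    In a real analysis you would enumerate all permutations when feasible, or use an established library; this is only meant to make the verbal definition above concrete.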

    • Indeed, this is one of many reasons that ‘classical stats’ should no longer be taught as the 101 default material. These concepts are actually quite subtle and easy to abuse! The reason we’re not seeing reform after all this endless discussion is that there is no simple alternative – teaching probability, basics of math modeling and Bayes takes time and sufficient mathematical maturity. That’s a huge pill to swallow given where the incentives are right now…

      • Indeed, if you’ve been doing this stuff for a couple decades, it’s easy to forget how hard the learning curve can be. And if we waste a lot of people’s time teaching them how to carry out t-tests and chi-squared tests by hand, then by the time they’re supposed to be able to apply themselves to scientific questions they discover that they’ve been cheated… there’s a natural revulsion to the idea of starting over from scratch.

      • “Indeed, this is one of many reasons that ‘classical stats’ should no longer be taught as the 101 default material. These concepts are actually quite subtle and easy to abuse! The reason we’re not seeing reform after all this endless discussion is that there is no simple alternative – teaching probability, basics of math modeling and Bayes takes time and sufficient mathematical maturity. That’s a huge pill to swallow given where the incentives are right now…”

        Are you claiming that Bayes methods wouldn’t get abused? Now that is hard to swallow.

        Justin

        • I agree that Bayesian methods can be abused — so I think that what is needed is to integrate probabilistic thinking into science teaching. And it wouldn’t hurt if this started in secondary education — which means that secondary science teachers need to have a good background in probability, and in how it is part and parcel of most aspects of science, including scientific discovery.

        • Justin, look at main message of my post. Of course Bayes can be abused! I’m pointing out that teaching a rigorous approach that leads to fundamentally better science and stat modeling takes a lot of time and energy…I got no problem with Sander Greenland style use of p values fwiw.

  6. I could not have more respect, admiration and appreciation for Columbia University statistics professor Andrew Gelman for creating a forum for serious discussion of a wide range of topics involving statistics – from addressing contentious empirical hypotheses (does repeated close proximity of a radioactive device to the cranium increase, however slightly, the risk of glioblastomas, gliomas and other thankfully rare brain tumors?) to abstract meta-mathematical questions (even though a+b=c is equivalent to a=c-b, in the real world one formulation might prove more useful).

    https://statmodeling.stat.columbia.edu/2019/12/15/causal-inference-and-within-between-person-comparisons/

    This is the only blog by an academic that truly contributes to academia that I have seen.

    That being said, I find this post so scientifically illiterate, sexist, and arrogant as to be deemed not worthy of this stellar blog. The punchline is basically: the leading scientist in Australia, Dr. Jennifer Rogers, is a stupid lady.
    https://en.wikipedia.org/wiki/Jennifer_Rogers
    She is stupid. I am a smart man. There really is no other point, is there?
    you take a sentence out of a half hour public lecture, trying to communicate the difference between:
    a. is there an effect (statistical significance) and
    b. what is the effect size or real world relevance?
    your implication is that men (like you and gelman and every other poster of this blog, right? – what percentage of total posts are by women? .0000001%? 0?) do not sometimes distort the truth to communicate an idea more clearly.
    she is saying “question 1 is whether there is an effect of one variable, say exposure to cell phones, and another variable, probability of glioma. this is statistical significance. question 2 is how significant is the effect? if cell phone exposure increases the risk of glioma from 1 in a trillion to 5 in a trillion, p<.00001, so what?”
    that is a deep, important, difficult to convey point.
    can you show me another post on this blog where you take a video of a prominent statistician speaking to a lay audience and take one sentence out and "critique" the out of context sentence? is there any video of a male Ph.D. statistician where you link to minute 23:23 and spend 500 words critiquing an out of context sentence? This is what we expect from New Yorkers like Trump and Giuliani, not Gelman.
    this type of scientific fraud and sexism cannot be kosher at columbia, can it?

    • Beatrice:

      Thanks for the kind words about the blog.

      Regarding the rest of your comment: I don’t see anywhere in Keith’s post where he said that Rogers was stupid or that he (Keith) was smart. Nor did he say or imply that men do not distort the truth (search this blog for Wansink etc) or make serious mistakes when trying to communicate (see, for example, Stupid-ass statisticians don’t know what a goddam confidence interval is and Bigshot statistician keeps publishing papers with errors; is there anything we can do to get him to stop???). Also, it’s not clear to me that Keith is taking any statements from Rogers’s speech out of context.

      Here is what Keith wrote above:

      Again getting across (scientifically) profitable notions of statistics to non-statisticians (as well as fellow statisticians) ain’t easy. We all are at risk of accidentally giving incorrect definitions and explanations. Unfortunately those are the ones folks are most likely to take away as they are much easier to make sense of and seemingly more profitable for what they want to do.

      This has nothing to do with whether the speaker is a man or a woman; indeed Keith says, “We all are at risk of accidentally giving incorrect definitions and explanations.” I think he’s right that we’re all at risk here: I’ve given incorrect definitions and explanations, and it’s a problem, something worth addressing for all of us. The statement, “Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer,” is a problem; this has nothing to do with whether the speaker happens to be a man or a woman.

      • I did consider changing the pronoun for Rogers from she (as it was on their homepage) to they on my post as a way to head off a criticism like this, but thought it would be unlikely to help.

        In some ways I am caught off guard by these. Of my 10 top-cited papers on Google Scholar, 5 had women co-authors, all with significant or lead roles in the research, and 8 of the 10 were published before 2000. So to me we got past this years ago, even though I know we didn’t.

      • I have to agree with Beatrice in that the tone/wording of the post felt more personally-attackish than is usual for this blog. I especially didn’t like how “(and thereby likely to make you more popular)” comes close to ascribing selfish intent to Rogers. Rogers also makes some good points in the part of the video I’ve seen (about the importance of absolute risk magnitude) and I find it uncharitable that the post reads as if the video were almost exclusively a load of nonsense. This would IMHO be inappropriate regardless of the gender of the target of the critique. You could even argue that this is a bit worse against a woman, as it is much easier to provoke an online mob to bash a woman than a man (not sure to what extent this is happening here).

        On second reading of the post I see you could argue you are not _technically_ that harsh, but my feeling (N=1) on the first reading was quite negative, so that’s just my feedback I want to offer.

        • Martin:

          I don’t think anything that Keith wrote in the post was anywhere near as “personally-attackish” as Beatrice’s comment, which described Keith’s post as being “scientifically illiterate, sexist, arrogant” etc. That said, I think it’s fine for Beatrice to share all this in this comment thread where it is possible to reply to her statements, just as I think it’s fine for Keith to express his views in a post. As discussed elsewhere in this thread, it’s hard to explain statistics without making mistakes, so I agree that it should not be taken as a personal failing when we make such mistakes. We all just have to work on this. P-values, Bayesian statistics, etc., probability theory in general: they’re all tricky, and it’s easy for experts to mess things up in our explanations.

    • Unfortunately this sounds mostly like trolling to me. In general the comment section here gets much less trolling than other parts of the internet. But I suppose it’s inevitable eventually. I’ll also point out that Martha Smith, Diana Senechal, Elin, and several other regular contributors are women and are generally treated with the same level of respect that others are treated with on this blog. Women are a minority but nothing like the tiny minority implied here.

    • Beatrice: The important lesson is one that’s not being identified here: She isn’t defining p-values. She’s trying to make the correct point (at 8:00) and along the way repeats the claim she learned in her “here’s how to discredit statistical significance” playbook (about the “seductive certainty falsely promised by statistical significance”)–written by many critics of stat significance right here. See my comment.

      https://statmodeling.stat.columbia.edu/2019/12/18/attempts-at-providing-helpful-explanations-of-statistics-must-avoid-instilling-misleading-or-harmful-notions-statistical-significance-just-tells-us-whether-or-not-something-definitely-does-or-defin/#comment-1207722

  7. You would think the National Academies of Sciences, writing a guidebook on replication and staffed with leading statisticians, would define P-values correctly. They don’t. I have written to them; the committee leader told me he was very happy to have my corrections, but it doesn’t seem anything can be done to issue an erratum. I think these erroneous interpretations have gotten worse because a blessing has been given to using p-values as a kind of likelihood in a quasi-Bayesian computation–I call it the diagnostic-screening model of tests. The more that significance test critics encourage a hodge-podge of “reforms” along these lines, the more distorted P-values are becoming.
    https://errorstatistics.com/2019/09/30/national-academies-of-science-please-correct-your-definitions-of-p-values/
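    (For readers who have not seen it, the diagnostic-screening computation being criticized here usually looks something like the following sketch, with arbitrary assumed prevalence, power, and alpha; it is meant only to illustrate the kind of calculation referred to, not to endorse it.)

    def prob_null_given_significant(prevalence, power, alpha):
        """Diagnostic-screening-style calculation: P(null true | "significant"),
        treating hypotheses like patients being screened, with some assumed
        prevalence of real effects. All inputs below are arbitrary assumptions."""
        p_sig = (1 - prevalence) * alpha + prevalence * power
        return (1 - prevalence) * alpha / p_sig

    print(prob_null_given_significant(prevalence=0.10, power=0.80, alpha=0.05))
    # about 0.36 under these assumed numbers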

    • I agree that since likelihood has a technical meaning, it shouldn’t have been used in its colloquial meaning.

      I really like hearing you say that you don’t approve of “using p-values as a kind of likelihood in a quasi-Bayesian computation–I call it the diagnostic-screening model of tests” because I think it strengthens my own point that, in fact, most people who claim to be doing Frequentist statistics but are doing things like mixed models are doing Bayes with flat priors and interpreting their results in more or less Bayesian ways.

      Of course from my perspective this is good because I want to simply say “hey you’re doing shabby Bayes already, just go ahead and take the next step and do it right” but by having comments like yours on the blog to point to I can then argue that they’re just mistaken when they say “oh no, these are entirely frequentist ideas”

    • Deborah:

      You write, “You would think the National Academies of Sciences, writing a guidebook on replication and staffed with leading statisticians, would define P-values correctly. They don’t.”

      It is not a surprise that the National Academies of Sciences gets things wrong. Don’t forget, they publish the journal PNAS, which is notorious for publishing junk science such as ages-ending-in-9, himmicanes, etc. I’m sure the National Academies of Sciences does lots of good things, but a lot of what they do is to reify the eminence of their members, so if their members make mistakes, that’s gonna be a problem. They have an intellectual conflict of interest. I’m not talking about $ here, I’m talking reputation.

  8. I don’t get where pointing out a potential cause for an increase in fallacious defn of P-values suggests I’m doing Bayes with flat priors in the least. I assure you I am not, but never mind:
    Here’s one of their defns.

    (1) Scientists use the term null hypothesis to describe the supposition that there is no difference between the two intervention groups or no effect of a treatment on some measured outcome (Fisher, 1935). (2) A standard statistical test aims to answers the question: If the null hypothesis is true, what is the likelihood of having obtained the observed difference? (3) In general, the greater the observed difference, the smaller the likelihood it would have occurred by chance when the null hypothesis is true. (4) This measure of the likelihood that an obtained value occurred by chance is called the p-value. (NAS Consensus Study p. 34)

    • No, sorry if I wasn’t clear. YOU are doing your thing, and I fully agree it’s not Bayesian. And other people I talk to *think* that they are doing *your* sort of thing, but they aren’t and having you claim that quasi-Bayesian interpretations are erroneous is very helpful because it shows that the people I talk to who think they’re doing your sort of thing but are actually doing a quasi-Bayesian thing are in fact mistaken.

        • Hey Daniel, I saw a tweet I liked by your doppelganger Daniel Lakens (https://twitter.com/lakens/status/1201389544853123072):

        These discussions will go nowhere as long as it mainly attracts zeolots who have applied whatever solutions they propose themselves less than 100 times.

          Unfortunately, these discussions attract zealots, some of whom write entire books on the subject, who have applied the solutions they propose less than 1 time.

        • I’m not a big twitter fan obviously preferring to write excessively lengthy posts over here instead ;-)

          To me the discussions about statistics and meaning are good even if they involve a lot of zealotry, because they point out that it’s not all solved problems handed down by Fisher or Neyman or Cox or whoever. Just the existence of controversy enables the discussion of what we think should be done, as opposed to something like, say, the quadratic formula, where if you get something other than [-b +- sqrt(b^2 - 4ac)] / (2a) we know for sure you’re wrong.

          Doing applied stats is hard, and I don’t think you should have to be an applied mathematical modeler / statistician to have an opinion about how statistics should be done, but I will say that it helps to have confronted a bunch of real world data and had to try to do something useful with it.

        • I’m envious of those who go their entire career without having to confront reality in even the tiniest and slightest of ways. What bliss it must be to be immune to failure. We should coin a phrase for statistical solutions the author never even tried once. How about:

          “ideas that haven’t passed hard tests”

          no that’s not it. How about,

          “ideas that haven’t passed difficult tests”

          no that’s not it either. I feel like I’m getting close though.

          “ideas that haven’t passed strenuous tests”

          Closer, but not quite there yet. How about …

  9. The Beatrice Pascal statement should be considered a “po” statement: what you use if you want to stir creativity. In that context you make an absurd statement that is designed as a provocation aimed at raising reactions. In order to make sure everyone understands the purpose, you precede it with the term “po” followed by your statement. https://en.wikipedia.org/wiki/Po_(lateral_thinking)

    The provocation statement of Pascal led to interesting comments. They all focused on the interpretation of p-values. None on alternative representations of findings.

    Part of the acrobatics of Jennifer Rogers is due to her attempt to bridge two worlds. The non-statistical world uses verbal statements. The statistical jargon in the statistical world is used to qualify such statements. If we take a step back, we (statisticians) can become better at formulating verbal statements.

    In the more general framework of information quality (Kenett and Shmueli, 2016), the point is that generalisation of findings is also a task to be performed by statisticians, in collaboration with domain experts.

    I proposed doing that in the context of pre-clinical and clinical research with a methodology based on alternative representations, some with meaning equivalence and some with surface similarity. To test these, I suggested invoking the S-type error approach of Gelman and Carlin.

    The statement could be: “something (X) enhances the occurrence of cancer (Y)”. The S-type error being that increasing X actually reduces Y. Controlling for such errors is done by proper design and powering of the study. Calculation of the S-type error uses a computation-based, empirical-Bayes-like approach accounting for the study design. Clinicians understand this formulation of findings. A table with alternative statements would then be presented in a section on generalisation of findings. But I wrote about this before, with examples….
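    A minimal sketch of the basic Gelman and Carlin Type S calculation may help make this concrete (the empirical-Bayes-like variant mentioned above builds on the same idea; the true effect and standard error below are made-up inputs):

    from statistics import NormalDist

    def type_s_error(true_effect, se, alpha=0.05):
        """Retrodesign-style calculation in the spirit of Gelman and Carlin (2014):
        given an assumed true effect and standard error, return the power of the
        significance test and the probability that a statistically significant
        estimate has the wrong sign (the Type S error rate)."""
        z = NormalDist().inv_cdf(1 - alpha / 2)
        shift = true_effect / se
        p_sig_right_sign = 1 - NormalDist().cdf(z - shift)
        p_sig_wrong_sign = NormalDist().cdf(-z - shift)
        power = p_sig_right_sign + p_sig_wrong_sign
        return power, p_sig_wrong_sign / power

    # Made-up numbers: a small true effect measured noisily.
    power, type_s = type_s_error(true_effect=0.1, se=0.3)
    print(f"power = {power:.2f}, P(wrong sign | significant) = {type_s:.2f}")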

      • Daniel:

        Agreed. And, just to be clear: following the thread, it does not seem to be Rogers who got the twitter war started.

        I guess the real lesson here is that, given that arguments will get played out on twitter, we have to be careful when blogging to anticipate possible twitter misinterpretations when writing our posts. Given that Keith began his post with, “Getting across (scientifically) profitable notions of statistics to non-statisticians (as well as fellow statisticians) ain’t easy,” I’m not quite sure how much more he could’ve done. I guess it would’ve helped for him to have added a sentence such as, “It’s hard for you and me to do this too.”

        Writing is tough. Sometimes we need to bend over backward to avoid misinterpretation, but too much bending-over-backward can read like throat-clearing (sorry for the mixed metaphor) and can frustrate readers. So it’s not always clear what to do. In retrospect, given the twitter thread that occurred, I think some more throat-clearing would’ve helped in the above post (as is the case for some of my posts too!), but I wouldn’t’ve thought it necessary ahead of time.

        • OK, I’ll put some thoughts for now here.

          We (I) need to do better next time. The purpose of such a post is not to start a twitter war with people trying to diminish and discredit each other (or to give Andrew even more distractions) but to draw attention to what one _believes_ is an error, have it assessed as potentially such and, if so, corrected so the community can move forward _together_.

          I don’t want to believe a statistician cannot criticize another statistician without them _needing_ to become enemies!

          Unfortunately, given the time I spent verifying other parts of her talk, I ran out of time to mention that I found Rogers’s explanation of regression to the mean rather good. However, I am happy that few if any comments here are overly negative about her.

          Now, what I did do (in addition to what Andrew already pointed out).

          I did point out some things that were favorable: “she did give a correct definition of p-values and at another point she explained a confidence interval as an interval of plausible values”.

          I also emailed her soon after the posting, inviting comments or email to me (don’t know for sure that she got that or replied).

          Other suggestions?

        • Confidence Intervals definitely aren’t an interval of “plausible values”. There’s nothing in their construction forcing or indicating this is true, and in general, you can’t assume that interpretation except in the simplest examples where it approximates a bayesian credibility interval.

          In fact, confidence intervals in real problems can be a range of provably impossible values and this can be proved from the same assumptions/data used to construct the interval.

          So in some cases confidence intervals are a range of impossible values.

        • This is a key point since people almost always use CI’s as ranges of plausible values. It’s worth considering what goes wrong here.

          A CI is constructed by a method which will yield coverage of the true value 100(1-alpha)% of the time if repeated. (This assumes the repeated frequency of occurrence is similar to the probability distribution, which usually isn’t true, but assume it is true for the sake of argument.)

          Nowhere in that definition does it imply or require the CI constructed in an individual case to represent “plausible values”. The entire interval could in fact be impossible values, just as long as the coverage frequency in hypothetical repeated trials works out right.

          This can happen in practice as soon as you get away from problems that have nice simple sufficient statistics (using them as the estimator).
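          A quick simulation of that coverage definition may help separate “the procedure covers the truth 95% of the time” from “this particular interval is a range of plausible values” (a minimal sketch with an assumed normal model and known sigma, the textbook case):

          import random
          from statistics import NormalDist

          random.seed(0)
          true_mean, sigma, n = 10.0, 2.0, 25
          z = NormalDist().inv_cdf(0.975)

          covered, n_sims = 0, 10_000
          for _ in range(n_sims):
              sample = [random.gauss(true_mean, sigma) for _ in range(n)]
              xbar = sum(sample) / n
              half_width = z * sigma / n ** 0.5   # known-sigma interval for simplicity
              if xbar - half_width <= true_mean <= xbar + half_width:
                  covered += 1

          # The 95% is a property of the repeated procedure, not of any single interval.
          print(covered / n_sims)  # close to 0.95

          The simulation only shows the tidy textbook case; as noted above, coverage alone does not stop an individual interval from containing implausible or even impossible values in less tidy problems.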

        • Technical points taken. I mistakenly took plausible as a synonym for compatible, but on checking a thesaurus, better synonyms would be probable or credible. I was _wrong_ about the statement not being wrong in the unfortunate sense of confusing a confidence interval with a credible (probability) interval.

          However, I was not so wrong in saying “all statisticians need other statisticians to help them avoid explanations of statistics that may instil misleading notions”.

          On the other hand, my skipping over the technical point that intervals with a given coverage (which is what defines an interval as a confidence interval) may include egregiously incompatible or even impossible values was one of those harmless lies or simplifications.

    • Patrick:

      Thanks for the link.

      Rogers writes, “This snapshot was actually part of a wider discussion where I discredit statistical significance and say that it doesn’t quantify risk in any way, which is arguably more important to the general public. And that just because two things may reach ‘statistical significance’ doesn’t mean that they should be judged equally and that their risks are the same.”

      I think this is consistent with O’Rourke’s statement in his post, “getting across (scientifically) profitable notions of statistics to non-statisticians (as well as fellow statisticians) ain’t easy. We all are at risk of accidentally giving incorrect definitions and explanations.” and his further statement in comments, “no enemies here just allies who are having difficulty understanding what to make of what other’s are writing.”

      I assume that Rogers would agree that the statement, “Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer,” is incorrect. As Rogers writes, this is part of a larger discussion in which she was not recommending that people use statistical significance. I think it’s fine for O’Rourke to have pointed out the error, and also fine for Rogers to note the larger context.

      As Kenett writes above, “Part of the acrobatics of Jennifer Rogers is due to her attempt to bridge two worlds.” This is an acrobatics that we all have to do—unless we withdraw from public communication entirely, which would be irresponsible. Thus, I applaud Rogers for giving this public talk explaining some important statistical issues to a general audience—and I also applaud O’Rourke for pointing out specific problems in Rogers’s talk. Both Rogers and O’Rourke are performing valuable services, and there’s no reason to think of them as being in conflict or competition here.

      One other thing. In the linked thread, Rogers writes, “There is even a blog post doing the rounds calling into question my previous appointment at @OxfordStats and my vice-presidency at the @RoyalStatSoc.” I’m not sure if she’s referring to the above post, but if she is, I think she might be misunderstanding. What O’Rourke wrote was, “Jennifer Rogers was the Director of Statistical Consultancy Services at the University of Oxford, an associate professor at Oxford and still is vice president for external affairs of the Royal Statistical Society. And has a TEDx talk list on her personal page. How could they have gotten statistical significance so wrong?” This does not call into question Rogers’s appointments. It just demonstrates that even highly credentialed experts can make errors in public communications.

      Look. I’m a professor at Columbia etc. And I also published an article which, in its very first paragraph, contains the gross error of defining the p-value as “the probability that a perceived result is actually the result of random variation.” This is wrong. Wrong wrong wrong wrong wrong. And it would be fine for someone to ask how a professor at Columbia could make such a mistake. In politics we say that no one should be above the law. Similarly, in academic and scholarly discourse, no one should be above the law. If I, a Columbia professor, publish an error (in this case, a wrong description of the p-value), anyone should be able to point out that mistake in public. And they should be allowed to marvel that someone with my academic affiliation made this mistake. That doesn’t call my appointment into question.

      So, again, I think it’s great that Rogers is engaging in public communication, and it’s no slam on her that she’s sometimes gotten things wrong, just as we all have. It’s also great that O’Rourke went to the trouble of pointing out some issues here, also great that people have gone to the trouble of commenting on this blog, which allows us to clarify some of these issues, and we can all do better next time!

      • + 1. Also, Rogers had the tough task of explaining statistical significance, which I would argue is more difficult than just explaining what a P-value is, because it requires you to first explain what a P-value is, and then go one further step to explain how statistical significance is a decision-making framework with flaws, etc.

        It’s clear to me that she was trying to get the point across that statistical significance doesn’t quantify risks/effects, so she may have had the intention of discrediting statistical significance, but when you say it tells you whether or not you definitely have or don’t have cancer, it really is the opposite of discrediting. Any statistic that could do that would be near omnipotent.

        Again, not easy at all to explain that stuff in a talk where you’d want to avoid overloading your audience with confusing concepts, but again, it is incorrect, and does need to be pointed out. I think Keith’s post was fair in doing this.

      • Andrew – you are one of the few who has no problem self-retracting a study because of a retrospective mistake in calculation or approach. I did not keep a list, but over time, I believe you have done this several times. This is a great role model to be adopted in statistics at large.

        Another dichotomy (sorry for using this term that seems banned in some statistics circles….) is to consider what I call the “here and now”, which involves evaluating your current results, versus a “forward looking view”, such as generalisation of findings, transportability a la Pearl and Bareinboim, or severe testing as advocated by Mayo.

        Forward looking gets you to look at the next steps, and sometimes to review previous steps, which can lead to a retraction.

        Dealing with these aspects provides a constructive discussion on the contribution of statistics to science, industry, business, government and society. We also need to better communicate with the users/customers of statistics who follow the statistics wars with a puzzled look, because of their seemingly destructive flavour.

      • Andrew: I don’t see that what she said on her way to the main point (that identical P-values can correspond to different magnitudes of risk) is different from the allegation that statistical tests purport to offer “the false promise of certainty offered by dichotomous declarations of truth or falsity—binary statements about there being ‘an effect’ or ‘no effect’” in McShane et al 2019. While the allegation is baffling to statistical significance testers, because every inference is qualified by an error statistical assessment, it’s no surprise that when critics popularize that wrong view, others raising criticisms of tests will also depict them in that way.

        • Mayo:

          Rogers said, “Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer . . .”

          Had she said “Statistical significance is mistakenly used to tell us whether or not something definitely does or definitely doesn’t cause cancer . . .”, then that would’ve been essentially what McShane et al. said. The trouble with Rogers’s statement is that it’s false. Statistical significance does not actually tell us whether or not something definitely does or definitely doesn’t cause cancer.

          Regarding your other point: Sure, I’d be fine if Rogers and McShane et al. had further qualified their statements by saying something like, “Statistical significance, if wrongly used, can mistakenly lead people to think that it tells us whether or not something definitely does or definitely doesn’t cause cancer . . .”

          Given Rogers’s expertise in statistics, I’m sure that she doesn’t really think that statistical significance can tell us whether or not something definitely does or definitely doesn’t cause cancer. But that’s Keith’s point: even experts can make mistakes when writing or speaking, and these mistakes can mislead non-experts, hence the value of corrections.

        • Andrew: And if Wasserstein et al had said tests are mistakenly thought to promise certainty, a big part of the argument for banning the concept of statistical significance would go by the board. For example, in Wasserstein et al, 2019, we hear of “the seductive certainty falsely promised by statistical significance.” This is to assert that the tests purport to give certainty, rather than embrace uncertainty. Or the McShane et al. article: “the false promise of certainty offered by dichotomous declarations of truth or falsity—binary statements about there being ‘an effect’ or ‘no effect’.” The allegation (of denying uncertainty) is baffling to any statistical significance test user, because every inference is qualified by an error statistical assessment. I and some others argue that these are gross misinterpretations, but they’re stated as what tests purport to give us.

        • Deborah:

          I agree with McShane et al. that it’s bad to declare “dichotomous declarations of truth or falsity—binary statements about there being ‘an effect’ or ‘no effect’.” This is done all the time by users of statistical significance. I agree that it is not necessary that users of statistical significance make incorrect dichotomous statements—but they do.

        • Andrew:
          What is said in McShane et al. is that this is what statistical significance tests purport to do. It does not say that these are fallacious uses of statistical significance tests. A statistical significance test, we are told, “begins with data and concludes with dichotomous declarations of truth or falsity— binary statements about there being ‘an effect’ or ‘no effect’— based on some p-value or other statistical threshold being attained.”

          Repeatedly, statistical significance tests are claimed to output “deterministic” claims. The error statistical qualification is absent.

          There is no mention anywhere of what a statistical significance test does except for these erroneous construals, and no qualification that these assertions allude to abuses of the methods.

          I noticed Jennifer Rogers makes a statement similar to the one in the talk being criticized, at 3:54 of this other talk (it sounds like “does definitely cause”), but the next line is more apt:
          https://www.youtube.com/watch?v=FxQC2YMw8b8

          It seems to me that what’s said in McShane is actually stronger and more problematic, because, in this talk, she is referring to the kind of lists of known carcinogens – resulting from numerous tests – not isolated small P-values. She’s saying the form of the assertion, wrt known causal practices and substances that you read about, is often: X causes cancer; and her point is an ultra obvious one: the fact that two substances can cause cancer doesn’t mean they are both equally risky.

          In McShane et al., the allegations of what statistical significance tests purport to do aren’t even limited to a so-called NHST, which I admit is so often associated with a fallacious animal that we ought to drop the acronym (see Final Keepsake in my book: https://errorstatistics.files.wordpress.com/2019/04/souvenir-z-farewell-keepsake-2.pdf).

          NHST is mentioned in McShane et al. asking “can they conclude sodium is associated with—or even causes—high blood pressure as they would under the NHST paradigm?”


        • Yes, but Deborah, who is qualifying what? That is one essential question, among many others, that a consumer of statistics should ask.

          Maybe the term 'dichotomous' is being conflated with 'binary' in the abstract; that is, without describing the precise context, confusing statements get made.

        • The question of the extent to which statistical practice today results in errors of the type "misinterpretation of the meaning of statistical tests" is an empirical one.

          Suppose for example that I surveyed 1000 PhD level scientists in biology, medicine, psychology, economics, and other fields where statistical tests are routinely used.

          I give them some scenario involving collecting data and calculating some p-values… and then ask them questions regarding what the conclusion should be, which they answer in short paragraph form.

          What fraction of them would assess the evidence in an appropriate way according to Mayo’s suggestion that “every inference is qualified by an error statistical assessment” and would not “deny uncertainty” in any way?

          Just this week my wife showed me an email where someone outside her lab was analyzing some data collected in collaboration with her lab.

          It said something along the lines of “an ANOVA analysis shows a statistically significant overall effect, but the analysis of A vs B and A vs C shows no statistically significant effects, but B vs C and B vs D shows an effect even … etc etc”

          The researcher wrote in essence “I don’t know how this can be, because it doesn’t make any sense, the only thing I can think is that there must be a continuum of effects and statistics just can’t show it to us”

          (none of these are actual quotes, but rather paraphrases, I don’t have the email myself).

          The point is, the only thing this person, a postdoc in biology, can get out of their attempt at analyzing the data is that the results show conflicting information about whether there is or isn't an effect. They are incapable of really giving a coherent answer, because the truth is what they intuited: the issue isn't binary. But as far as this person knows, all that stats can do is give them binary answers to yes/no questions.

          Fortunately my wife has me to fall back on to get this data analyzed in a more sophisticated manner, but I assure you this is a *routine* situation: postdocs in biology or medicine who have taken biostats classes go off into the world committing statistics at the drop of a hat.

        • Lucky her.

          Unfortunately (or fortunately) I am sometimes tasked with taking down recently graduated statisticians making arguments such as: given p > .05, there is nothing to be concerned about; move on, this is not the exposure level you need to be worried about.

          Sometimes I lose, mostly because the senior non-statisticians involved have taken intro stats courses (OK, I think I remember that being correct).

        • Keith, it seems like basic stats 101 is a major problem for science. It induces people to turn off their brain when it comes to understanding the meaning of scientific data. Like in the case where my wife was showing me the email: what if the researcher had just plotted the raw measurements on the bivariate outcome and colored the points based on the condition? Instead of insisting on a proof of a difference or no difference, which is what they hoped to get from the stats analysis, they could have just acknowledged that there is variation, and yet a plot of the centroid of each group would probably have shown what they cared about, which is that certain types of cartilage are more similar to each other and other types are different. A quantitative representation of their qualitative by-eye impression is their main purpose. Later they may want to look at treated samples and see which portion of the phase space they are in… but all they know to ask is yes/no, is there a "significant" difference… sigh
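
          Something along these lines, as a minimal sketch of the kind of plot I mean (the file name and column names here are made up for illustration):

          import pandas as pd
          import matplotlib.pyplot as plt

          # Hypothetical data: two measured outcomes per sample plus a condition label
          df = pd.read_csv("cartilage_measurements.csv")   # invented columns: x, y, condition

          fig, ax = plt.subplots()
          for condition, group in df.groupby("condition"):
              ax.scatter(group["x"], group["y"], alpha=0.4, label=condition)   # raw points
              ax.scatter(group["x"].mean(), group["y"].mean(),                 # group centroid
                         marker="X", s=200, edgecolors="black")
          ax.set_xlabel("outcome 1")
          ax.set_ylabel("outcome 2")
          ax.legend(title="condition")
          plt.show()

          No test and no binary verdict; the variation and the separation of the group centroids are simply there to look at.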

        • Daniel,

          In my experience, quantitative people (STEM majors with an MS, for example) who had zero experience with stats are dramatically better at analyzing data when left to their own devices than those with formal exposure to stats. Even a single stat 101 course seems to nuke their ability to think about data and evidence well.

        • Daniel said, “it seems like basic stats 101 is a major problem for science. it induces people to turn off their brain when it comes to understanding the meaning of scientific data.”

          +1

        • Just saw another example of how Stat 101 nukes people’s ability to think about data:

          A scientist observes an entire population which has some kind of variation in it and, in effect, interprets the variation as due to sampling error. They immediately want to do a significance test to determine if the variation is “real”.

          There’s negligible measurement error, and the entire population isn’t being thought of as “one population among many” or anything like that. They simply can’t wrap their brain around the fact that they already know there’s a real variation.

          I can’t tell you how many times I’ve seen smart scientists make that error. I’ve never once seen someone unschooled in stats do the same, or even be tempted to make it, when left to their own devices when analyzing data.

        • Daniel said,
          “seems like basic stats 101 is a major problem for science. it induces people to turn off their brain when it comes to understanding the meaning of scientific data.”

          and Anonymous added, “Even a single stat 101 course seems to nuke their ability to think about data and evidence well.”

          Stats 101 needs to be taught in a way that at least makes a serious effort to prevent these undesirable effects. Below are some quotes from a first-day handout that show how I have tried to forestall these common misunderstandings.

          [Please note:
          1) The course was not a usual Stats 101 — it had a calculus-based Probability course as prerequisite, but much of what I give here would also be applicable to a "standard" Stats 101.
          2) I don't claim that these points worked miracles, but I think they helped set the tone for the course, and also gave me the right to say, "I told you so" if students complained that my grading was harsh.]

          “In many problems you will need to combine common sense and everyday knowledge with mathematical and/or statistical techniques.
          Some questions on homework and exams will not have one correct answer; your grade on such questions will depend largely on the case you make for your answer, rather than just on the answer itself.
          Reading assignments from the text will be given. These need to be read with attention to detail as well as to getting the general idea.
          Learning new technical vocabulary is important. Some of it will be new technical meanings that are different from everyday meanings of words.
          Writing carefully and precisely is important.”

          “I believe that it is not possible to evaluate accurately what you have learned and done in this class solely on the basis of problems that can be done within the time limits of an exam. Therefore homework problems and a project, which you can spend more time on, will be important parts of your grade.
          As mentioned above and below, grading will be based not just on the final answer or on calculations, but also on the reasoning shown in arriving at your final answer. ”

          “Homework: You will be assigned three types of homework:
          1. Reading assignments. The textbook is unusually well written, so we can make best use of it and class time by your doing reading assignments before coming to class. Then we can spend class time going over the more difficult parts of the reading, reinforcing and applying what you have read, and supplementing the text with some of the mathematical reasons behind the techniques. Be sure to read for understanding and not just superficially. Thinking about what you read, and about what we do in class, is important for learning statistics. Pay special attention to the points marked with the “caution” symbol in the margin of the book.

          2. Practice exercises. These will usually have answers summarized in the back of the book. You will not hand these in, but you will need to do them to help learn the skills and concepts that you will need to put together to do the problems on written homework assignments. Be sure to do them before the date they are assigned for, so you can ask relevant questions based on preparation and understand class discussion. (We won’t be able to discuss all practice exercises in class.) Usually practice exercises will be assigned together with the reading that they cover, to help you understand and assimilate the reading.

          3. Written homework. These problems will usually be longer and/or more involved than practice exercises and exam questions. Consider each written homework assignment as a mini-take-home exam. See Guidelines for Written Homework and Policy on Late and Make-up Work below. Also bear in mind that the answers in the back of the book are just summaries; your solutions to written homework need to be more detailed and show your reasoning more than the answers in the back of the book.”

          “Guidelines for written homework:
          1. Remember that one important purpose of written homework is to practice thinking statistically and to show me how well you have progressed in your thinking. Be sure to show your reasoning — I can’t evaluate it if you don’t show it. And keep in mind the following quote from the instructor’s manual for our textbook:
          ‘ If we could offer just one piece of advice to teachers using IPS, it would be this: A number or a graph, or a formula such as “Reject Ho,” is not an adequate answer to a statistical problem. Insist that students state a brief conclusion in the context of the specific problem setting. We are dealing with data, not just with numbers.’

          2. Do not hand in a rough draft! Be sure to spend time organizing and writing your solution. Ask yourself if you would like to read your write-up. If not, rewrite it! Part of your grade will be based on clarity of organization and explanation. After all, communicating well is part of thinking well — and making the effort to communicate clearly is an important way to develop your thinking.
          Do not hand in extra computer output. Cut and paste (either by hand or on a word processor) so that figures and computer output come as close as possible to the point in your discussion where you refer to them. In some cases, writing on computer output (especially printouts of graphs) will work.
          Reminder: Answers in the back of the book are summaries, condensed to fit in as little space as possible. Do not use them as models for written homework.
          3. Write in complete sentences.
          4. Pay attention to correct use of vocabulary. You will be learning technical vocabulary in this course. Part of what you need to learn is to use it appropriately. Be especially careful of what in language learning are called “false friends”: words that are familiar, but have a technical meaning that is different from their common meaning. “Significant” is one example of such a word.
          Also be careful not to use mathematical vocabulary inappropriately in a statistical context. In mathematics, we can often prove an assertion. In statistics, we can usually only conclude that our result supports, suggests, or gives evidence in favor of a conclusion.
          5. Use symbols correctly. One symbol often misused is the equal sign. Do not use it except to mean that the two things it is between are equal!!

          Exams:
          Do not expect exams to be just like homework. Exam questions will on average be less involved computationally than homework problems. They will often focus in more depth than homework on conceptual understanding. For example, some exam questions will test to see if you can distinguish between similar concepts. Others will be “summing up” questions to test how well you have been thinking as you learn. Others will provide you with computer output and ask you to answer questions based on that output and a description of the study from which it came.

          Class Attendance and Participation: This is important for two reasons:
          1. We will be covering material in class that is not in the textbook.
          2. Discussion is very helpful in learning statistical concepts and statistical thinking.
          Since the class is fairly large for class discussion, I will divide the class into two groups, which will alternate taking primary responsibility for responding to questions in class. When it is your group’s turn to be responsible, be prepared to put solutions on the board or the doc cam as well as answer questions on the reading and exercises. But remember that answers to questions that have answers in the back of the book usually need to be more detailed than the answers in the back of the book, need explanations, and need to be rephrased in your own words.
          Of course, you will need to do assignments for all days, since one day’s assignment typically builds on the previous days’.
          Please note: I expect students to make mistakes in class participation. Sometimes we learn best from our own or others’ mistakes. What I look for in class participation is that you are trying, and thinking.

          Ethical matters:
          Statistical ethics: Statistics consists of a collection of tools which, like any tools, can be used either for good or ill. It is your responsibility as a citizen of the world to be sure not to misuse these tools. I encourage you to read the Ethical Guidelines for Statistical Practice developed by the American Statistical Association, available on the web at http://www.amstat.org/profession/index.cfm?fuseaction=ethicalstatistics

          Authorized and unauthorized collaboration: Since the University defines collaboration that is not specifically authorized as academic dishonesty, I need to tell you what collaboration is and is not authorized in this class.
          The following types of collaboration are authorized:
          1. Working on homework with someone who is at roughly the same stage of progress as you, provided both parties contribute in roughly equal quantity and quality (in particular, thinking) to whatever problem or problem parts they collaborate on. In fact, I encourage this type of collaboration!
          2. A moderate amount of asking, "How do I do this on (the statistical program used)?" However, as you gain enough familiarity, you should get in the habit of using on-line help and trying logical possibilities, then asking for help only if these don't succeed after a reasonable try.
          The following types of collaboration are not authorized:
          1. Working together with one person as the do-er and the other as the follower.
          2. Any type of copying. In particular, splitting up a problem so that different people do different parts is not authorized collaboration on homework. (A certain amount of this may be appropriate on your project.)
          3. Possession or consultation of the Instructor’s Solution Manual.
          Academic dishonesty aside, asking anyone, “How do I do this problem?” (as opposed to questions like, “How do I carry out this detail of this technique?” or, “I’m not sure whether to proceed this way or this way; here is my thinking about each possibility; am I missing something?”) is just cheating, since it avoids the most important part of learning statistics: developing your statistical thinking skills.”

        • Martha, your notes seem like you probably had a great class. I always liked creating projects that tied multiple ideas together. I wrote up a series of projects for teaching an engineering computing course. They started with the equations of motion of a ball in 2D, then used dimensional analysis to derive a drag expression, then some data led to a regression to find an expression for a drag coefficient, then ideas about how to solve ODEs by iterative methods (which taught looping), how to interpolate to find the horizontal distance at projectile impact, and then ideas of optimization: finding angles for the maximum-distance trajectory… doing inference on fluid viscosity by shooting a ball at a known speed… All building on a simple idea and adding complexity naturally as you asked more questions. Of course I wasn't in charge of the class, so only a few of the lessons got used.
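
          Roughly the core of that exercise, as a minimal sketch (the mass, drag coefficient, launch speed, and step size below are placeholder values, not the ones from the course): forward-Euler integration of a 2D trajectory with quadratic drag, plus a brute-force search over launch angles.

          import numpy as np

          def shot_range(angle_deg, v0=30.0, mass=0.145, c_drag=0.002, dt=0.01, g=9.81):
              """Horizontal distance travelled; quadratic drag F = -c*|v|*v (placeholder parameters)."""
              theta = np.radians(angle_deg)
              x, y = 0.0, 0.0
              vx, vy = v0 * np.cos(theta), v0 * np.sin(theta)
              while True:
                  speed = np.hypot(vx, vy)
                  ax = -(c_drag / mass) * speed * vx        # drag decelerates the horizontal motion
                  ay = -g - (c_drag / mass) * speed * vy    # gravity plus drag vertically
                  x_new, y_new = x + vx * dt, y + vy * dt
                  vx, vy = vx + ax * dt, vy + ay * dt
                  if y_new < 0:                             # crossed the ground: interpolate the impact point
                      return x + (y / (y - y_new)) * (x_new - x)
                  x, y = x_new, y_new

          # Brute-force optimization: with drag, the best angle comes out below 45 degrees
          angles = np.arange(1.0, 90.0, 0.5)
          best = max(angles, key=shot_range)
          print(f"best angle ~ {best:.1f} deg, range ~ {shot_range(best):.1f} m")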

          Most students at USC didn't like it; they seemed to want quick, simple, and certain answers to textbook-type problems. Low risk. But you could always tell who the best students were, because they would eat that stuff up.

          which textbook did you use for your class?

        • Daniel asked,
          “which textbook did you use for your class?”

          Introduction to the Practice of Statistics, 5th edition, 2005/6, by Moore and McCabe

        • Deborah:

          Also we should clarify: in your comments you attribute the following phrase to McShane et al.: "the false promise of certainty offered by dichotomous declarations of truth or falsity—binary statements about there being 'an effect' or 'no effect'."

          McShane et al. never write this in their article. The closest is this: "In brief, each is a form of statistical alchemy that falsely promises to transmute randomness into certainty, an 'uncertainty laundering' (Gelman 2016) that begins with data and concludes with dichotomous declarations of truth or falsity—binary statements about there being 'an effect' or 'no effect'—based on some p-value or other statistical threshold being attained. A critical first step forward is to begin accepting uncertainty and embracing variation in effects (Carlin 2016; Gelman 2016) and recognizing that we can learn much (indeed, more) about the world by forsaking the false promise of certainty offered by such dichotomization."

          I think it’s best to avoid confusion by not putting non-quotes in quotation marks.

          In their paper, McShane et al. are clearly talking about what people wrongly do with statistical significance, not what they should or must do.

        • Andrew: The quote is from Wasserstein et al. 2019:

          “McShane et al. in Section 7 of this editorial. ‘[W]e can learn much (indeed, more) about the world by forsaking the false promise of certainty offered by dichotomous declarations of truth or falsity—binary statements about there being ‘an effect’ or ‘no effect’—based on some p-value or other statistical threshold being attained.’”

          From this they extract the “seductive certainty falsely promised by statistical significance”.

          You publish so much; I see now there is another co-authored McShane et al. 2019, without an 'a' or 'b' suffix, so I can see why the person I asked to look up the references for my blog comment assumed it was one and the same.

          https://errorstatistics.files.wordpress.com/2019/05/wasserstein-et-al-2019-moving-to-a-world-beyond-p-0-05.pdf

    • The added full stop on the YouTube quote of Jennifer Rogers is like a hanging comma in "lies, damn lies; and statistics": an alternative representation with surface similarity to "lies, damn lies and statistics". It looks the same but has a totally different meaning.

      An alternative representation with meaning equivalence would be "there are lies and damn lies, and, on the other side, is statistics".

      The idea of verbal generalisation of findings is to produce a list of statements with meaning equivalence, representing the research findings, and, next to it, a list of alternative representations with surface similarity.

      This would communicate what is claimed as findings of the research and a list of what is not claimed, i.e. not supported by the research. For an example in the context of treatment of allergies see doi:10.1111/pai.13115, 2019.

  10. The laughable misuse of the phrase “radioactive device” in this context rather underlines the fundamental lack of seriousness involved, don’t you think? As a great man once observed, “You keep using that word. I do not think it means what you think it means.”

  11. P-values are fun to think about. Suppose you have a test statistic T defined on the unit circle. The null hypothesis is that the parameter mu is at the top of the circle (the point at angle pi/2). Under this null hypothesis, the distribution of the test statistic is maximal at the top, decreases as you go further down, and is smallest at the bottom (angle -pi/2).

    Now the value of the test statistic T(data) is the point at -pi/4. Given this, how do you compute the p-value?

    Do you include just the region from -pi/4 to -pi/2? or do you include -pi/4 to -3pi/4? What if your life depended on the outcome and one choice said “significant” and the other said “not significant”?
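
    To make the ambiguity concrete, here is a minimal sketch. The von Mises density centered at pi/2 (with an arbitrary kappa = 2) is my own stand-in for the "max at the top, smallest at the bottom" null; the two candidate regions give different answers:

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import i0

    kappa, mu = 2.0, np.pi / 2                 # arbitrary concentration; null centered at the top
    f = lambda th: np.exp(kappa * np.cos(th - mu)) / (2 * np.pi * i0(kappa))   # von Mises density

    obs = -np.pi / 4                           # observed angle

    # Candidate 1: only the arc from the observation down to the bottom on this side
    p_one_sided, _ = quad(f, -np.pi / 2, obs)

    # Candidate 2: the arc through the bottom out to the mirror point -3*pi/4,
    # i.e. every angle whose density is no larger than the density at the observation
    p_two_sided, _ = quad(f, -3 * np.pi / 4, obs)

    print(p_one_sided, p_two_sided)            # by symmetry the second is exactly twice the first

    Nothing in the tail-area recipe itself picks one of these regions over the other.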

      • I ran across a lesser known nugget in the history of statistics while doing paleomagnetism. It’s a long story we can talk about sometime, but basically Fisher created a normal-like distribution on the sphere (i.e. directions for the magnetic field) and created some test statistics for it. I think he did it in the 50s. Anyway, it’s obvious crud for a variety of reasons and the paleomagnetists have long since moved on.

        • Not always “entirely.” In many problems some regions are more natural than others. In this case, if the wrap-around cases can be distinguished from the non-wrapped cases then the distances are different. A person who sails around the world and back to the same dock is different from a person who sails around the harbor and back to the dock.

      • Yes, but that can be sorta avoided due to the fact that they’re so far apart (like Pakistan and Bangladesh becoming separate countries). Once you get more topologically complicated spaces like a figure 8 or a torus or something it’s harder to sweep the issue under the rug.

        The issue is that p-values are only serviceable in some simpler problems because there’s a monotonic relationship between the probability of the data and the tail area p-value. So using the p-value is in effect equivalent to using the probability of the data. If you continue trying to use the p-value beyond those simple cases, you start running into absurdities.

        • It already becomes problematic in bimodal distributions. Suppose you have two typical situations in your experiment… values are near 0 ± 1 or values are near 10 ± 1, two little normal-distribution bumps… Now you get a data point of 5 and want to know if this indicates a violation of your usual observations. The right tail area is about 0.5, the left tail area is about 0.5, the confidence interval for results is from about -2 to 12, but the probability of being anywhere between 3 and 7 is basically 0… so the p-value completely fails as a measure of anything.
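
          A quick numeric check, treating the "two bumps" as a 50/50 mixture of N(0, 1) and N(10, 1) (my own concrete stand-in for the setup above):

          from scipy.stats import norm

          # 50/50 mixture of N(0, 1) and N(10, 1); the observed value 5 sits in the empty valley
          mix_cdf = lambda x: 0.5 * norm.cdf(x, 0, 1) + 0.5 * norm.cdf(x, 10, 1)

          x = 5.0
          print("right tail P(X >= 5):", 1 - mix_cdf(x))     # ~0.50
          print("left tail  P(X <= 5):", mix_cdf(x))         # ~0.50
          print("P(3 <= X <= 7):", mix_cdf(7) - mix_cdf(3))  # ~0.0013: essentially never happens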

        • Anonymous, I don't see how this is a hard problem. As Jeff Walker's elevator pitch mentioned, you need a null distribution for the test statistic, in which case (as David P points out) it just boils down to 1-sided or 2-sided.

          Even a more complex object shouldn’t be a problem if you have a probability measure and a distance metric for the object.

          In this more general case, you would only have something analogous to a two-sided p-value, which would be defined as the integral of the probability measure over the region of the object as far from or farther from the null hypothesis than your test statistic.

          What am I missing?

        • Hang on, my definition wasn't correct; I had the multi-modal case in my head but it didn't make it through my fingers; a p-value could be defined as the integral over the object where the probability measure was equal to or less than the value of your test-statistic.

        • Haste makes waste. Let me restate the whole thing.

          I put it to you that a more complex object shouldn’t be a problem if you have a probability measure and a distance metric for the object.

          In this more general case, you would only have something analogous to a two-sided test. A p-value could be defined as the integral over the region of the object where the probability measure is equal to or less than the value of the probability measure at your test-statistic.

          But I must be missing something?
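
          If I have that definition right, here is a minimal grid-based sketch of it for the circle example (again assuming a von Mises null at pi/2, with an arbitrary kappa = 2 as a stand-in for the density described above):

          import numpy as np
          from scipy.special import i0

          kappa, mu = 2.0, np.pi / 2
          density = lambda th: np.exp(kappa * np.cos(th - mu)) / (2 * np.pi * i0(kappa))

          obs = -np.pi / 4
          grid = np.linspace(-np.pi, np.pi, 200_001)          # fine grid over the circle
          d = density(grid)

          # Integrate the density over the set of angles where it is no larger than at the observation
          step = grid[1] - grid[0]
          p = d[d <= density(obs)].sum() * step
          print(p)   # for this unimodal null this is the -3*pi/4 .. -pi/4 arc, i.e. the "two-sided" answer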

        • David, suppose our probability measure is p(theta) where theta is the angle counterclockwise from the horizontal position and goes from -pi to pi, along the lines of normal math measuring of angles in trigonometry.

          Now as Anon says p(pi/2) (straight up, 12 o’clock etc) is the highest density region for observations.

          Something jostles your apparatus and you want to know if it’s still well calibrated. So you take a data measurement, it gives angle -pi/4

          Please, using the notation

          p = integrate(function,variablename,lower,upper)

          and plugging in whatever functions, variables, and ranges are appropriate, describe to me how you can calculate a p-value for the hypothesis "this data point was generated by the same process as we saw before the apparatus was jostled" *using the angle itself as the test statistic*, and why your choice is the unique, obviously correct answer.

          Yes, you can design other test statistics, that isn’t in question. In fact using the density as a test statistic is a method I’ve advocated here for checking the adequacy of Bayesian models.

    • “What if your life depended on the outcome and one choice said “significant” and the other said “not significant”?”

      This can be applied to Bayes factors, posterior probabilities, any statistic you have a strict cutoff on… so it is not a good criticism against p-values IMO.
      Would I "bet my life" on any single outcome, no matter how small the p-value? Probably not. But what if 500 studies showed "significant"? Well, that could be a different story (assuming sound experiments and no QRPs).

      “P-values are fun to think about. Suppose you have a test statistic T defined on the unit circle. The null hypothesis is that the parameter mu is at the top of the circle (the point at angle pi/2). Under this null hypothesis, the distribution of the test statistic is max at the top and decrease as you go further down, and is smallest at the bottom (angle -pi/2).

      Now the value of the test statistic T(data) is the point at -pi/4. Given this, how do you compute the p-value?

      Do you include just the region from -pi/4 to -pi/2? or do you include -pi/4 to -3pi/4?”

      What possible values can T take on?

      I know there are fields of wrapped and circular (and spherical, etc.) distributions, for example by Fisher and von Mises. I don’t know much about them, however.

      Justin

      • Also, as pointed out, I am familiar with the von Mises-Fisher distribution, since it was created for work in paleomagnetism (the study of ancient magnetic fields from magnetism frozen in rocks at formation) and I've worked in paleomagnetism for a bit.

        As to your point about cutoffs, making a decision using a cutoff is in effect making an approximation beyond anything warranted by the evidence/assumptions. As such it can lead to inherent problems no matter who does it.

        But two points: (1) A bayesian understanding shows when the approximation is going to be a good one and hence acceptable.
        (2) bayesians can just avoid the approximation by looking at the entire posterior for the parameter and using it in whatever analysis follows.

        what do p-value-nistas got?

  12. Can I suggest that for nonstatisticians, the implications for interpretation are more relevant than the formal definition, and the interpretations are routinely in error?

    The frequentist implementation of statistical procedures produces two measurements: a measure of magnitude (the point estimate) and a measure of the precision of that estimate. If the measure of magnitude is close to the chosen null (and "close" can be defined as too small to be material based on the issue being analyzed), it doesn't matter whether we can reject the null or not. Our best estimate is that the effect is small. Even if the precision is high, in part because the sample is large, and the standard error is such that a strict test of magnitude against the null leads to rejection of the null, the effect is still small. One of the errors of interpretation is that the result, being significant, is important, or material, or strong. We see this in clinical research all the time.

    If the measure of magnitude is large, i.e. material, and the precision is high, the p-value is low and the null is rejected. The point estimate will be taken as the best measure of the effect, and that estimate of the effect will be reported, unless we are in the large body of really bad research that ignores magnitudes, reports only sign and p-value, and treats small p-values as evidence of a strong effect. This is also seen in much clinical research.

    If the measure of magnitude is large but the precision is low, in part because of small samples, the p-value may not rise to "significance." There are many possible misinterpretations here, the most common being that because we can't reject the null, we should assert the relationship is zero. The problem of imprecision is never formally addressed or acknowledged, but there are many cases where what would be material associations are simply treated as zero because the null can't be rejected.
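
    A toy numerical illustration of those three situations, with invented estimates and standard errors and a plain normal approximation throughout:

    from scipy.stats import norm

    def summarize(label, estimate, se):
        z = estimate / se
        p = 2 * norm.sf(abs(z))                             # two-sided p-value against a null of zero
        lo, hi = estimate - 1.96 * se, estimate + 1.96 * se
        print(f"{label}: estimate={estimate:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), p={p:.4f}")

    summarize("tiny effect, huge sample", 0.02, 0.005)      # "significant" but immaterial
    summarize("large effect, high precision", 0.50, 0.10)   # significant and material
    summarize("large effect, small sample", 0.50, 0.35)     # not "significant", yet hardly evidence of zero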

    • My problem is that you should be studying the relationship between variables, not looking for "an effect". E.g., in medicine this would often be a dose-response curve, or response over time under various conditions in various patients (and the average often does not look like the individual curves!). Then you come up with a model to explain the shape of the curves and use it to make predictions. No one can come up with a useful model to explain "an effect".

      The entire practice of looking for “an effect” to begin with is what needs to go away. Adding textbooks worth of generic and pedantic math on top of that just serves to hide what is going on.

  13. I think the initial post by Keith O’Rourke gives a way too uncharitable view of what Jennifer Rogers is saying about p-values in that presentation.

    Here's a less critical view: She's right. Yes, one could find a number of reasons for criticizing it as not technically correct. However, let's say, for argument's sake, that the study, or studies, we are talking about were correctly designed to do just that: tell us if some biologically vetted, causal factor is associated with a higher cancer frequency. This would mean that the selection was done right, systematic error was eliminated, a clinically meaningful effect size as well as a reasonable significance level was decided upon, a specific and proper power and sample size were settled on, etc.

    Then the p-value is there to tell us what to do: Discard the involvement of the factor in cancer, or (and this should not be controversial at all) ACCEPT the factor as causing cancer.

    All right, she says “*definitely* does or *definitely* does not”, but come on…

    I get a little bit tired of "what the p-value really means, is…" type comments. The definition of a p-value is basically a mathematical one. But statistics is an *applied* science. What is the p-value supposed to mean in the context of a *study*? Well, it means what she said.

    • HP:

      I disagree with your claim that the statement, “Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer,” is right. This statement is wrong. By wrong, I don’t just mean it’s “not technically correct,” nor do I mean that it would be correct if you remove the word “definitely.” I actually mean that it’s a wrong statement and that it is misleading. Indeed, it’s a misconception that causes big problems in applied statistics.

      You ask, “What is the p-value supposed to mean in the context of a *study*? Well, it means what she said.” No, it doesn’t. Yes, researchers and reporters often act as if statistical significance tells you whether an effect is real or not, but the problem is that in real life it doesn’t. In real life it often happens that statistically significant differences are found that do not replicate. I agree with you that statistics is an applied science. That’s why we get concerned when numbers are used to make strong and incorrect conclusions. Bem’s ESP study is one of zillions. That’s why we’ve been talking about the replication crisis. See for example this paper by Kate Button et al. from 2013 for one of many discussions of the topic.

      P.S. Again, as Keith wrote, these sorts of errors are easy to make. Right at the beginning of one of my published papers, the p-value is defined as "the probability that a perceived result is actually the result of random variation." That's wrong. It's not just "not technically correct," it's wrong. Statistics is hard, it's easy to make wrong and misleading statements, and it's good for us to correct these errors.

      • I do not make a general claim about that statement. I am talking about it in the context of her talk, and I even draw up a scenario where it would be correct to isolate the p-value as *the* defining factor for decision-making. And a key factor in that scenario: a "biologically vetted, causal factor." That is, a very sound alternative to the null hypothesis, which all good studies have. So ESP studies aren't really relevant to the point I am making.

        Of course, unsound studies with a very tentative H1 sometimes end up with small p-value calculations. But they should first and foremost be criticised for being bad studies. We do not know what studies Rogers is talking about. What if they are really good ones?
        So, again, if one chooses to understand Rogers as saying that “we should ALWAYS start our decision-making by ONLY looking at the p-value” or if one would claim that this is the message that the audience hears, I guess I just don’t hear or see it.

        By the way, could I ask you, or anyone who feels Rogers' statement is very problematic: Could you frame briefly how you want the p-value of a good study to be *used*? I do not mean what it *is* or one of the many examples of what it should *not* be used for.

        • There are two good uses for p values that I’ve been able to figure out in applied studies.

          1) to decide if a particular data point should be studied as unusual compared to a large database of past "usual" events. Like, for example, if a seismometer day in and day out records low-level vibrations and suddenly something occurs whose vibratory magnitude is so high that p = 0.0028 relative to the past database (sketched below).

          2) When you have a theoretically derived model and want to show that things that happened in the past are compatible with your model, so that p=0.37 is for example taken to show that your model can’t be clearly falsified by this data.

          that’s it.
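
          For use (1), a minimal sketch along those lines; the "historical database" and the new reading are invented here, and the p-value is just the fraction of past events at least as extreme as the new one:

          import numpy as np

          rng = np.random.default_rng(0)
          past_amplitudes = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # stand-in for years of "usual" readings
          new_reading = 12.0                                                # the anomalous event

          # Empirical tail probability of the new event relative to the past record
          p = np.mean(past_amplitudes >= new_reading)
          print(p)   # tiny: flag the event as unusual and worth investigating, nothing more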

        • Because 1) and 2) mostly talk about single studies, I’d add to that list:

          3) meta analysis, or looking at p-values from similar repeated studies, the ‘whole’ of the evidence

          Justin

      • Holy cow, I just realised whose blog this is (I came straight here from a Twitter link)!

        So I guess you’re not the right person to ask for reflections on the usefulness of p values… :-)

        • HP:

          Of course I’m the right person to ask. Indeed, I published a paper a few years ago, P-values and statistical practice, which directly addresses the question.

          The statement, “Statistical significance just tells us whether or not something definitely does or definitely doesn’t cause cancer,” is wrong. That doesn’t mean that p-value or statistical significance give no information. But it’s a mistake to use them to decide that an effect exists or does not.

        • Indeed you are, and I absolutely agree with your point. Let me explain: I engaged here thinking I was discussing with frequentists, and I had an idea to pursue the discussion until it possibly brought out some of the weak aspects of the concept of statistical significance, which I find interesting. But when both parties are NHST skeptics, as I realised is probably the case, that won't work. So I'll jump back in when there's a thread dealing more specifically with that, I guess. :-)

  14. It's amusing to see the same old discussions rehashed ad nauseam as we go into 2020. The discussion doesn't progress because there hasn't been a clear admission from statisticians, scientists, and especially frequentists, that Fisher, Neyman, and Pearson made a massive mistake that led science down a century-long rabbit hole of absurdities.

    They tried to generalize an entire system of statistics from simple cases where any reasonable statistical philosophy gives the same answers. That would be fine, but they based it on the wrong statistical philosophy (probabilities = frequencies), and got the generalization wrong.

    Specifically,

    (1) the tail area p-value only really works when it’s a monotonic function of the probability of the data. The probability of the data is fundamental. From a bayesian perspective, if the observed data sits low in the probability density of a model, then the model can easily be beat by even low initial probability challenger models. So it serves as a warning to find those better models. You don’t need p-values to do this. Gelman for example has a bunch of better and more general ways to check models.

    (2) Confidence Intervals aren’t a range of plausible values like everyone thinks they are. The entire confidence interval in real problems can in fact be impossible values.

    (3) When you do a frequentist significance test and accept the null, you're in effect doing the following: the evidence/assumptions say there's a range of plausible values for a parameter, but you're going to reduce that range to a single point (the null). This can be fine if the original range of plausible values was narrowly concentrated around the null, but in the vast majority of cases of significance testing in the wild, the range of plausible values is large. So replacing it with a single point leads to massive errors.

    Note all of this is true even if all the other problems (definitions not being taught right, frequency histograms approximating a probability distribution upon infinite repeated trials, assumptions being wrong, and so on) aren’t there. Even accepting the best case frequentist scenario, it’s a disaster.

    Fisher, Neyman, and Pearson just got it wrong! It’s just wrong. It doesn’t mean you can never use one of their methods. It doesn’t mean every paper that uses their methods reached a wrong conclusion. But it does mean the whole edifice is fundamentally flawed.

    But if you're not willing to accept that, can we at least agree to stop making the absurd claim that frequentist statistics led a big chunk of modern science into disaster because p-values are just sooooooooooo much harder to teach right than any other concept?
