The statistical significance filter leads to overoptimistic expectations of replicability

Shravan Vasishth, Daniela Mertzen, Lena Jäger, et al. write:

Treating a result as publishable just because the p-value is less than 0.05 leads to overoptimistic expectations of replicability. These overoptimistic expectations arise due to Type M(agnitude) error: when underpowered studies yield significant results, effect size estimates are guaranteed to be exaggerated and noisy. These effects get published, leading to an overconfident belief in replicability. We demonstrate the adverse consequences of this statistical significance filter by conducting six direct replication attempts (168 participants in total) of published results from a recent paper. We show that the published claims are so noisy that even non-significant results are fully compatible with them. We also demonstrate the contrast between such small-sample studies and a larger-sample study (100 participants); the latter generally yields less noisy estimates but also a smaller effect size, which looks less compelling but is more realistic. We make several suggestions for improving best practices in psycholinguistics and related areas.

Shravan asks all of you for a favor:

Can we get some reactions from the sophisticated community that reads your blog? I still have a month to submit and wanted to get a feel for what the strongest objections can be.

243 thoughts on “The statistical significance filter leads to overoptimistic expectations of replicability”

    • “Doesn’t “et al” mean “and others” (plural)? There only seems to be the one other author…”

      Yes, I think you are correct.

      I think the APA rules (APA Publication Manual, 6th ed., page 177) state that when first referring to a paper with 4 authors in the text, one should list all 4 authors. When referring to that same paper a 2nd time, one should use the 1st name + et al. It seems that in all instances concerning in-text references and the possible no. of authors and the rules concerning using “et al.”, the “et al.” part then refers to more than 1 person.

      If this is correct, and professor Gelman wanted to adhere to the APA guidelines in this case, he should have listed the 4th author instead of “et al.” (and added the year). Even without wanting to adhere to the APA guidelines, I think “et al.” should indeed still at least refer to more than 1 person. So, without wanting to adhere to the APA guidelines, he should perhaps have written “Vasishth, et al.”, or “Vasishth, Mertzen, et al.”.

      Perhaps professor Gelman was trying to do 2 things at the same time: list all the authors, but leave himself out as the 4th author. Or he may have been feeling rebellious against the APA Publication Manual, which tells people how to refer to scientific papers. Perhaps we will never know the true origins of this possibly erroneous usage of “et al.”. However, if it’s a form of rebellion against the APA Publication Manual, I salute him: stick it to the man!

        • “Maybe he was using it as an abbreviation for et alius rather than et alii.”

          Thanks for the comment/correction!

          I even looked up “et al.” before posting my previous comment, and it did look like “et al.” only refers to more than 1 person (https://en.wiktionary.org/wiki/et_al.).

          That wiki-link states “et al.” could be an abbreviation for “et alia”, “et aliae”, or “et alii”. And it lists these terms, and “et alios”, as related terms, all of which I checked and interpreted to also refer to more than 1 person.

          I now wonder why “et alius” is not included in that wiki page.

      • Actually, APA says that if there are 5 authors, all should be reported, and thereafter, “1st author et al.” If there are 6 or more authors, just “1st author et al.”

        Indeed the literal meaning of “et al.” implies more than one author. In practice, the convention is that it applies to any number of authors that are not mentioned, even one.

  1. I think my main take-away from reading the paper is that it’s really two papers. It’s a broad paper on replicability and ‘the statistical significance filter’ (i.e., the title) and simultaneously a very focused paper on psycholinguistics. If the aim is for the paper to follow the title and abstract, I would cut a lot of the linguistics content. If the aim is for this to go to a psych journal I would rework the content somehow, maybe put more of the statistics portions into appendices or pointers via references.

    I don’t have any ‘strong objections’ beyond focus, but I’m not a psycholinguist.

  2. Sorry, but isn’t this obvious?

    That said, I guess the repetition is still useful because the scientific community still hasn’t dealt sufficiently with the consequences of this problem.

    • I think it may help to have different versions of this argument appear in different sub-field-specific venues. This kind of thing is obvious to, e.g., readers of this blog, but it may well be that it’s less obvious to (most) psycholinguists.

  3. A p-value less than .05 has nothing to do with replicability.

    It means, if everything was done correctly, and all results are reported, the long-run probability of making a sign error is no greater than 5%.

    If you want to talk about replicability, you need to consider the type-II error probability and power; a study with 6% power has only a 6% chance of producing a significant result again (successful replication). A study with 99.9% power has a nearly perfect chance of producing a significant result again. p < .05 doesn't tell you how much power the study had to get p < .05. Nothing does. No Bayesian magic can tell you that either.

    The best you could do would be to compute a confidence interval around the effect size (or, if you are confident in constructing reasonable priors, a credibility interval). You can then see what power a replication study would have at the lower limit of the CI. Often you can spare yourself this exercise because the lower limit will often imply very low power. In other words, you just don't know whether you have 5% or 90% replicability. Unless your p-value is really small, p < .0001. In that case, you have high replicability.

    This is all a rather trivial application of the Neyman-Pearson approach to statistical inference. Bayesians may prefer to ignore it, but then they shouldn't talk so much about p < .05, which is only a reasonable criterion in the NP framework.
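
    A minimal sketch of the CI exercise described above, under a simple normal approximation; the observed effect, standard error, and alpha below are invented purely for illustration (Python):

    # Take an observed effect and its standard error, form a 95% CI, then ask
    # what power an exact replication (same standard error) would have if the
    # true effect were as small as the lower CI limit. Illustrative numbers only.
    from scipy.stats import norm

    obs_effect = 0.30   # hypothetical observed effect
    se = 0.14           # hypothetical standard error of that estimate
    z_crit = norm.ppf(0.975)

    lower, upper = obs_effect - z_crit * se, obs_effect + z_crit * se

    def power(true_effect):
        """Two-sided power of a replication with the same standard error."""
        z = true_effect / se
        return norm.cdf(-z_crit - z) + 1 - norm.cdf(z_crit - z)

    print(f"observed p = {2 * (1 - norm.cdf(obs_effect / se)):.3f}")
    print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
    print(f"replication power if the true effect equals the lower CI limit: {power(lower):.2f}")
    print(f"replication power if it equals the point estimate:              {power(obs_effect):.2f}")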

    • To be clear: 6% power means the probability of obtaining a significant result in an exact replication study is only 6%.

    • A p-value less than .05 has nothing to do with replicability…. a significant result again (successful replication)

      Nothing aside from defining what counts as a result in the first place or what counts as a successful replication, anyway.

    • Less snarkily, we don’t know the true effect size. If the statistical significance filter systematically exaggerates effect size estimates, and if sample sizes for replications are determined, in part, by exaggerated effect size estimates, then a p-value less than .05 will have something to do with replicability.

      • This is exactly right. P(significant result) will rise from alpha at d = 0 to 1-beta at d = x (where ‘x’ is whatever the authors based their power calculations on). The consequence is a lot of significant results that are underpowered, overestimated, and hence difficult to replicate, even though they might look pretty on paper because they are overestimates.

        This characteristic of N-P testing represents a serious flaw with their approach imo because that approach assumes a dichotomous world in which effects are either 0 or whatever one bases their power calcs on. The population of effect sizes among variables is generally continuous in nature (with some exceptions), and the so-called ‘error control’ of NP doesn’t take that into account. In short, it’s a horrible method any time the true effect lies in-between the Ha and Ho.

        The more times and more ways that message can be gotten out the better.
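
        A small simulation of the point made in this sub-thread: condition on p < .05 in an underpowered design and the surviving effect estimates are exaggerated. The true effect and sample size below are arbitrary, illustrative choices (Python):

        # Underpowered two-group design with a small true effect: the significant
        # results are the ones whose effect estimates overshoot the truth.
        import numpy as np
        from scipy.stats import ttest_ind

        rng = np.random.default_rng(1)
        true_d, n_per_group, n_sims = 0.2, 20, 5000   # invented numbers

        estimates, pvals = [], []
        for _ in range(n_sims):
            a = rng.normal(true_d, 1, n_per_group)
            b = rng.normal(0.0, 1, n_per_group)
            estimates.append(a.mean() - b.mean())
            pvals.append(ttest_ind(a, b).pvalue)

        estimates, pvals = np.array(estimates), np.array(pvals)
        sig = pvals < 0.05

        print(f"proportion significant (empirical power): {sig.mean():.2f}")
        print(f"mean estimate over all runs:      {estimates.mean():.2f}")
        print(f"mean estimate, significant only:  {estimates[sig].mean():.2f} (true d = {true_d})")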

    • Ulrich:

      You write, “Bayesians . . . shouldn’t talk so much about p < .05.”

      It happens that I don’t just analyze data; I also spend time trying to understand the published work of others, for example the claim that early childhood intervention increased earnings by 42% (see section 2.1 of this paper).

      I think just about nobody would’ve taken this claim seriously had it not been accompanied by the magic “p less than .05.”

      • Andrew – you forgot the easy fix to the problem: change the numbers in the published version to something somewhat more believable (that also happens to be statistically significant).

        Working Paper Version: “Stimulation increased the average earnings of participants by 42 percent.”

        http://www.nber.org/papers/w19185

        Version published in Science: “…the intervention increased earnings by 25%”

        https://www.ncbi.nlm.nih.gov/pubmed/24876490

        I actually think both numbers are in both papers (or very similar ones), they just changed which estimate they wanted to highlight. But it is funny to me that the 42% sticks in your head even though the headline published result is only 25%. Which change possibly occurred in part because you were such a jerk about it on the internet. Which is progress, of a sort, for science…I guess.

        • Jrc:

          I contacted the first author of this paper three different times, including twice through intermediaries, but neither he nor any of the other authors responded to my questions in any way. And I doubt the published article refers to any of my writings on the topic. So if they changed their estimate from 42% to 25%, I don’t think I deserve any credit!

      • Andrew,

        There is a simple way to criticize the reporting of this. The 42% number is not an effect size. It is a statistical parameter (probably a regression coefficient). There is a long way from this statistical parameter to claims about a generalizable inference about the magnitude of a cause-effect relationship.

        First, the point estimate in a sample is a noisy measure of the population parameter (which still is only an estimate of the magnitude of the effect).

        If this 42% estimate is obtained in a small sample, the sampling error is large and the 95% confidence interval is wide. So, a correct report of the result would be that the population parameter is somewhere between 5% and 95%. The 42% point estimate is meaningless and should be ignored.

        Whether even this interval is wrong and the correct interval would include 0 because of the significance filter depends on whether the authors used a significance filter or not. If they did a preregistered study and this is the only test of this particular relationship based on some theoretical prediction, there is no filter and no adjustment is needed. If they did an exploratory regression with 10 predictors, standard adjustments apply and the CI would be wider.

        Finally, reporting 42% as an effect size has nothing to do with significance testing, unless we just call all bad practices significance testing, but that would be unfair to those who use significance testing correctly.

        • Ulrich:

          42% was an estimated regression coefficient; the corresponding 95% interval was something like [2%, 82%]; not really because of transformations but that’s the basic idea. Under certain assumptions, which were not close to being satisfied in this example, the coefficient estimate (in this case, 42%) is an unbiased estimate of the effect size. I agree with you that the estimate is pretty useless as is. But I also don’t see anything special about the endpoints of the interval. In a (hypothetical) large-sample preregistered replication, I could well imagine an effect of 2% or -2%, but I’d be gobsmacked to see 82% or, for that matter, 42%.

          And, yes, the reported estimate of 42% has everything to do with significance testing, as it would not have been reported had it not been associated with p less than .05.

  4. I will voice a generic observation on many articles not unlike this in the broadly defined psych methods literature which do not seem to build on quite similar technical work that has preceded it. I am not commenting at all on the “case study” aspect of this paper, which I like and think is sometimes the only way that folks in a particular field can be made to pay attention. But the methodologic elements have been quite thoroughly explored before, making virtually the identical point here, using quite similar language. While it is almost certainly true that psycholinguists may not be familiar with this literature, it is the job of methodologists writing for them to make them aware, not create the illusion of more novelty than is merited, and to take advantage of the technical work that has preceded it. Showing that this problem has been discussed over, and over, and over again in the non-psych literature can add weight to this, making people aware of how much they are unaware of, and that this is essentially a settled issue. A small sampling of this literature, uncited here, is below. You can decide if any of these are helpful.

    Goodman, S. N. (1992). A comment on replication, P-values and evidence. Statistics in Medicine, 11, 875–879.
    Boos, D. D., & Stefanski, L. A. (2011). P-value precision and reproducibility. The American Statistician, 65(4), 213–221. DOI: 10.1198/tas.2011.10129
    Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015). The fickle P value generates irreproducible results. Nature Methods, 12(3), 179–185.
    Lazzeroni, L. C., Lu, Y., & Belitskaya-Lévy, I. (2016). Solutions for quantifying P-value uncertainty and replication power. Nature Methods, 13(2), 107–108.

    The abstract of the (my) 1992 article is not dissimilar to this one’s:

    Abstract
    It is conventionally thought that a small p-value confers high credibility on the observed alternative hypothesis, and that a repetition of the same experiment will have a high probability of resulting again in statistical significance. It is shown that if the observed difference is the true one, the probability of repeating a statistically significant result, the ‘replication probability’, is substantially lower than expected. The reason for this is a mistake that generates other seeming paradoxes: the interpretation of the post-trial p-value in the same way as the pre-trial alpha error. The replication probability can be used as a frequentist counterpart of Bayesian and likelihood methods to show that p-values overstate the evidence against the null hypothesis.
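
    A rough numerical illustration of the abstract's central claim, using a plain normal approximation rather than Goodman's exact setup: if the observed effect is treated as the true one, a result with p right at .05 reaches significance again only about half the time (Python):

    # Treat the observed z-score as the true effect and ask how often an exact
    # replication would again give two-sided p < .05 (ignoring the negligible
    # chance of significance in the wrong direction).
    from scipy.stats import norm

    z_crit = norm.ppf(0.975)
    for p_obs in (0.05, 0.01, 0.001):
        z_obs = norm.ppf(1 - p_obs / 2)           # observed z for this two-sided p
        rep_prob = 1 - norm.cdf(z_crit - z_obs)   # P(replication z exceeds the cutoff)
        print(f"observed p = {p_obs:<6} -> replication probability ~ {rep_prob:.2f}")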

    • Steve:

      Thanks for the references. We did not intend to “create the illusion of more novelty than is merited.” If we were not clear in our paper that the statistical ideas therein were not new, then we did a bad job in communication. We’ll take a look and make sure that we are not claiming novelty in our paper.

      Regarding your abstract, I would just comment on your statement, “if the observed difference is the true one.” I do not think it makes sense in general to use the observed difference as an estimate of the true or population difference, and certainly not under selection for statistical significance.

    • Thanks Steve. I thought that it would be clear that we are covering old ground when we wrote in the first paragraph:

      “We will demonstrate through direct replication attempts that one adverse consequence of the statistical significance filter is that it leads to findings that are positively biased (Gelman, 2018; Lane & Dunlap, 1978).”

      We were also trying to avoid a citation salad (a long list of citations that often accompanies a claim). This is (ironically) a recommendation by Sternberg, on how to write psych articles, IIRC. Or was it Bem? They wrote a bunch of articles on this topic (of writing) once.

      But we will add a few sentences early in the paper making it clear that this topic has been discussed over and over and over again. It’s a puzzle to me why psych* never got the message.

      In our defence, I feel that this paper is still useful because we took two+ years to actually try to reproduce the significant results in the original paper. Making mathematical arguments or using simulation to demonstrate the point is very important, but actually seeing the (to you) obvious point in action will be shocking to many, I can assure you.

      We will read your papers! Thanks for the references.

        • Here, those adjectives seem biased to psychology authors rather than earlier statistical authors – though maybe that’s just my biased perspective ;-)

        • Hi Keith; it cuts both ways. How many psychologists do *you* cite? ;) Everyone cites work mainly from their own field. The problem is that a psychologist is not going to read Statistics in Medicine. And a medical statistician won’t read Psychological Methods either.

        • > How many psychologists do *you* cite?
          Without actually checking: I have cited them, or should have, if the topic involved psychology. Now if a statistician wrote about the psychology I would be comfortable citing them, but I also should cite a source from psychology.

          Otherwise, it’s just poor scholarship – why deprive your readers (or yourself) of the originating sources?

        • Hi Keith, you wrote:

          “those adjectives seem biased to psychology authors rather than earlier statistical authors”

          Where in the paper is there a reference to a psychology author that should have been to an earlier statistical author? Could you point me to the paragraph(s) in the paper?

        • “why deprive your readers (or yourself) of the originating sources?”

          This is a very interesting question that I have been pondering for the last 10 or so years, and I think I have some answers, and they don’t have to do with bad scholarship. I’ll give some examples.

          1. I was briefly a member of the American Statistical Association at one point, and Rao (the statistician) was interviewed once in their magazine. He mentioned in the interview that a statistical law or rule he had discovered ended up being named after an American statistician by an author of a paper because (the author explained to Rao) Rao’s name was too difficult to write or pronounce. Another example: Levenshtein distance is often called edit distance, and a tutorial somewhere on the internet explains why: “If you can’t spell or pronounce Levenshtein, the metric is also sometimes called edit distance.” So Vladimir Levenshtein’s contribution to science was lost in textbooks. For example, in the classic computer science textbook Introduction to Algorithms, by Cormen, Leiserson, and Rivest, they (in my old edition from the early 2000s at least) have section 16-1 on Euclidean distance, 16-2 on Edit distance, and 16-3 on the Viterbi algorithm. This is not poor scholarship but FOPUN, Fear of Pronouncing Unpronounceable Names.

          So, I now propose Vasishth’s First Law, the Law of FOPUN: “If the name of a discoverer of a law or algorithm is unpronounceable, attribute it to someone else, or give it a simpler name.”

          I believe that Stigler has a whole section on this phenomenon in Statistics on the Table (the history of statistical concepts and methods). I even found a list: here.

          Is the latter bad scholarship on statisticians’ part? Maybe, but it could have other explanations. I forgot now how Stigler explained it in his book and can’t find the place where he discussed it.

          2. In 2009, this paper appeared in Behavioral Ecology:

          Schielzeth, H., and Forstmeier, W. (2009). Conclusions beyond support: Overconfident estimates in mixed models. Behavioral Ecology, 20, 416–420. http://dx.doi.org/10.1093/beheco/arn145.

          Almost nobody in psycholinguistics reads or cites it, of course, and I believe nobody in statistics either. For example, Stroup wrote an amazing statistical textbook called Generalized Linear Mixed Models: Modern Concepts, Methods and Applications, in which he warns the reader exactly *not* to do what Schielzeth et al. advise. But he doesn’t cite the above paper. Why not? Not bad scholarship, but rather because statisticians don’t read outside their discipline that much *when thinking about statistical issues*. That is what I was referring to when I asked you, Keith, when was the last time you cited a psychologist *writing on statistics*?

          The story gets funnier. This paper appears in 2013:

          Barr, D. J., Levy, R., Scheepers, C., and Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278. http://dx.doi.org/ 10.1016/j.jml.2012.11.001.

          It makes basically the same point as Schielzeth et al., and even (to their great credit) cites it. However, subsequent discussions in psycholinguistics of the point that Schielzeth et al. made have been attributed to Barr et al. Barr et al. have over 2000 citations, and poor Schielzeth et al. have a mere 370, almost all within ecology and closely related areas. The Barr et al. paper became a citation classic, and the people who first made the point (I am having difficulty recalling the names of the authors even two lines down from having cited them, Vasishth’s first law in action) are gone from history.

          An even more amusing example is this: in 2008, Florian Jaeger, a friend of mine (a psycholinguist with statistical expertise), wrote this paper, which garnered 2000+ citations, another citation classic in our field:

          @article{jaeger2008categorical,
          title={Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models},
          author={Jaeger, T Florian},
          journal={Journal of memory and language},
          volume={59},
          number={4},
          pages={434–446},
          year={2008},
          publisher={Elsevier}
          }

          Almost all subsequent adoption of logistic mixed effects models in psycholinguistics cites this paper, and doesn’t even give any credit to the person who actually wrote the glmer software: Douglas Bates. Bates’ papers could have been cited (there are many), but *no* psycholinguist writes “we used logistic mixed effects regression (Bates et al., 200x)”; rather, they write
          “we used logistic mixed effects regression (Jaeger 2008)”. What the heck? If I were Bates (and not a day goes by when I don’t wish I were a younger version of Bates), I would be really pissed that I do all the work of writing the code and someone who wrote a tutorial on it gets the credit and I never get cited!

          This phenomenon has the character of a law, which I call

          Vasishth’s Second Law of Citation Tunnel Vision: Given a choice between citing someone outside your field and within your field, cite the paper within your field.

          There is a third law:

          Vasishth’s Third Law, The Sternberg Way: Given a choice between citing someone else and yourself, cite yourself.

          Note that I provided convenient alternative names to my laws so nobody has to write my unpronounceable name, an instance of FOPUN. Note that I didn’t do any research either to see if anyone else has already discovered these laws, so please correct me if they have already been discovered.

          3. Last example: C. S. Peirce was many things, but he was also a psychologist. He discovered randomization, something usually credited to Fisher, in the course of setting the groundwork for experimental work in psychology. This is discussed in a very entertaining chapter (ch. 10) in Stigler’s book, Statistics on the Table, which every student of the history of statistics should read (I mean the book). It seems it’s pretty common to not know what’s going on outside one’s own immediate field. I doubt very much that C. S. Peirce was a bad scholar.

    • You are a medical statistician, right? I always wanted to know: why has medicine not adopted Bayesian methods? Just a few weeks ago I was at a meeting on transplantation and all I saw was p < 0.05. It’s 2018.

      • “I always wanted to know: why has medicine not adopted Bayesian methods?”

        A number of Bayesian statisticians have tried to introduce medical researchers to Bayesian methods, but from what they say, it is hard to convince the medical researchers to switch from what they are used to. But there definitely is literature on the subject (e.g., https://www.crcpress.com/Bayesian-Adaptive-Methods-for-Clinical-Trials/Berry-Carlin-Lee-Muller/p/book/9781439825488)

        • I read the key sections of this book last night. One phenomenal sentence in the book:

          A clinical trial should be like life: experiment until you achieve your objective, or until you learn that your objective is not worth pursuing.

          page 12.

      • In the pharmaceutical industry there are two different worlds:

        1. In early trials (mostly for deciding whether projects proceed and sometimes what dose to use) Bayesian methods have gained a lot of ground (in Oncology dose escalation experiments, as well as in non-Oncology proof-of-concept trials that more and more use some informative priors based on historical data).

        2. Late-stage trials (confirmatory trials that are meant to lead to an approval) are much more (or essentially exclusively) frequentist. This is based on the not entirely unreasonable argument that all the pre-clinical research + early trials must mean the prior probability of the company’s drug working is decent (let’s say 60% or so), so if the company runs two trials with 80 to 90% power (for the smallest effect size that is meaningful, or something not much bigger) that each need to achieve p < 0.05, a positive outcome implies a high posterior probability that the drug really works (i.e., that the true effect is > 0). There is a bit of a feedback loop between regulators that consider what I just outlined as a proven conservative approach and companies that do not want to risk multi-multi-million dollar investments on some “unproven” approach (so that they usually don’t even ask about Bayesian methods). And I assure you, it is not because statisticians in the industry are not aware that Bayesian methods might be useful.

        For some reason non-Oncology dose finding trials are also primarily conducted under this second paradigm, but if you want to confirm that a medical device works (they are reviewed by a different FDA division), it is much more common to use Bayesian approaches.

        Where does this leave non-industry trials? I have a lot less knowledge there, but I speculate that what happens with (2) in drug approvals and how past large public (e.g. NIH) trials were done has somehow created the perception that when you “objectively scientifically prove” something you follow approach (2). And of course there are plenty of incentives to “prove something” with a tiny inappropriate sample size (possibly also with a little wander down the garden of forking paths) and get a big splash publication (=fame, tenure, future funding etc.). And of course, a publication at most journals is most easily achieved with p<=0.05 (if you are lucky, if you have not done anything weird and funny like Bayesian statistics, you might not even get a statistical reviewer – who are known to ask awkward difficult questions).

        • This is a very valuable summary of the current situation, Björn. Outsiders like me keep wondering what’s going on.

        • Thanks for this. In paradigm 2, because of the winner’s curse, it seems to me that pharma companies ought to be powering their Phase 3 trials to detect a smaller effect than the one they actually observed in Phase 2. Do they do this? Is there any kind of statistical machinery that could help determine how much smaller they should go? I think about this every time I see dire reports on how many Phase 3 clinical trials fail to meet their endpoints.

          I also wonder whether regulators consider effect size at all in deciding whether to approve a drug. Like, if you correctly power your study to find a tiny effect, and you adjust your alpha for any interim looks at the data, and you still see statistical significance, is that all the FDA cares about? Will they approve a drug that we are pretty confident adds 6+/-1 days to a person’s life expectancy?

        • Erin

          In paradigm 2, pharma firms should be using thoughtful weight-of-evidence and decision analysis. Now such an analysis might support betting on a tiny or large effect at very low or high power, depending on the economics.

          As for whether regulators consider effect size at all in deciding whether to approve a drug, it depends on the country.

          Regulators have to follow the laws that give them their regulatory powers. For instance, in Canada approval is at the federal level, and the requirement, I believe, is just to show some benefit over placebo with something like a reasonable risk of side effects. It’s up to the provinces to decide whether the drug is worth paying for on behalf of the patients they cover, and up to physicians for their patients who might be willing to pay for it.

          The FDA should provide this type of information on their website.

        • “In paradigm 2, because of the winner’s curse, it seems to me that pharma companies ought to be powering their Phase 3 trials to detect a smaller effect than the one they actually observed in Phase 2.”

          The issue with this is that in many cases, there may be a positive effect that’s less than what’s hypothesized, but not enough for final approval to take a drug to market. In many such cases, showing a positive effect is *not* enough to bring a drug to market. For example, in clinical trials for generics, it’s my understanding that you need to show that the lower bound of your CI is greater than some fraction of the label brand drug’s effect (i.e., if the effect of the label drug is 1, the lower bound of your CI must be 0.8 or something like that). This can lead to an odd result when your trial ends with a CI of (0.7, 1.1): you don’t get approval and it’s not clear that you will if you up your sample size.

          I think this answers your second question: at the very least, you are required to make a strong argument to the FDA why your effect size target is clinically significant, not just statistically significant. I think for generics there’s already a standard (i.e., relative effect compared to label brands), but I’m not clear on the details there.

          Similar to what Bjorn has said, I think Bayesian methods make lots of sense for early trials (both because data is expensive and we are using surrogate models, so we need to recognize that we’re definitely not taking samples of the population of interest), but Frequentist methods are very important for Phase III trials, as the high rate of failure for Phase III trials is strong evidence that the prior information justifying Phase III trials appears to be mis-calibrated with high frequency.
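
          A toy simulation of the winner's-curse question raised above, with invented design numbers: size Phase 3 at 90% power for the (selected, hence inflated) Phase 2 estimate, or for an arbitrarily discounted version of it, and compare how often Phase 3 then succeeds when the true effect is modest (Python):

          # Phase 2 estimates are carried forward only when "positive", so they are
          # inflated; powering Phase 3 on them leads to frequent failure.
          import numpy as np
          from scipy.stats import norm

          rng = np.random.default_rng(7)
          true_d, n2 = 0.25, 50                        # invented true effect, Phase 2 per-group n
          z_a, z_b = norm.ppf(0.975), norm.ppf(0.9)    # 5% two-sided alpha, 90% target power

          def phase3_success(d_planned):
              """One Phase 3 trial sized for 90% power at d_planned; True if p < .05."""
              n3 = 2 * ((z_a + z_b) / d_planned) ** 2  # per-group n (unit-variance outcome)
              se3 = np.sqrt(2 / n3)
              return rng.normal(true_d, se3) / se3 > z_a

          kept = naive_ok = shrunk_ok = 0
          for _ in range(20000):
              se2 = np.sqrt(2 / n2)
              d2_hat = rng.normal(true_d, se2)
              if d2_hat / se2 < z_a:                   # only "positive" Phase 2 results go on
                  continue
              kept += 1
              naive_ok += phase3_success(d2_hat)         # powered on the Phase 2 estimate
              shrunk_ok += phase3_success(0.6 * d2_hat)  # powered on a discounted estimate

          print(f"Phase 3 success rate, sized on the Phase 2 estimate: {naive_ok / kept:.2f}")
          print(f"Phase 3 success rate, sized on 0.6 x that estimate:  {shrunk_ok / kept:.2f}")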

      • ” I always wanted to know: why has medicine not adopted Bayesian methods”

        Because we, in medicine and biology, have not been taught that. Grasping the concept of p-value alone is already hard for us, how do you want us to adopt Bayesian statistics?

        The conundrum is similar to the old adage about p-values: why do teachers teach about p < 0.05? Because that is what the editors require later on. Why do editors require p < 0.05? Because that is what they were taught.

      • Well, I was a medical statistician and the first project I worked on starting in 1985 used Bayesian methods but for evaluation of study design issues rather than analysis of a completed study.

        It’s just that the culture was pretty much anti-Bayes and also anti-meta-analysis. Single studies were supposed to be analysed on their own, as isolated islands. That made even less sense than anti-Bayes, where until about the last 10 years priors really were poorly understood, both in terms of what should be aimed for and their role. And the fictional Dan Simpson is still creating pieces of those like the fictional Australian continent/country.

        When Don Rubin asked me why I did not use Bayesian methods around 2000 I said I thought I would retire before they became acceptable in medicine. Before leaving the field I did get to do one Bayesian analysis (2008).

        Adaptive designs are a bit more complicated because even if they are Bayesian they can be easily evaluated frequentistically, and so there should not be distrust. However, that only happened about 10 to 15 years ago when computers got fast enough. For instance, regulatory agencies became comfortable with them around 2011, given mostly the work done by Don Berry.

        Now there were some early fumbles with adaptive designs in clinical research, and then some clinicians got very strange ideas about their performance characteristics and declared them unethical. Steve Goodman had to do an awful lot of work to sort that out for the medical community.

        Also, I think Don Berry’s clear arguments that the frequency properties of Bayesian methods need to be considered in drug regulation will help medicine profitably adopt Bayesian methods.

        So overall, I think medicine’s initial rejection of Jay Kadane’s subjective “you can’t question the prior” Bayes and of non-informative Bayes (with _hidden_ nuisance parameter disasters) served the field well – in the past.

      • I’m all for teaching and using Bayesian methods – alongside frequentist methods, and hybrids of the two.
        I find frequentist and Bayes are often best when applied together to balance each other’s excesses and weaknesses. This strategy is a variant on super-learning, but not foolproof (nothing is), because both methodologies share many weaknesses in the face of violations of the data-generating (sampling) model.

        To explain, let me replace your question with: “What are the key weaknesses of Bayesian statistics relative to frequentist statistics, leading to seemingly excessive caution in its adoption for practice?”
        One answer I’d offer is: Good prior specification for (modeling of) parameters is really hard, and bad prior specification can distort inferences as badly as significance testing. To ease the specification task, partial-Bayes (semi-Bayes) allows one to stop the prior-specification process when one runs out of reliable information for that purpose, and revert to frequentist treatments of the unmodeled parameters.

        More cynically, one could also well ask “Why has medicine not adopted frequentist inference, even though everyone presents P-values and hypothesis tests?” My answer is: Because frequentist inference, like Bayesian inference, is not taught. Instead everyone gets taught a misleading pseudo-frequentism: a set of rituals and misinterpretations caricaturing frequentist inference, leading to all kinds of misunderstandings.

        – Understand, I am not a frequentist or Bayesian or any other such school: I regard all schools as toolkits which can be applied or not as seems helpful or not. This notion has been around since the 1970s at least. Yet frequentist vs Bayes is still discussed as if an exclusive and exhaustive choice, even though it isn’t: One can use both in a given problem and even combine them to good effect…and yet still miss important problems both share (like poorly identified misspecifications of the data-generating model, and other violations of shared assumptions), problems which sometimes jump out from basic data descriptions (like finding hundreds of patient records coded as “woman” with “testicular implant,” as once happened to me). Good data analysis needs so much more than abstract theory – it needs immersion in context, and detailed study and data description. I find teaching this a bigger challenge than any other.

    • Goodman SIM 1992 is an excellent article (as is its companion, Goodman SN. P Values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol 1993;137:485–496).

      BUT anyone reading those and concerned with replication should also read Senn’s analysis and comments on Goodman 1992,
      Senn, S.J. (2001), “Two Cheers for P-Values,” Journal of Epidemiology and Biostatistics, 6, 193–204.
      Senn, S.J. (2002), Letter to the Editor re: Goodman 1992, Statistics in Medicine, 21, 2437–2444
      which emphasize that to expect P to be “replicable” or “reliable” reflects a deep misunderstanding of frequentist statistics, because P is specifically constructed to be maximally noisy (uniform) under the hypothesis it tests, given the embedding model (assumption set).

      • I wonder if this comment will inspire another entertaining rant by regular commenter Anonymous, who likes to point out that frequentists have been arguing for ~100 years now that their tenets are deeply misunderstood; maybe it’s time to admit that frequentism is unteachable?

        • Bayesians have been arguing that their tenets are deeply misunderstood for just as long, although those arguments attracted far less attention because Bayesian teaching and analyses have so far not played much of a role in science. And Bayesians have been fighting among themselves for just as long: There is no one Bayes (compare deFinetti to Jeffreys) any more than there is one frequentism (compare Fisher to Neyman).

          In the deFinetti spirit I will wager that Bayes abuse will rise in direct proportion to Bayes adoption to challenge the atrocities of significance testing with its own horrors, especially in confusing inferences driven by priors with inferences driven by data. It’s now been a decade since I confronted an example of that “fooled by priors” phenomenon published by a leading proponent of Bayes in medicine (Greenland, S. 2009. Weaknesses of certain Bayesian methods for meta-analysis: The case of vitamin E and mortality. Clinical Trials, 6, 42-46.).

          I view intimations that “Bayes is our saviour from frequentism” as naive the way claims that “significance testing is our saviour from being fooled by randomness” were a century ago: Any exclusivity or extremism is the road to ruin, however good the intention and compelling the rhetoric. Stats already tried being monolithic with frequentism, must it repeat the mistake with Bayes? Methinks even Bayes would have objected.

        • Sander:

          Yes, Bayesians can be very much fooled by uniform priors. One virtue of classical inference is that it makes weak claims. In contrast, Bayesian inference makes strong claims about everything, hence you have examples like this:

          y ~ normal (theta, 1)
          theta ~ uniform
          y = 1
          ==>
          theta | y ~ normal (1, 1), hence Pr (theta > 0 | y) = 84%.

          It would be a ridiculous practice, in general, to make statements with 5:1 odds based on data that are indistinguishable from pure noise. But that’s what Bayes with flat priors would have you do. The problem is that flat prior. Or, to put it another way, the statement “It would be a ridiculous practice . . .” codes prior information that theta has a high probability of being near zero.

          This is no worse than classical NHST inference of the sort that is behind so much of the junk science, replication crisis, etc., but it’s no better either. The difference, perhaps, is that researchers are less trusting of these sorts of posterior probability statements, as they are clearly based on assumptions. In contrast, NHST is based on equally ludicrous assumptions, but these assumptions are hidden, and researchers often seem to think that if they have done a randomized experiment, then all the assumptions of the test are automatically true. (And when they do talk about assumptions, they tend to focus on irrelevant assumptions such as normality, rather than the important assumptions of validity, large and stable effects, etc.)

          Statistics never can be monolithic; the problems we face are too diverse, even just looking at randomized controlled trials.

        • I find the y, theta example a bit odd for a few reasons. One is that it seems like AG is making a NHST argument in earnest.

          But more seriously, suppose one was to replace “theta ~ uniform” with “theta ~ normal(0,2)”. We get a nearly identical posterior, or at least a nearly identical P(theta greater than 0 | y) = 0.81 (compared to 0.84 with uniform prior). I think the normal(0,2) prior wouldn’t seem *too* weak to most researchers. Even with “theta ~ normal(0,1)”, we still get 0.76.

          Personally, I have no problem with the certainty returned, as long as you are certain about the likelihood. But I’m also from the camp that believes you should have a healthy distrust of the appropriateness of the likelihood.
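
          A quick sketch checking the numbers in this exchange with the standard conjugate-normal update, approximating the flat prior by a normal prior with a huge standard deviation (Python):

          # Posterior for theta when y ~ normal(theta, 1), y = 1, and theta ~ normal(0, prior_sd).
          from scipy.stats import norm

          y, data_sd = 1.0, 1.0

          def prob_positive(prior_sd):
              """Pr(theta > 0 | y) under a normal(0, prior_sd) prior."""
              w = prior_sd**2 / (prior_sd**2 + data_sd**2)   # weight on the data
              post_mean, post_sd = w * y, (w * data_sd**2) ** 0.5
              return 1 - norm.cdf(0, loc=post_mean, scale=post_sd)

          for label, sd in [("~flat", 1e6), ("normal(0, 2)", 2.0), ("normal(0, 1)", 1.0)]:
              print(f"prior {label:>12}: Pr(theta > 0 | y = 1) = {prob_positive(sd):.2f}")

          # Prints approximately 0.84, 0.81, and 0.76, matching the values above.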

        • P.S. I’m making a frequentist argument here, not a NHST argument.

          Frequentist argument: looking at long-term frequency properties of a statistical method, given certain assumptions.

          NHST argument: proposing to make a decision based on rejection or non-rejection of a hypothesis test.

        • Andrew:
          I should have made clear that the example I took to task in the 2009 Clinical Trials citation was the opposite of yours, in that they were led to claim “no effect” of a treatment because they used a prior with a 50% spike on the null, when in fact the data pointed toward a harmful effect (and for decades harm had been suspected on biochemical grounds by some subject-matter experts). The prior gave enormous weight to a null that had no basis in empirical observations or credible biochemical theories (all of which pointed toward an effect in one direction or another), and that weight passed through to the posterior (as spikes must). Basically, they made the same sort of mistake as claiming “no effect” because P was big, but that mistake was obscured by the Bayesian gloss. As I recall back in Epidemiology in 2013 we agreed on the evils of prior spikes, and this is a real illustration.

          So, I agree that a problem with Bayesian inference is its strong claims, which can flow from strong priors (as with spikes) or strong sampling models. I would not identify the problem with uniformity, however, since in some instances uniform can be a perfectly reasonable mechanical prior on a finite interval. I think the real problem in your example is captured by your reframing it as that we have strongly concentrated prior information for parameters on unbounded intervals (not necessarily near zero though), so much so that one Y should not shift our bets much.

          I think “a reader” raised this issue with your uniform example, which can just as well be taken as showing that there is a lot more information in one Y~N(theta,1) than most people realize, and that information is contained in the very strong assumption that the variance = 1. So Y is not “pure noise,” it is very constrained Gaussian “noise” which is telling you a lot (in relative terms) if your prior variance is not small compared to the known Y variance. I’d call the one Y “pure noise” if instead the sampling model said only it was normal with completely unknown variance; in that case you’d get no information out of one Y.

        • Sander:

          Yes, any prior (or, for that matter, any model) can include too-strong information and yield unreasonable inferences. I didn’t mean to imply that the uniform prior is the only one to cause big problems.

      • This article below is also very relevant to the discussion by Sander here (for which many thanks, Sander).

        @article{greenland2016statistical,
        title={Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations},
        author={Greenland, Sander and Senn, Stephen J and Rothman, Kenneth J and Carlin, John B and Poole, Charles and Goodman, Steven N and Altman, Douglas G},
        journal={European journal of epidemiology},
        volume={31},
        number={4},
        pages={337–350},
        year={2016},
        publisher={Springer}
        }

  5. It is common but incorrect to say that with NHST the alternative hypothesis is a non-zero effect size. With the pure Fisherian p-value there is no alternative hypothesis at all. In the Neyman-Pearson construction the alternative hypothesis is the minimum clinically significant effect size. Power calculations are mandatory in the medical literature but apparently not in psycho-linguistics. NHST is an amalgam of the Fisherian and the Neyman-Pearson approaches so if there is no power calculation then it is not NHST.

      • Power calculations are way off relative to what? One cannot use the true effect to calculate the power; if the true effect were known, one would not be wasting time with experiments and power calculations.

        As he said, the minimum clinically meaningful effect size is used to calculate the power. The true effect might or might not be above this threshold, but p-values with lots of zeros are not uncommon in clinical trials.

        • +1 to first paragraph.

          Re the second paragraph: The devil may be in the details here — I have heard some “definitions” of “minimum clinically meaningful effect size” that seem pretty dubious. (I don’t mean to imply that it should be easy to decide what the minimal clinically meaningful effect size should be, but precisely because of the difficulty, the decision needs to be based on careful reasoning from particulars of the drug and condition to be treated, and not just calculated by some algorithm from the data, especially if the data are of poor quality.)

    • So what we do use is neither Fisherian nor NP, but a weird amalgam of the two. I will rename the section “How null hypothesis significance testing (NHST) is typically used”.

  6. I suggest revising the statement,

    “It is important to note here that the p-value is the probability of the observed
    t-statistic or some value more extreme, assuming that the null is already true.”(p.5)

    to say something more like,

    “It is important to note here that the p-value is the probability of obtaining the observed t-statistic or some value more extreme, assuming that the null is already true, and only compared to samples of the same size as in the study.”

    Adding “obtaining” clarifies the sentence. Adding the phrase “and only compared to samples of the same size as in the study” helps point out why the “conclusion” does depend on the sample size; this is a point that is very often sloughed over.

    (This might also require similar additions in other places. Also, there might be a better way of phrasing this.)

    • It is important to note here that the p-value is the frequency with which we would obtain a t statistic more extreme than the observed one in a long series of direct replications of the experiment with identical sample size, assuming that the null is in fact true.

    • Sorry, I have to sharply dissent from Martha and Daniel’s comments and suggestions:
      It’s just technically incorrect to say “the p-value is the probability of obtaining the observed t-statistic or some value more extreme, assuming that the null is already true, and only compared to samples of the same size as in the study.” For the paradigmatic t-test targeting a normal-population mean, you have locked the P-value into a draw from a uniform(0,1) distribution once you assume a test value for the mean (null or otherwise). N does not matter if the hypothesis tested (and the normality assumption) is correct.
      It does not matter if you change the sample size across samplings (studies) or even change the underlying population and mean and variance with it: Once you assume each test is of the true mean from the population to which it refers, each resulting P-value is a uniform draw. Neyman (Synthese 1977) explained the analogous property for alpha-level testing. The fact is you know your Type I error rate across these studies – and that is what frequentism (error statistics) is about.
      N matters when your test hypothesis or the embedding model (assumption set) for the test is incorrect (which of course is almost always the case), with power to detect alternatives or statistically identifiable model violations going up slowly as N increases.
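
      A short simulation of this point: when the tested hypothesis and the model are correct, the p-value is a uniform(0, 1) draw whatever the sample size (the sample sizes and simulation count below are arbitrary; Python):

      # One-sample t-tests of a true null at several sample sizes: the p-value
      # distribution stays flat and P(p < .05) stays near .05 regardless of n.
      import numpy as np
      from scipy.stats import ttest_1samp

      rng = np.random.default_rng(0)
      for n in (10, 50, 500):
          pvals = np.array([ttest_1samp(rng.normal(0, 1, n), popmean=0).pvalue
                            for _ in range(5000)])
          deciles = np.histogram(pvals, bins=10, range=(0, 1))[0] / len(pvals)
          print(f"n = {n:>3}: P(p < .05) = {(pvals < .05).mean():.3f}; "
                f"decile shares = {np.round(deciles, 2)}")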

        • This is why we should standardize definitions. I had been taking stock [last summer] of different definitions/emphases in different contexts within statistics articles; I wish I had kept the list I had made. But I have been creating a novel course in elementary statistics. It’s a hobby for me.

      • Sander, what do you think of this definition:

        A p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

        • Shravan:
          I’ve seen and even taught that definition or similar ones, and I don’t regard it as technically wrong. But I’ve come to regard it as far too thin to remedy prevalent misunderstandings.

          I find P-values pretty amazing insofar as such a small portion of the research community seems to “get” them (unsurprising when as documented they were often taught from books that interpreted and sometimes even described P-values incorrectly, and then their mentors published conclusions and built successful careers based on these errors). Many say that fact shows P-values are too difficult to use properly. I don’t think that’s true; the concepts do take work to grasp but no more so than calculus, which is routinely taught and used throughout science and engineering (although nowhere near as badly). The difference is students get a lot of pre-calculus before being hit with integration and differentiation; by contrast most basic stats I see provide too little in essential background concepts.
          P-value abuse thus looks to me more like an accident of history, given the personalities involved (think of Fisher), and the radical differences in time and place between the foundational writings and typical applications today. And the educational history looks to me like a game of “telephone” (“Chinese Whispers”) in which the definitions and interpretations deteriorated with distance from the originating sources. But, once P-values are freed from prevalent distortions, I find the problems attributed to them are problems of “inferential” statistics in general; e.g., selection (dredging, hacking) artefacts would arise from any statistic whose value determined presentation and emphasis.

          All that has made me think that much more lead-in discussion of hypotheses, models, and basic logic is needed before trying to be precise about P-values. With that background, I will now invite criticisms of how I currently teach P-values to those with some background in stats, which I call a neo-Fisherian view (nothing original here except perhaps the errors):

          Intuitively and roughly, the purpose of a P-value is to interpretably rescale a measure of distance D between our data set and a test model (which includes any targeted hypothesis and all auxiliary assumptions including not only parametric constraints as in regression models, but also assumptions like integrity and competence, e.g., no P-hacking).
          Considering single-sample results, a smaller observed p corresponds to greater distance from the test model to the data; in this sense p provides more information against the model as p gets smaller.
          The information is of the same form that engineers and clinicians use daily: p is the percentile position of d on a chart for D taken from a reference population (see Perezgonzalez J. D., 2015, P-values as percentiles, Frontiers in Psychology, 6:341. doi: 10.3389/fpsyg.2015.00341 and other articles by him on P-values).
          Thus P is used much the same way as a control chart; e.g., a clinician would refer a patient’s hematocrit to a chart showing percentiles in some reference (“normal” in ordinary English) population; extremes indicate problems.
          One way to quantify the information being conveyed is to use the logworth/surprisal/S-value transform s = log(1/p) = -log(p) with logs in base 2, so that independently obtained information is additive and the scale/unit of the measure is the bit (specifying one of two possible states).

          For those comfortable with vector geometry and probability theory, I add this abstract, technical frequentist (repeated-sampling) view for the continuous case (which no doubt could be improved vastly; also the discrete-data reality needs some additional details to link it to the continuous theory, which I’ll spare us for now, along with other details for math stat):
          D is a distance measure in sample space, with its observed value d being the distance from the data projection (“fitted data”) on the test model to the data projection on a larger (embedding) model or structure, expressed in units “standardized” to the test model. The embedding model differs in not imposing the targeted hypothesis, e.g., as in a test of a coefficient in a regression model.
          If the embedding model is saturated, as in global tests of fit, the distance is from the tested model to the data so D is a random residual in the sample space; this model still includes the sample space/reference set/exchangeable sequence of possible data sets over which D is computed and all the usually unmentioned yet often violated integrity and competence assumptions.
          Although larger d correspond to larger model violations within the embedding model, d is hard to interpret directly because the distribution of D and thus the size of d is highly dependent on the model forms as well as the data structure.
          One way to extract the information in d is to compute the distribution F of D when the test model is correct, then take the inverse of F and apply it to D. The resulting random variable P carries information about test model violations (refutational information) within the embedding model; thus:
          If the test model is correct, all that is left in P will be white noise (maximum entropy) variation between 0 and 1, which corresponds to a uniform distribution for the random P;
          If the model is incorrect along some dimension contributing to D, the distribution will shift downward. For any point on the embedding model, this shift is computable (typically from a noncentral distribution for D) and yields a power function.

          That leaves open the interpretation of p and s when (as always) we really don’t think the test model or even the embedding model is correct;
          David Trafimow asked pointedly, what use does that leave for P? (or S?)…
          My answer in brief is that we should then switch to a completely unconditional interpretation of p in which s = -log(p) merely quantifies the information against the model without specifying the source; e.g., in a typical null test of a targeted coefficient, the information s against the test model could be large because the coefficient is not zero or because the embedding model is wrong, or some combination; likewise, s could be small because the coefficient is small and the embedding model is approximately correct, or because the coefficient is large and the embedding model is wrong in a way that cancels in the residual. And, as with any statistic, on top of both influences we have random variation (“noise”) in our measures.

          The uncertainty all that leaves is just a reality which should be exposed by statistics, instead of swept under the rug as it is by conventional interpretations – a point I believe Andrew has made repeatedly and which cannot be repeated enough in various forms.
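
          A tiny numerical illustration of the logworth/surprisal/S-value transform defined above, s = -log2(p): information against the test model in bits, additive across independent results (the p-values below are arbitrary; Python):

          # S-values: s = -log2(p) bits; s of about k bits is roughly as surprising
          # as k heads in a row from a fair coin.
          import math

          for p in (0.5, 0.25, 0.05, 0.005):
              s = -math.log2(p)
              print(f"p = {p:<6} -> s = {s:5.2f} bits (~{round(s)} heads in a row)")

          # Additivity across independent results: two p-values of .25 carry
          # 2 + 2 = 4 bits, the same as a single p of 1/16.
          print(-math.log2(0.25) - math.log2(0.25), -math.log2(1 / 16))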

        • Sander when you say “distance” is this a technical term or not? Are you talking about defining a *metric* on a *metric space*?

          If not, and because your description here is fairly “mathy” I do suggest an alternative wording, something like maybe “measure of adequacy” or some such thing that doesn’t invoke metric spaces and generalizations of euclidean distance.

        • Thanks Daniel for the feedback!
          Yes, “distance” is a technical fine point that like many is hard to work around in a way that will satisfy math statisticians and yet not be completely unintelligible to anyone else.
          By sticking to very regular cases (e.g., classical OLS linear regression) the distance is indeed an ordinary Euclidean distance (and hence a true metric) in the sample space, with the model an affine subspace (at least to first-order approximation). We can hold onto that intuition with most generalizations found in everyday “soft science” applications I’m familiar with, like GLMs. Beyond that the geometry gets heavier along with side conditions, and when we reach misspecified modeling the metric properties break down when using relative information measures like Kullback-Leibler distance (the main measure I’ve seen in that literature); I would welcome more education on that.

          The usual workaround I’ve seen is indeed to go to “measures of adequacy” or the like.
          But for the moment anyway I feel the connection to ordinary (Euclidean) distance offered by the basic cases should be exploited fully, given the sparsity of good intuitive anchors for teaching “inferential” statistics. Without them, bad anchors (like invalid Bayesian inversions) rush in to fill in the mental picture.

        • > sparsity of good intuitive anchors for teaching “inferential” statistics. Without them, bad anchors (like invalid Bayesian inversions) rush in
          Nicely put – I do think a pre-statistics course, analogous to pre-calculus, is required.

        • How is the neo-Fisherian view different from the Fisherian view?

          It’s fascinating (and a reflection of the sad state of affairs) that an article titled “P-values as percentiles” can be worth publishing in 2015.

          By the way, your comment about the additivity of -log(p) reminds me of the following passage in Fisher’s Statistical Methods for Research Workers (in section 21.1, introduced somewhere between the 1st 1925 and 5th 1934 editions):

          The circumstance that the sum of a number of values of chi-squared is itself distributed in the chi-squared distribution with the appropriate number of degrees of freedom, may be made the basis of such a test. For in the particular case when n=2, the natural logarithm of the probability is equal to -1/2 chi-squared. If therefore we take the natural logarithm of a probability, change its sign and double it, we have the equivalent value of chi-squared for
          2 degrees of freedom. Any number of such values may be added together, to give a composite test, using the table of chi-squared to examine the significance of the result.
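
          In modern notation this is Fisher's combined test: with k independent P-values, -2 times the sum of their natural logs is chi-squared on 2k degrees of freedom under the joint null. A small R sketch, with made-up P-values purely for illustration:

            fisher_combine <- function(p) {
              stat <- -2 * sum(log(p))   # twice the summed natural-log surprisals
              c(chisq = stat,
                p.combined = pchisq(stat, df = 2 * length(p), lower.tail = FALSE))
            }
            fisher_combine(c(0.11, 0.20, 0.37, 0.04))   # four hypothetical independent P-values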

        • Sameera: I have not seen logic as a course requirement in a statistics program.
          Instead there seems to be a presumption that the students get all the logic they need by doing math – a deadly wrong presumption when it comes to connecting math to physical realities.
          In my limited experience the pure mathematicians I studied under seemed quite aware of this problem and even warned about it, but the academic statisticians usually did not, or thought mathematical logic was all there was to logic.

        • Carlos:
          Quick nonrigorous answer: neo-Fisherian is Fisherian without the hostility to things Fisher disapproved of, and even borrowing bits from NPW (Neyman-Pearson-Wald) and Bayes as seems helpful. The particular form I prefer might be better called Coxian (equating Fisher’s fiducial distributions to confidence distributions). Also there is addition of subsequent developments in information theory which Fisher just missed, like those from Shannon and Kullback.

          By “just missed” I mean not only historically but conceptually: The result you mention is the first appearance of logworth/surprisal/S-value that I know of. Fisher used the additivity across independent sources for his meta-analytic test, but I did not see where he recognized that negative log probabilities can be used to measure information (from what I’ve read, Shannon was already on to this by 1938 although his landmark paper did not appear until 1948). Fisher practically got there with his expected-information matrix, the expected hessian of the negative log-likelihood (observed-information matrix) which gives the curvature of the latter (higher curvature = more information) and is the core of the 2nd-order approximation to the entropy of the likelihood, while the observed information is the core of the 2nd-order approximation to the corresponding Shannon information.
          Disclaimer: since I may have some important technical details askew or missing, I’ll point to some cheap or free classics discussing statistics-information theory connections, with the caveat that these suppose the reader is comfortable with probability and statistics at the measure-theoretic level – Kullback focuses toward NPW theory while Jaynes focuses toward Bayes:
          https://www.amazon.com/Information-Theory-Statistics-Dover-Mathematics/dp/0486696847
          http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf
          I’d be interested to hear of more up-to-date coverages that others prefer.

        • This discussion has been very interesting and instructive. However, Sander, would you at least agree that the definition Martha and Daniel gave, which improved on mine, and the definition I quote above (a direct quote from the ASA statement on p-values, the paper nominally written by Wasserstein and Lazar), reflect how p-values are *used* today? Your definitions and objections arise because you want to go back to the original intended meaning of p-values; would you agree with that?

          It feels like a separate discussion which definition should be used. In other words, there is no definition of “the” p-value, as there are at least two different things we are talking about: what Fisher intended, and what they have now become. Is that a fair comment?

        • Shravan: Yes that’s fair, in fact there are several definitions of P-values that have been in use. In books and tutorials I see, the math formula is usually the Fisherian one, but then the verbal description is inaccurately applied to the Neymanian setting. Simplifying a bit, the history leading up to this confusion as I understand it is as follows:
          1) In the original formulation used by Laplace through Student (not yet called P-value, but rather called “significance level” in Edgeworth and Student, apparently as suggested by Venn) a P value is a limiting posterior probability of your estimate being on the wrong side of the tested parameter value as your prior variance expands without bound. In the 1-parameter cases they considered, this P equals mathematically the data tail probability we now call the one-sided P-value. It turns out this proto-P-value also becomes a lower bound on that posterior probability under a broad class of priors – see Casella, G. and Berger, R.L. (1987a). “Reconciling Bayesian and frequentist evidence in the 1-sided testing problem” (with discussion), Journal of the American Statistical Association 82, 106–111, and Casella, G. and Berger, R.L. (1987b), “Comment,” Statistical Science 2, 344−417.
          2) Next, Karl Pearson (who expanded to testing whole distributions) and Fisher started skipping the Bayesian take and using “significance level” for statistic tail probabilities, which Fisher denoted by P and thus also called a significance level “the value of P”. But Fisher also started using two-sided P-values, effectively testing point hypotheses instead of 1-sided hypotheses. By the time he did that researchers had started noting problems like confusion of scientific and statistical significance, yet his concept and terminology became the received ones.
          3) Soon after, Neyman and Egon Pearson introduced (and later Neyman promoted) a theory in which one could jump straight from the statistic to a decision with known error rates under the model, skipping Fisher’s P. But that P-value could be used for the decision provided it was properly calibrated, which is to say uniform(0,1) under the tested model and thus “pure random” by certain definitions, and otherwise shifted downward maximally under alternatives within the embedding model. Take a look at Kuffner, T.A. and Walker, S.G. (2017), “Why are p-values controversial?” The American Statistician, DOI: 10.1080/00031305.2016.1277161 which maintains the only correct view is that P is such a calibrated random variable, or U-value as Andrew calls it (also called a “valid” P-value by many frequentists).
          – Again, the usual definition I see is Fisherian, but that is then put to immediate use in a Neymanian decision rule with its concerns about size and power, which demands calibration (U-values). That leads to confusion, as seen in the present debate. Note especially that calibration does not require the generating experiment or tested model to stay constant, it only requires the model tested in each experiment to be correct for that experiment.

      • “For the paradigmatic t-test targeting a normal-population mean, you have locked the P-value into a draw from a uniform(0,1) distribution once you assume a test value for the mean (null or otherwise).”

        This is indeed a really weird situation to be in. How can I draw any inferences now if I have locked myself into a draw from a uniform(0,1)?

        This is however how Fisher defined it:

        Source:

        @article{goodman1993p,
        title={P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate},
        author={Goodman, Steven N},
        journal={American Journal of Epidemiology},
        volume={137},
        number={5},
        pages={485–496},
        year={1993},
        publisher={Oxford University Press}
        }

        “Fisher’s definition for the p value, or “significance probability,” was essentially, the same used today: it equaled the probability of a given experimental observation, plus more extreme ones, under a null hypothesis.”

        The paper goes on:

        “[t]he p value was not to be interpreted as a hypothetical frequency of “error” if the experiment were repeated. It was a measure of evidence in a single experiment, to be used to reflect on the credibility of the null hypothesis, in light of the data. … the p value was meant to be combined with other sources of information about the phenomenon under study. If a threshold for “significance” was used, it was to be flexible and to depend on background knowledge about the phenomenon being studied”

        • I wonder whether Fisher’s missives can be made understandable to 1st year statistics students; for that is when such concepts should be discussed. Might be asking too much. However, as I told John Ioannidis and others, a critical thinking course has to precede a course in statistics. Going beyond the thought leaders in statistics as well.

        • “Going beyond thought leaders in statistics”
          – I like that insofar as statistics was severely retarded by blindly following certain “thought leaders” (and by not even getting what they said correct).
          But from what I’ve seen medicine is even worse that way and is more biased by conflicts of interest, with even fewer constraints imposed by pure logic.
          Isn’t critical thinking going to the core logic and empirical facts, regardless of what any “thought leader” in any specialty wanted (or want) everyone to think or do?

        • Sander

          ‘Retarded’ is great characterization. I have been reluctant to spend time reading these ‘thought’ leaders. But I should.

          Yes, ‘critical thinking’ is or could be as you suggest. I spent roughly 10 years reviewing some of the newer curricula in critical thinking. Some of the curricula are quite good; others not so good. Fundamentally, many amount to a volley of exercises in thinking in a variety of ways, but they have their limitations. I had been following the work of a host of academics like Robert Ennis, Howard Gardner, Robert Sternberg, Raymond Nickerson, and others’ work on intelligence.

          Frankly some individuals are just better diagnosticians than others. As to why that may be is queried relentlessly.

          Lastly I am fully cognizant of the ‘conflicts of interests’ conundrum Sander.

          I don’t have your mathematical background, but reviewing some math now.

          I see that academic statisticians do resort to specific thought leaders. I guess it is inevitable. Thanks for your contributions to the blog.

        • It’s only a uniform(0,1) draw if the statistical model really is an adequate model for the data generation process. In theory you learn from p < 0.05 that your model probably does a poor job of representing the data. That’s actually why failure to reject can often be more useful. You choose a model and don’t reject it with several relevant tests, and then you can proceed to use it as an adequate summary of what you expect from your data.

        • How are you defining “adequate” here? (Perhaps also “relevant”) With a crappy study and correspondingly hard question you want to answer, it’s not clear that any set of tests will justify proceeding as you suggest.

        • Adequate here is defined as passing the relevant tests… so it’s important to choose tests that test things that are important to you.

          This is based on Per Martin-Löf’s definition of a random sequence with a given distribution as one that passes a very stringent test for behaving like a random sequence with that distribution (in his case, the most powerful computable test, a non-constructive thing).

          Is it possible to fake this and confuse yourself? Of course, but for example if you want to know whether runif() in R works like a uniform random number generator, you should run the “dieharder” test suite on that random number generator.

          If you want to know whether your remote-sensing experiment works like a linear trend with a normal(0,1) random error, you could subtract the fitted linear trend, transform the residuals through the normal cdf, and run a strong test of uniformity on the resulting sequence (a rough sketch of this recipe follows below).

          If you want to know whether undergrads in a psych experiment are normally distributed around their opinion about experimental subject X, you’d better not fool yourself by sampling 6 of them and doing something stupid like pretending that this is an adequate test of randomness; you can’t test the frequency distributional shape properties of a procedure on 6 numbers. A continuous frequency distribution is basically an infinite dimensional function, so you can only adequately test that it is approximately a certain shape on a large sample.
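
          A rough R sketch of the remote-sensing recipe above (the data are simulated placeholders, the KS test is only one of many possible uniformity tests, and the check ignores the slight dependence introduced by fitting the trend):

            set.seed(1)
            x <- 1:500
            y <- 2 + 0.03 * x + rnorm(500)   # fake data that actually follow the assumed model
            fit <- lm(y ~ x)
            u   <- pnorm(resid(fit))         # probability integral transform of the residuals
            hist(u)                          # should look flat if the error model is adequate
            ks.test(u, "punif")              # approximate test of uniformity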

        • “You choose a model and don’t reject it with several relevant tests, and then you can proceed to use it as an adequate summary of what you expect from your data”

          …no. This is the most basic and common misuse of goodness of fit tests.

        • The procedure is perfectly fine, it’s the practical implementation by people who don’t know what they’re doing that’s the problem. People try to claim that because they didn’t find a small p value in a test for their favorite distribution, in a tiny sample, that their process must have their favorite distribution.

          An extreme example: I want to test if X ~ normal(0,1) and I draw ONE sample and it’s 0.22, that’s clearly not going to trigger p = 0.0003 in any test so obviously my experiment is a normal(0,1) random number generator! WRONG.

          It’s wrong because you have no power to reject a wide variety of alternative RNGs. On the other hand if you collect 8000 data points and can’t reject the hypothesis that they’re normal(0,1) using a stringent test, you’re pretty good to go ahead and assume that the next 8000 will also be close to normal(0,1).

          The only real mathematical definition we have of a random sequence with a given distribution is one that passes a stringent test for it being a random sequence with a given distribution (essentially all other mathematical definitions have been proven equivalent to this one)
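
          A quick R sketch of that contrast, using a Kolmogorov-Smirnov test of the normal(0,1) hypothesis; the heavier-tailed t(5) alternative is an arbitrary placeholder:

            set.seed(1)
            ks.test(0.22, "pnorm")               # the single observation from the example: no power at all
            ks.test(rt(8000, df = 5), "pnorm")   # 8000 draws from a heavier-tailed t(5): decisively rejected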

        • For *any* goodness of fit test, it’s just an issue of sample size until you reject the null in any real scenario. So if you were to say “I took 8000 samples and failed to reject the null”, the next question is “why not 80,000?”. This isn’t just being annoying; unless you can say “under the minimal divergence from the null that I care about (assuming that larger divergence means more power), I should have power 0.9 with n = 8000”, it’s a really meaningless result. Note that defining a minimal divergence is often super difficult.

          But it’s worse than that. In particular, under most use of NHST on something like a mean parameter, we can actually put a range on a value of interest we are looking at even if we didn’t do the a priori power analysis. So if you take 8,000 samples and get a CI of (-0.01, 0.01) (of course requiring its own set of assumptions), you can at least turn around and say “well, with this sample size, I’m fairly certain that the real value is within a range I consider ‘close enough’ to the null”. But in generic “goodness-of-fit” tests, it’s often difficult to put together a meaningful statistic to make such a call.

          Finally, with goodness-of-fit tests, for any model there is an infinite number of ways you could question whether the model is appropriate, and any test will only examine one particular aspect. To say “this model is adequate because we failed to reject with large samples” should really be “this single assumption we tested seems adequate because we failed to reject in an examination of this single aspect that was properly powered for …”. For example, if Y ~ Binary(p = 0.5), and we test that the mean is 0.5 under the assumption of normality, for large sample sizes we will get a nice uniform(0,1) distribution. But assuming a normal model for binary data is not great in some uses. I realize this is sort of implied by your “stringent test” clause, but how do you define “stringent”? If I wanted to be really “stringent”, I could say “well, a draw from a normal distribution is almost certainly an irrational number, so my test is to conclude non-normality if I see any rational numbers in my sample”. I think that’s pretty “stringent”, but it’s not useful. So in reality, under this paradigm, you would need to start picking the characteristics you care about and test them individually with specific tests that are highly powered under deviations from this assumption (and not other assumptions you don’t care about).

          I think several of these points are implied by earlier comments you have made. But I would emphasize that I think it’s actually *really* difficult to use goodness of fit tests in a useful manner; I would say much more so than testing a single parameter of interest.
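
          A small R simulation of the Binary(p = 0.5) example above: with a large sample the t-test of the mean behaves exactly as advertised even though the data are maximally non-normal, while a test aimed at the distributional shape rejects immediately (R will warn about ties, which is itself the giveaway); the sample sizes here are arbitrary:

            set.seed(1)
            p_mean <- replicate(2000, t.test(rbinom(500, 1, 0.5), mu = 0.5)$p.value)
            hist(p_mean)                          # roughly uniform(0,1): this one aspect looks adequate
            y <- rbinom(500, 1, 0.5)
            ks.test(y, "pnorm", mean(y), sd(y))   # shape test: normality is overwhelmingly rejected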

        • You are right that you can reject the null with a large enough sample size unless you’re testing high quality computational RNGs. My proposal glosses over certain aspects, the important ones being that the tests should be sensitive to the kinds of deviations *that you care about*. When you say “defining a minimal divergence is often super difficult” I don’t consider this to be a problem, I consider it to be the *main job of anyone seeking to model things using stochastic models* which is why I get so annoyed by the typical stuff. People say “Hey mathematical models in science are hard, how about we replace them with this push button cult they fed us in ST101: Intro to Stats For Unsuspecting Undergrads.”

          I’m talking about situations like: we have 10 million satellite images of the earth. If we subset those in Oregon and shuffle them in random order, what is the distribution of the statistic “percentage foliage cover” extracted by our automated image recognition program? And if we bin it into 10 bins, does this 10-bin histogram differ from the 10-bin histogram we would have if we modeled the distribution as a continuous beta distribution with parameters a, b, using a sample size of 100,000? If not, we will be happy to model the next 100,000-image set coming in from the satellite today as if it came from the given beta distribution for the purpose of detecting whether defoliation is occurring at a large scale due to bark beetles. (A rough sketch of such a check follows below.)
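
          A rough R sketch of that kind of binned check; the beta parameters and the simulated “foliage cover” values are placeholders for whatever the real pipeline would produce:

            set.seed(1)
            a <- 2; b <- 5                            # hypothetical fitted beta parameters
            cover <- rbeta(1e5, a, b)                 # stand-in for the extracted foliage-cover fractions
            breaks   <- seq(0, 1, by = 0.1)           # 10 bins
            observed <- table(cut(cover, breaks))
            expected <- diff(pbeta(breaks, a, b))     # bin probabilities under the beta model
            chisq.test(observed, p = expected)        # 10-bin comparison of data vs. model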

        • In particular, I think it’s important to remember that by failing to reject your (various kinds of) goodness of fit tests, you *provisionally* accept the model as adequate for explaining the data generating process, until such time as you encounter a situation in which your stochastic model causes problems… at which point you either develop a new one, or move on to a different kind of model. It’s “as far as we know adequate in these ways” not “yes the data does come from this RNG”

      • Here are some other relevant papers:

        Amrhein, V. & Greenland, S. 2018. Remove, rather than redefine, statistical significance. Nature Human Behaviour 2: 4.
        http://rdcu.be/wbtc

        Amrhein V, Trafimow D, Greenland S. 2018. Abandon statistical inference. PeerJ Preprints 6:e26857v1.
        https://peerj.com/preprints/26857/

        Amrhein, V., Korner-Nievergelt, F. & Roth, T. 2017. The earth is flat (p>0.05): Significance thresholds and the crisis of unreplicable research. PeerJ 5: e3544.
        https://peerj.com/articles/3544/

        • Thanks Shravan for posting those cites, of course I like them all…
          I’d unabashedly add my current favorite:
          Greenland, S. (2017), “The need for cognitive science in methodology,” American Journal of Epidemiology, 186, 639–645, available as free download at https://doi.org/10.1093/aje/kwx259.

          Also recommended:
          Hurlbert, S.H., and Lombardi, C.M. (2009), “Final Collapse of the Neyman-Pearson Decision Theoretic Framework and Rise of the neoFisherian,” Annales Zoologici Fennici, 46, 311–349.

      • Sander:

        “N matters when your test hypothesis or the embedding model (assumption set) for the test is incorrect (which of course is almost always the case), with power to detect alternatives or statistically identifiable model violations going up slowly as N increases.”

        Yes, so what preceded this in your comment is not relevant to “almost all” cases.

        My purpose in emphasizing how sample size enters the definition of the p-value is precisely to enable understanding and explanation of not just power, but also Type M and S errors.

        • Martha, by throwing in “assuming that the null is already true” you voided “and only compared to samples of the same size as in the study.” That is a technical error, and one that touches on the meaning of frequentism as Neyman (1977) viewed it. Again, for P, not even the hypothesis has to be fixed, let alone N: All that is needed is that each P is computed from the correct model (including the hypothesis). This means that if one wants to falsely reject true models no more than 5% of the time over one’s entire career, one can do so by using alpha=5% on average (not even all the time!). Yes this is weird sounding and I don’t find it viable as a rigid philosophy (nothing is), but serves some use in automated environments where nulls and auxiliary assumptions are usually correct and false null rejection is the most costly error.

          In domains more like ours, where all models are false but some might be useful, there are a lot more than Type I, II, M, and S errors – there is the error of failing to formulate and promote interpretations that work well under false models, especially the false assumption of researcher competency in statistics (e.g., no uncontrolled P-hacking). In that view hardly anything in basic textbook and everyday “statistical inference” is relevant.

      • > It’s just technically incorrect to say “the p-value is the probability of obtaining the observed t-statistic or some value more extreme, assuming that the null is already true, and only compared to samples of the same size as in the study.” [….] N does not matter if the hypothesis tested (and the normality assumption) is correct.”

        “compared to samples of the same size of the study” stresses that the sampling distribution of the t-statistic (which the p-value is directly derived from) is the distribution of the t-statistic if this same statistic were to be calculated from alternative realizations of the data *in an identical experiment* (i.e. according to the data generation process specified in the model, and for some specific value of the parameter which represents the null hypothesis).

        In general that means fixing N, but it’s true that this might not be the case if there is a stopping rule (and it’s properly included in the model and the p-value calculation).

        • Sorry, Carlos, that’s true but misses the point: P is constructed so that under the model it tests, dependence on anything else is screened off. This property is rarely recognized, but by throwing in “assuming that the null is already true” Martha voided “and only compared to samples of the same size as in the study.” For P, not even the hypothesis has to be fixed, let alone N: All that is needed is that each P is computed from the correct model (including the hypothesis).

        • I agree with Sander, and it goes along with what I was talking about elsewhere recently about how p values test a thing different from what people think they do.

          The model specifies a random number generator, and the p value gives the frequency with which that RNG would produce more extreme values than the one you actually saw.

          the p value *doesn’t* tell you how often the actual experimental procedure would give values more extreme than the one you saw… it tells you how often an idealized t-distributed RNG would… A lot usually relies on mathematical attractors… as N increases the resulting statistic has a given distribution independent of what the underlying individual data points are distributed like.

          This is where Martha’s statements about same sample size become relevant, in the mapping between the p value, the test, and the relevance of the test to the actual experiment

        • My point is that I think you missed her point… I completely agree that to calculate p, you need to fix the model (null hypothesis, N, etc.). And once p is calculated, it has some meaning (with the usual caveats) making abstraction of those details.

          But the sampling distribution of the statistic underlying the p-value calculation is relevant only in the context of one single model. p-values don’t tell you anything meaningful about how the statistic calculated from one experiment compares to the sampling distribution of the statistic in a different model, let alone a different statistic (unless the distribution happens to be the same, of course).

        • A toy example:

          Some quantity is measured by taking repeated measurements (with error) and keeping only the maximum reading. Assuming the error is normally distributed with variance one, the distribution of the maximum reading is a function of the true value and the number of measurements in the sample.

          Let’s say that we took a sample of size 10 (ten measurements) and the value of the statistic (the maximum of the measurements) is 2. If the null hypothesis is that the true value is 0, the p-value is 0.2.

          “the p-value is the probability of obtaining the observed t-statistic or some value more extreme, assuming that the null is already [by the way, I don’t like this “already”] true, and only compared to samples of the same size as in the study.”

          means that

          “the probability of the maximum measurement being 2 or larger for samples of 10 measurements, assuming that the true value is 0, is 20%”

          It’s the very definition of the p-value, I’m not sure how it’s technically incorrect.

          Of course, if we take a sample of 100 measurements, assuming the true value is 0, the probability of the maximum measurement being above 2 (i.e. the probability of observing a more extreme value for the statistic) is substantially higher (90%).

          Rewriting the previous example in terms of random number generators is left as an exercise for the reader.
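
          One way to do that exercise in R: the null model specifies the random number generator “maximum of n standard-normal errors”, and the p-value is the frequency with which that RNG beats the observed value of 2.

            set.seed(1)
            max_stat <- function(n) max(rnorm(n))      # the RNG implied by the null model (true value 0)
            mean(replicate(1e5, max_stat(10))  >= 2)   # about 0.20: the p-value for a sample of 10
            mean(replicate(1e5, max_stat(100)) >= 2)   # about 0.90: same statistic, different model
            # exact versions: 1 - pnorm(2)^10 and 1 - pnorm(2)^100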

        • To be fair, the original statement was about “the observed t-statistic”, not about statistics in general. But if I’m not mistaken, the sampling distribution for the t-statistic also depends on the sample size…

        • I thought I’d posted this, but it didn’t appear on my screen:

          Carlos: As far as I can see all your math completely misses my points.
          Yes the statistic depends on N and many other study features, and so does P if the tested model is incorrect. I was pointing out that if the tested model is correct, then P no longer depends on all that – it becomes uniform regardless of N.

          So, are you arguing about the technical fact that in the given example P is uniform under the test hypothesis regardless of N or even of what the test hypothesis is? I hope not because (as Daniel saw) that’s just a math fact that her wording overlooked, and that’s what I pointed out first. We can debate the practical importance of such facts but first I’d like to get the math straight.

          Or are you arguing philosophy? In which case you need to read Neyman (Frequentist probability and frequentist statistics. Synthese, 36, 97–131, available in JSTOR) to see your mistake if you think you are correctly characterizing behavioral frequentism of the kind espoused by Neyman, co-inventor of the theory I am discussing. He specifically discusses aggregate error rates across studies that may involve arbitrarily different designs and even different (if perhaps related) hypotheses. His frequentism is a decision theory in which single p-values have no meaning and can even be skipped by seeing if the test statistic falls in the critical region. But if they are used, they are only valid as uniform random variables under the test model.
          Again I don’t hold this philosophy but I nonetheless think it deserves consideration for what it actually says as opposed to subsequent (and often distortive) representations.

          P.S. Breiman once said “there are no Bayesians in foxholes.”
          To which I would add “there are no frequentists either.”

        • I don’t know how to make it clearer, what I’m arguing about is your claim that

          It’s just technically incorrect to say “the p-value is the probability of obtaining the observed t-statistic or some value more extreme, assuming that the null is already true, and only compared to samples of the same size as in the study.”

          I’m not sure if we’re talking pass each other or whether there is an actual disagreement. Would you agree with the following?

          The p-value is the probability of [[[ obtaining the observed t-statistic or some value more extreme in alternative realizations of the study (same experimental setup, same sample size, same analysis) assuming that the null hypothesis (and the whole model) is true ]]]

          If you think that this is also technically incorrect, I think you’re wrong.

          If you think that formulation is essentially different from the original one, that’s a matter of opinion and we can happily disagree.

        • Carlos: Your description is (a) commonly taught and believed, and (b) wrong, period.
          How is it wrong? Logically wrong in the unnecessary insertion of “only”.
          It’s a good example of the Frankenstein monster that results from incoherently merging Fisherian and Neymanian concepts of repeated sampling (which was unrecognized in the textbooks I had, apart from Cox & Hinkley 1974). Since I can’t make you see the logic error, again try reading Neyman (1977).
          If you dropped “only” your description would no longer be logically wrong, it would just be unnecessarily restrictive (and much like what I taught for decades).
          The general misunderstanding is another one we might blame on Fisher’s writing: His focus on finding and using a minimal reference set (exchangeable sequence, “no recognizable subsets”) points in the opposite direction, adding to the study restrictions that a maximal ancillary be identical across the “repetitions” so that inferences are properly conditioned (as some would say, the first step on the road to Bayes).
          Neyman had a different purpose in mind: controlling error rates across entire research domains, so his sampling framework went in the opposite, expansive direction. But even Fisher used this expanded idea in his meta-analytic test combining P-values: Note how that has no requirement for the P-values to come from identical study designs or embedding models; we simply expect them to have the same target hypothesis for the combined result to make much contextual sense.

        • Carlos said,

          “Would you agree with the following?

          The p-value is the probability of [[[ obtaining the observed t-statistic or some value more extreme in alternative realizations of the study (same experimental setup, same sample size, same analysis) assuming that the null hypothesis (and the whole model) is true ]]]”

          I don’t know if this was addressed to me or to someone else (thread has gotten hard to follow), but I think this says what I was trying to say better than what I said. My earlier attempts were sloppy, making the mistake of trying to alter what Shravan said, rather than rewrite the whole sentence.

        • Carlos I agree the word only in Martha’s statement is problematic, the p value is also the probability of getting a more extreme t stat for other sample sizes as well, when you assume the model used to calculate p is true… Of course the model is different for different N but each time we are assuming the model used is true…

        • This discussion is getting difficult to follow, and clearly the typos / missing words / extra words / inaccurate quotes in my comments don’t help. I’ll try to do better starting now.

          Martha: the question was for Sander.

          How to explain to non-statisticians what p-values are is a subject that has been treated in this blog many times. Clearly it’s a difficult task if statisticians cannot agree on what p-values are either!

          I don’t think that your definition (which was a patch on Shravan’s definition) was *wrong*. I tried to rephrase it in a form that I hoped Sander could agree with (but I don’t know yet if he does). It’s good to know that this alternative formulation does indeed correspond to what you meant.

          Daniel: “the p value is also the probability of getting a more extreme t stat for other sample sizes as well”

          This is not true. Assume a normal variable with unknown variance. Your null hypothesis is mu=0. Your study has N=5 and the t-statistic is 1.45.

          You perform a one-tailed test. The p-value that you get is 0.11.

          For a study with N=5, the probability of getting a more extreme t-statistic [i.e. a t-statistic > 1.45] assuming that the null hypothesis is true is 0.11.

          For a study with N=10, the probability of getting a more extreme t-statistic [i.e. a t-statistic > 1.45] assuming that the null hypothesis is true is 0.09.

          For a study with N=25, the probability of getting a more extreme t-statistic [i.e. a t-statistic > 1.45] assuming that the null hypothesis is true is 0.08.

          Only when N=5, as in the original study, is the probability of getting a more extreme value for the t-statistic than in the original study (we got 1.45) equal to the p-value.
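
          Those three tail probabilities can be reproduced with a one-line R check:

            round(pt(1.45, df = c(4, 9, 24), lower.tail = FALSE), 2)   # 0.11 0.09 0.08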

          Sander: “Your description is (a) commonly taught and believed, and (b) wrong, period.”

          I’ll come back to you, but if you don’t mind I’d like to understand what I’m being accused of precisely. Would you be kind enough to tell me what is *my* description?

          As far as I can see my description was

          “The p-value is the probability of [[[ obtaining the observed t-statistic or some value more extreme in alternative realizations of the study (same experimental setup, same sample size, same analysis) assuming that the null hypothesis (and the whole model) is true ]]]”

          but it doesn’t contain the word “only” at all so I’m not really sure if you consider it wrong.

        • Right so each of those numbers is the p value in those conditions… each one has a different distribution and so the numerical quantity changes.

          I think you are talking about the numerical quantity and I am talking about the abstract concept. In every condition where a p value is calculated, it represents the probability that the test statistic takes a more extreme value than the one observed in replications of the experiment. The sample size enters in that it alters the hypothesized distribution. But we always have our assumption being true… So when we change N we change our assumption, and we change our numerical quantity, so that it meets the same definition.

        • Daniel:

          You’re right that I’m talking about numerical quantities. When we say that “the p value is the probability of getting a more extreme t stat [than the one observed]” both the observed t-stat and the calculated p-value are numerical quantities. I don’t know what the point is in giving a “conceptual” interpretation that is not compatible with t-stats and p-values being numerical quantities.

          I don’t know either where you see multiple p-values in my example. There is one single dataset, let’s say it was {-0.975 0.025 1.025 2.025 3.025}, one single t-statistic 1.45 and one single p-value 0.11. Which other numbers are p-values in what conditions? How do you calculate them?

        • Carlos, thinking about my conception of a p value, I could semi-formalize it as follows:

          Any output from a p value calculating procedure is a p value.

          A p value calculating procedure is a procedure that takes a random number generator and an observed quantity (or a quantity calculated from an observed quantity) and returns the probability that the random number generator given would generate a quantity more extreme than the observed quantity under repeated generation.

          Then the quantities you mention in your post 0.09 and 0.08 are legitimate p values in addition to the quantity 0.11 you mention. It’s just that each one is relative to a different RNG (in your case because N changes the degrees of freedom of the t distribution).

          So, when you say “assuming the hypothesis is true” or some such thing… you’re already assuming whatever is necessary to make the newly calculated value “the correct” p value.

          I agree with you then, that the value 0.11 is not the p value for an experiment of size 33 whose test statistic is t=1.45, whereas it is the p value for N=5. However to say that the p value is *only* for experiments of the same N suggests either that:

          1) p values calculations are *always* sample size dependent

          or

          2) The only way of describing “more extreme” is in terms of say the value of t itself.

          It makes sense, for example, that for repeated experiments where you use the same p value generating procedure (including a choice of RNG), the frequency with which you’d be getting less than a given value is in fact p; that is, the process of choosing an RNG based on the details of the experiment is itself correctly calibrated, and the p value measures how “probabilistically extreme” the result is. This is in some sense the purpose of p values: they describe how unusual a given thing is in a way that is independent of the experimental details (assuming a correct “meta” hypothesis of how the details of a given setup map to an RNG).
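
          That “p value calculating procedure” can be written down almost literally; a minimal R sketch, in which the choice of RNG is exactly where the model assumptions enter (the t distributions below just reuse the earlier example):

            p_value <- function(rng, observed, reps = 1e5)
              mean(replicate(reps, rng()) >= observed)    # frequency of results at least as extreme

            set.seed(1)
            # same observed statistic, two different assumed RNGs:
            p_value(function() rt(1, df = 4), 1.45)       # about 0.11
            p_value(function() rt(1, df = 9), 1.45)       # about 0.09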

        • > A p value calculating procedure is a procedure that takes a random number generator and an *observed quantity* ….
          > Then the quantities you mention in your post 0.09 and 0.08 are legitimate p values in addition to the quantity 0.11 you mention.

          The only observed quantities in my example are the data set {-0.975 0.025 1.025 2.025 3.025}. How can you calculate a legitimate p-value equal to 0.09 or 0.08 using this data and a random number generator that makes any sense?

          > However to say that the p value is *only* for experiments …

          I don’t know what it means to say that the p-value “is for” experiments like this or that, I don’t think I’ve said such a thing.

          I’m talking about the definition of the p-value as the probability of getting a more extreme value of some statistic than the one observed, conditional on the model. And I’d say that the only way that the RNG-generated value of a statistic can be more extreme than the observed value of the statistic is in terms of the values of the statistic.

          Believe me, I understand what is the purpose of p-values and how they are useful. But when you’re looking at repeated experiments and their p-values the underlying statistics used to calculate those p-values are already out of the picture.

        • Not so sure it’s worth trying to explain any more, having difficulty tracking all the bits and in the end I know you know what a p value is, so this could only possibly benefit some random 3rd party on the internet, most of whom are probably not even reading ;-)

          Imagine we have something more complicated than a t statistic, like we take a large set of data and we subset it by some categories and we calculate a complicated function of that data.

          Now, under repeated sampling you could have the hypothesis that the repetitions will be, say, Q distributed, so that p = 1 - Qcdf(observed) and you get p = 0.11; and you could have a competing hypothesis that under repeated sampling the repetitions will be, say, R distributed, so that p = 1 - Rcdf(observed) and you get p = 0.09; and it could be quite plausible that either of these is correct, depending on features of the world that we’re not really sure of (like, say, the actual shape of the mass distribution of a certain species of South American leaf frogs…)

          which is the “scientifically correct” p value? It depends on which hypothesis is “correct”. Now if we say “assuming each hypothesis is correct”, which is the correct p value? Then the answer is both of them… it’s another way of saying “which of these was calculated correctly based on the assumptions”, and if Qcdf and Rcdf are programmed correctly… then they’ll give the right answer under the assumptions.

          Now it’s entirely possible that, independent of how big a sample you take, the observed statistic is Q distributed for every sample size… so then it’s not the case that N matters, for example.

          At some point Martha said “…only compared to samples of the same size as in the study” which might be an accurate statement for certain specific cases, but in general does not need to be.

          I’m not sure whether you’re actually defending her wording or not, but I do acknowledge that it isn’t an *essential* part of being a p value that it can only be compared to repeated samples of the same size.

          To give a trivial counterexample, we can say our “test statistic” is the value of the first collected data point… and our test is “whether it is from the assumed normal(0,1) distribution” and in fact the p value here will be completely independent of sample size ;-)

        • Carlos said:
          “I’m not sure if we’re talking pass [past?] each other or whether there is an actual disagreement.”

          I do think we’re talking past each other. In an attempt to clarify: I am trying to give a *definition* of p-value that would be suitable for Shravan’s audience. I am not sure what Sander and Carlos are trying to do.

        • Carlos and Martha:
          First, I apologize for my contribution to the confusion here – the “only” I was objecting to is found in Martha’s initial comment bracketed here:
          [“It is important to note here that the p-value is the probability of obtaining the observed t-statistic or some value more extreme, assuming that the null is already true, and only compared to samples of the same size as in the study.” Adding “obtaining” clarifies the sentence. Adding the phrase “and only compared to samples of the same size as in the study” helps point out why the “conclusion” does depend on the sample size; this is a point that is very often sloughed over.]
          which was then quoted a few times in Carlos’s replies to me. As I have been trying to explain and will continue to try, the “only” makes it wrong. But I was confusing that repeated quote with what Carlos said directly. Sorry.

          That said, I still object to what I see in direct quotes from Carlos:
          a) “The p-value is the probability of obtaining the observed t-statistic or some value more extreme in alternative realizations of the study (same experimental setup, same sample size, same analysis) assuming that the null hypothesis (and the whole model) is true”
          – This is not wrong logically and is common, but it is “wrong” for frequentist statistics in the sense of being too restrictive, as I have been trying to explain.

          To see what I mean by “frequentist”, take a look at Kuffner & Walker, Why are p-values controversial? (Am Stat 2017, DOI: 10.1080/00031305.2016.1277161) which maintains that a P-value is only properly defined as the smallest alpha-level at which the test model can be rejected (as per Lehmann). That makes P the inverse-cdf transform of a sufficient statistic, hence uniform under the test model (a U-value as Andrew calls it, or “valid” P-value to neo-Fisherians). Following Neyman, and as I have been trying to explain, exploitation of this uniform-calibration property does not require the generating experiment or tested model to stay constant; it only requires the model tested in each experiment to be correct for that experiment. In fact were that not the case, frequentist methods would be useless in “soft sciences” because there is never anything like identical replicate studies “alike apart from random variation” (outside of computer simulations)…

          b) “But the sampling distribution of the statistic underlying the p-value calculation is relevant only in the context of one single model. p-values don’t tell you anything meaningful about how the statistic calculated from one experiment compares to the sampling distribution of the statistic in a different model, let alone a different statistic (unless the distribution happens to be the same, of course).”
          That’s true for the original test statistic and precisely why we transform it to a P-value or better yet, an S-value. These statistics do have the same distribution if all the test models (assumption sets) used to compute them are correct (all the P are uniform, all the S are unit exponential = half chi-squared on 2 df). This is the basis for applying frequentist methods that aim to control error rates across a whole research topic.

          To illustrate with an example from a recent CDC experience, suppose the targeted hypothesis H is composed solely of this null: “Flu vaccination in pregnancy does not alter the risk of spontaneous abortion (SAB=miscarriage).” Studies of this target vary quite a bit in design features, not just sample size. For study k, these design features (like the size of the study and the sampling design: cohort, case-control, etc.) are part of the assumption set used to compute the P-value pk from the study. They are thus part of the embedding model Ak (which is the set of auxiliary assumptions used to compute pk).
          The P-value pk is then computed from the test model Mk = H+Ak (where the plus is set union). Among many things we can do with the collection of pk is plot them; if they concentrate somewhere instead of being uniform (as they do in this case) we are alerted that either H is wrong or some elements of Ak are wrong or both, or that there is study-publication bias. Yes that leaves a lot of uncertainty, but that is intrinsic to the test, and reducing it requires getting deeper into the studies and context to better identify what caused the nonuniformity. That will however usually leave considerable residual uncertainty about how much the concentration is due to various sources (including randomness, although the idea that the concentration is purely random is tested by Fisher’s meta-analytic p computed from the 2K df statistic 2s+ = twice the sum of the S-values).
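
          A small R simulation of the calibration point, with sample size standing in (crudely) for varying design: studies of quite different sizes still produce uniform P-values, so long as each study’s own test model is correct.

            set.seed(1)
            one_study <- function() {
              n <- sample(c(10, 50, 200, 1000), 1)   # the design varies from study to study
              t.test(rnorm(n), mu = 0)$p.value       # but each study's test model is correct
            }
            p <- replicate(5000, one_study())
            hist(p)                                  # flat: calibrated despite the heterogeneity
            ks.test(p, "punif")                      # typically no evidence against uniformity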

          Where does all this leave tail definitions of P-values? My view is that those tail-P are not frequentist-interpretable if they are not also U-values – but not only because of the importance of uniformity for error calibration.
          It’s also because that uniformity is the basis for interpreting the S-value -log(p) as the information against the test model supplied by the original test statistic. The latter continuous-information interpretation makes uniformity an important criterion for P-value validity in what I call the neo-Fisherian (information-measurement) view.

          If you disagree with the above on technical grounds that is disagreement on matters of math fact, so hopefully anyone else reading this thread can sort who is correct by doing the math.
          If you disagree on other grounds, I can only imagine that you are not really frequentist, which is fine; neither am I, although I have no problem using frequentist conceptual models and methods as seems appropriate, especially as checks on Bayesian modeling (“Boxian Bayesianism,” also found I think in Don Rubin’s writings). As far as I’m concerned that requires U-values, not just tail area P-values; in particular, posterior predictive P-values are not adequate model checks by this frequentist validity criterion.

    • I just realized (in response to someone else’s comment) that I didn’t say what I intended to say. My suggested revision should have read,

      “the p-value is the probability of obtaining the observed t-statistic or some value more extreme, assuming that the null is true, and only compared to samples of the same size as in the study.”

      i.e., I would have omitted the word “already”.

  7. A small thing I might’ve noticed… I believe Kruschke’s DBA’s 1st edition came out in 2011 and 2nd edition in 2015; in your list of references it is marked to be from 2014.

    • Oops. My ebook edition says “Copyright Elsevier 2014” on the first page, but on the second page it says Copyright 2015, 2011. I’ll just fix the .bib entry; thanks.

      • No problem, it was a fun coincidence: just the other day I was myself preparing a bib-file for which I needed the Kruschke reference, so it stuck out.

  8. Seems like there is not really a p-value or NHST problem so much as a problem in the field itself (psycholinguistics, medicine, etc. etc.) with not being able to do or not wanting to do direct replications or as close to direct replications as possible?

    Justin

    • If studies were properly powered, direct replications reported in the same paper, pre-registration done, data and code released on publication, no p-hacking, no researcher degrees of freedom, and every expt done is somehow reported (instead of the best one of n expts), I think it would not matter much what approach one uses, Bayesian or frequentist.

  9. Having just read pages 37-41, it might be interesting to calculate the prior biases as discussed concisely here – Statistical Reasoning: Choosing and Checking the Ingredients … Entropy 2018 http://www.mdpi.com/1099-4300/20/4/289

    Briefly, sample from the null and carry out the whole adaptive process with fake data to get the frequency of A, B, C, D, E intervals, and then sample from each end point of the ROPE and carry out the whole adaptive process with fake data to get the frequency of A, B, C, D, E intervals.

    With poor design or small sample size you will get poor frequencies, while with good design and larger sample sizes you will get better frequencies. Might be informative.

  10. I like it.

    One nitpicking detail: Strictly speaking the unbiased estimator of sample variance (dividing by n-1) is not the maximum likelihood estimate of the variance (dividing by n; page 4).
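
    A quick R check of that distinction:

      set.seed(1)
      x <- rnorm(10)
      var(x)                   # unbiased estimator: divides by n - 1
      mean((x - mean(x))^2)    # maximum likelihood estimate: divides by n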

  11. Sander,

    I agree with most of what you wrote. Actually the only thing I disagree with is your description of my definition of the p-value as the probability of (…) as “wrong” (whatever the quotes may mean).

    If you think that my definition is too restrictive, could you fill in the dots to propose a less restrictive definition?

    When you refer to the “exploitation of this uniform-calibration property” I think that what you’re saying is obviously correct but unrelated to the discussion (which is about the definition of the p-value).

    An imperfect analogy that may or may not help:

    A] I have a coin.
    (I have an experiment: model including a specific null hypothesis, data, sufficient statistic with value t0.)

    B] I make a statement about the coin being fair.
    (I make a statement about the p-value being x.)

    C] “The coin is fair” means that the sampling distribution for the coin is a sequence of 50/50 tail/head events.
    (“The p-value is x” means that the sampling distribution for the statistic in this particular experiment, i.e. for alternative realizations of the data generated using the same model which includes the null hypothesis, is such that the probability of getting a value of the statistic more extreme than t0 is x.)

    The definition of fairness (p-value) is given in C, what follows is a “property” of fairness which is irrelevant as far as the definition of fairness is concerned. (I won’t give the reformulation in p-value terms to keep things simple.)

    D1] If I’m given the opportunity to bet on 10 tosses of a single fair coin at favourable odds (e.g. head: win $2, tail: lose $1), it’s rational to do so because the expected return is positive ($5, with roughly a 17% probability of ending with a loss).

    D2] If I’m given the opportunity to bet on 1 toss on each of 10 different fair coins at favourable odds (e.g. head: win $2, tail: lose $1), it’s rational to do so because the expected return is positive ($5, with roughly a 17% probability of ending with a loss).

    The “frequentist property” D2 may be “less restrictive” than D1, but this won’t change the fact that a coin being fair is a statement about the sampling distribution of that particular coin.

    If instead of a single fair coin or ten different fair coins we were to pick at random each time a coin with two heads or a coin with two tails the bet would also be favourable but that wouldn’t make either of those coins fair.
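
    A quick R check of the betting arithmetic, reading “win $2 / lose $1” as the net gain per toss:

      n <- 10
      h <- 0:n                         # possible numbers of heads
      gain <- 2 * h - (n - h)          # net return given h heads
      sum(gain * dbinom(h, n, 0.5))    # expected return: $5
      pbinom(3, n, 0.5)                # P(net return < 0), i.e. 3 or fewer heads: about 0.17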

  12. Carlos:
    With apologies, I fail to see the relevance of your coin example to what I’ve been saying.
    To understand why, I think you’ll need to carefully read the citations I’ve been sending.

    I’ve explained several times in several ways what I find “wrong” with the original tail-area definition you present: the “wrong” is from the standpoint of Neyman-Lehmann frequentist theory, where the quotes mean not wrong logically or mathematically, but wrong for use in that theory, which (after all the brickbats thrown at it) today remains one of the standard toolkits and is in wide (mis)use.

    You absolutely must read the TAS article by Kuffner & Walker (KW) that I cited to see the logically different definition of P tailored for that theory, straight from Lehmann’s classic “Testing Statistical Hypotheses” (1st ed 1959): Using a sufficient test statistic, P is the smallest alpha allowing rejection of the test model (KW target a test hypothesis, so like most writers they assume implicitly the rest of the model is correct). And again, read Neyman Synthese 1977 for the underlying idea of using tests across different studies for the calibration sequence. (KW do not however mention that certain insufficient/inefficient statistics may produce a valid P robust to certain auxiliary-assumption violations – a variation on the old bias-variance tradeoff.)

    Also, please take a look at at the lengthy “neo-Fisherian” description and definition I gave on this thread in response to Shravan’s request, which meets both the Fisherian and Neymanian requirements: P = F(D) where F is the cdf of the test statistic D under the test model, which forces P to be
    (1) a tail area, (2) the smallest alpha leading to rejection, and (3) uniform under the test model
    (e.g., see p. 66 of Cox & Hinkley, Theoretical Statistics, 1974).
    Thus, F becomes the f in KW. [NOTE: I hereby correct an error in my definition statement: where I said “take the inverse of F and apply it to D”, I should have said “take F and apply it to D”.]

    • ‘A p-value is a bijection of the sufficient statistic for a given test which maps to the same scale as the type I
      error probability.’ – Kuffner & Walker

      Uh boy this does it. To be frank NOT Frank err Sander I’m trying to keep up lol

    • CORRECTION TO MY NOTE noting correction: P=1-F(D) in continuous case, where F is the cdf of D.
      See Sameera, I’m having trouble keeping up too!

    • I find it amusing that my definition of p-values is logically and mathematically correct but “wrong”. I still don’t understand how it’s different from a “less wrong” definition, but I can live with that.

      I just read the KW paper and it doesn’t seem very interesting to me. The p-value is obviously a statistic, given that it is a function of the data. And it is obviously a function of any sufficient statistic. I’m not sure I can agree that it is bijective (if we consider a two-tailed test in the one-dimensional case, two different values of the statistic will be mapped to the same p-value and the p-value will no longer be a sufficient statistic), but I’m willing to restrict the discussion to one-tailed tests and well-behaved functions to keep things simple.

      It’s a bit misleading to say that “the p-value is not itself defined as a probability” given that the p-value is defined as the critical value of alpha and alpha is defined as a probability. And as far as I can see this “type I error probability” is calculated using the sampling distribution for the sufficient statistic S(X) exactly as in my (too restrictive?) definition of p-values.

      I completely agree with them on their main point: the only use of a p-value in decision theory is to check if it’s above or below a pre-specified significance threshold alpha. Once Tester 1 sets alpha=0.05 any p-value<0.05 will lead to a rejection but the precise p-value is irrelevant. Tester 2 and Tester 3 are doing something else, not Neyman-Pearson hypothesis testing (which is fine because p-values were not invented for decision theory; as they mention decision rules don't require the notion of p-values at all).

      • Carlos: OK, it seems perhaps we’re down to very fine points of disagreement and divergence.

        First, I find KW interesting sociologically. Theirs is a new paper in an official ASA journal and it says:
        “As tempting as it is, especially when teaching nonmathematical undergraduates, to immediately jump to an intuitive explanation of a p-value as a probability, we must recognize that this simplification is undermining the legitimacy of statistical methods in the sciences.”
        That’s a pretty strong attack on what you and I and all basic and mid-level textbooks I’ve seen start with as a definition – a definition they call “an intuitive simplification”!
        At the very least, I hope their statements make you understand the (somewhat bemusing) sense of “wrong” I meant: In the (profoundly narrow) view of KW, it’s a philosophical and even moral wrong to define P-values as probabilities.

        Like everyone else, KW write as if their definition of P-value is the only correct one… But then, their narrowness is no different than what everyone else including you and I have done:
        Present just one definition of a fundamental concept as if it were the only one, when in fact (as I described elsewhere on this thread) there are several nonidentical definitions that have seen common use.
        I don’t think you’ll find such confusion about fundamental definitions of “force” in physics, and when I’ve seen variations in those fields (like “valence” in chemistry) there have been warnings about them.
        In comparison, all you have to do is look across authoritative textbooks to see how statistics is crazily balkanized. Not even “significance level” is defined consistently! Yet some statisticians have the temerity to blame users for being confused. So when statistics presents itself as a mature, settled mathematical field (as in KW and typical textbooks) I think it’s more like a child pretending to be an adult (except not the least bit cute). My view is that the “toolkit” approach to teaching is part of the solution to this problem.

        That brings me back to your initial inclusion of the “identical study repetitions” (ISR) condition in the definition of P-value. Once again that condition is commonly taught and once again it is possible to weaken it considerably by realizing that condition is an unnecessary reification of the reference distribution computed from the test model. And, as Neyman knew, it is essential to drop the ISR condition to get serious empirical mileage out of frequentist concepts. Formally, the ISR condition misses the fact that the model and hence the reference distribution from which P is computed can become a random function. The idea of models and distributions being random is, I think, still rarely covered in first-year probability and statistics, yet it is effectively lurking in random-coefficient and other hierarchical modeling methods.

        Finally, a technical correction to your comment on bijection: For KW, the formal test statistic S(X) in a two-tailed test of (say) mean=mu0 would be the absolute value or square of the standardized distance from mu0, so bijection holds.
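
        To make that concrete, here is a minimal sketch (my own toy numbers, not anything from KW) for a two-sided z-test of mean=mu0: the two-sided p is not an injective function of the signed z, but it is a strictly decreasing, hence bijective, function of S = |z| (equivalently of z squared), mapping [0, infinity) onto (0, 1]:

          from scipy.stats import norm

          def p_two_sided(z):
              return 2 * norm.sf(abs(z))          # two-sided p for a z-statistic

          print(p_two_sided(1.7), p_two_sided(-1.7))   # same p for z = +1.7 and z = -1.7

          # as a function of S = |z| the map is strictly decreasing, hence invertible
          for s in (0.0, 0.5, 1.0, 1.96, 3.0):
              print(s, 2 * norm.sf(s))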

        • Regarding the two-tailed issue, they say that “as a function of a sufficient statistic for a test of some hypothesis, H, a p-value is also a sufficient statistic for that same test.” I understand sufficient statistics to be relative to a model, not to a test. Google doesn’t find me anyone who has used the expression “sufficient statistic for a test” before and I don’t understand what it means.

          In their example of X~normal(theta,1) the sample mean is a sufficient statistic for that model. If we want to do a one-tailed test of H0: theta=mu0, the p-value is indeed a bijective function of the mean and everything is fine and dandy.

          However, if we want to do a two-tailed test the p-value is not a bijective function of the mean. It may be a bijective function of the statistic S(X)=|mean(X)-mu0|, but this is not a sufficient statistic of the original model. Maybe we’re supposed to define a different (and more complex) model Y~f(theta,mu0) for S(Y)=|mean(Y)-mu0| to be a sufficient statistic of that model?

          Maybe I’m missing something, in any case this is just a distraction from the main argument and I’m happy to leave it at that.

        • I myself have been amazed at the confusion caused by the characterizations of p-value and significance level, as Sander has explained in far more detail. There isn’t much in the way of critical thinking/logic in a 1st year statistics course. Some of the types of observations that Sander Greenland has put forth are not necessarily beyond the discernment of a 1st year student. The aura around statistics may compel students to silence their confusion. Perhaps some teachers do so also.

          The adjective ‘sufficient’ was a new one on me though

        • Sameera: Check the Wikipedia article on “Sufficient statistic”. Sufficiency is a concept I’ve not seen discussed in basic stats for researchers, yet it’s fundamental for traditional theoretical and math stat.
          Intuitively, a statistic is sufficient under a model if it captures all the information about the free (explicit) parameters in the model. It’s a model-dependent concept, so if you recall what I’ve written, that means I think it’s dangerous if reified, although it’s useful enough if the model dependence is kept in mind (a small simulation illustrating the idea is sketched below).
          If you really want to get tangled up in the endless controversy over logical consequences of formal abstractions in statistics, check out the conditionality principle (also on Wikipedia). That has a more direct connection to what Carlos and I were arguing about. Basically, hardcore Bayesians and likelihoodists automatically obey the conditionality principle, Fisherians try to obey it with effort, and hardcore Neymanians reject it (in fact one of Neyman and Fisher’s battles from the 1930s is connected to it via causal modeling; I had a 1991 TAS article reviewing that connection and pointing out some logical oddities of the Neymanian view).
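
          A tiny simulation of that intuition for the normal-mean model X ~ Normal(theta, 1) (my own construction, just for illustration): once the sample mean is removed, what is left of the data looks the same whatever theta was, which is the sense in which the mean “captures all the information” about theta.

            import numpy as np

            rng = np.random.default_rng(1)
            n, reps = 5, 100_000

            for theta in (0.0, 5.0):
                x = rng.normal(theta, 1.0, size=(reps, n))
                resid = x - x.mean(axis=1, keepdims=True)   # data minus its sufficient statistic
                # the distribution of the leftover variation does not depend on theta
                print(theta, resid.std().round(3), np.quantile(resid, [0.1, 0.9]).round(3))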

        • Related to the point of “Sufficient statistic” being dangerous –

          “The theoretical literature has suggested that sufficiency and probability models that admit sufficient statistics of dimension less than number of observations are important concepts that have a role to play in what should be reported and what class of models are useful. Fisher was as misleading as any here, suggesting that the log-likelihood (sufficient) be reported in studies so that the studies could later be combined by simply adding up their log-likelihoods. Cox makes a similar suggestion involving profile likelihoods[27]. These suggestions assume complete certainty of the probability model assumed by the investigators – if this should differ between them or at any point come under question (i.e. such as when a consistent mean variance relationship is observed over many studies when assumptions were Normal) – you essentially need all the raw data.” https://phaneron0.files.wordpress.com/2015/08/thesisreprint.pdf

        • Regarding the p-value not being defined as a probability, their definition of the p-value is that it is one particular instance of alpha and their definition of alpha is that it is a probability.

          The p-value that they are getting is obviously a probability, by construction. How can you get something different than a probability when you define a set of probabilities and pick one of them?

          Using their definitions: S(X) is a statistic, T is a test that rejects H when S(X)>c_alpha

          The probability of type I error is alpha = probability( S(X)>c_alpha | H )

          Given data x and a value of the statistic s=S(x) they define the p-value as the lowest alpha such that s>c_alpha
          which is, given that alpha and c_alpha are inversely related, the alpha corresponding to s=c_alpha

          p-value = alphaHAT = probability( S(X)>s | H )

          Unsurprisingly! I don’t think that strongly denying that p-values are defined as probabilities is going to make them easier to understand or use.
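
          A numerical check of the alphaHAT construction above, for an assumed one-sided z-test (reject H when S(X) > c_alpha, with c_alpha the upper-alpha point of a standard normal): scanning a fine grid of alphas, the smallest alpha whose test rejects the observed s coincides with the tail probability, exactly as argued. A sketch only:

            import numpy as np
            from scipy.stats import norm

            s = 1.7                                    # observed value of the statistic S(x)

            alphas = np.linspace(1e-4, 0.5, 200_000)   # grid of candidate significance levels
            c_alpha = norm.isf(alphas)                 # critical values: P( S(X)>c_alpha | H ) = alpha
            alpha_hat = alphas[s > c_alpha].min()      # smallest alpha for which the test rejects s

            print(alpha_hat)       # ~0.0446
            print(norm.sf(s))      # 0.0446..., the usual tail-area probability P( S(X)>s | H )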

        • Carlos: you wrote “The p-value that they are getting is obviously a probability, by construction. How can you get something different than a probability when you define a set of probabilities and pick one of them?”
          My answer: Because what they call “the P-value” is the random variable that I denote P and they denote alpha-hat, which is not a probability but is instead a measurable function on the sample space which ranges over (0,1]. Each realization of P, which I and KW denote p, is computed from a different point on a cdf. These numbers are each probabilities, which does not make P a probability. Following Fisher instead, you define “the P-value” as a realization p rather than the random variable P. The two of you are talking past each other in part because of refusing to recognize this difference in the definition of “the P-value.”

          I emphasize again that pure Neymanian frequentist theory as in KW deals only with random variables; individual realizations are just outputs to trigger decisions. Pure subjective-Bayes theory is the same in that regard, differing in that the random variables are only encodings or mental representations of information on possibly unknown quantities, instead of being some product of a physical random process. In this regard those theories have more in common with each other than with Fisherian or reference-Bayes (“objective Bayes”) theories; that commonality may partly explain why Neyman was more open to de Finetti’s approach as a legitimate rival theory than to Fisherian inference (which he regarded as a confused mess).

        • Sander: They define alpha as a probability (they say it explicitly, it’s not my interpretation). It’s quite reasonable to say that alpha-hat, the critical value of alpha for the observed value of the statistic S(x), is a probability.

          Now, I can see your point about the observed p-value, which is a function of the observed data, being different from the random variable P, which is a function of the data random variable X.

          But I don’t see how this changes the definition of p-values as probabilities. It even makes it clearer! The random variable P is defined as a function of the random variable X, i.e. of alternative realizations of the data generated from this particular model.

          The random variable P is distributed uniformly

          The cumulative distribution function cdf(p)=p

          The “probability of observing a p-value lower than the observed value p” equals p

          The “probability of observing a value for the statistic S(X) more extreme than the observed value S(x)” equals p
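
          A small Monte Carlo check of the statements above, under an assumed toy setup (one-sided z-test of a true null with known variance), showing the simulated random variable P behaving uniformly, so that its cdf at p is p:

            import numpy as np
            from scipy.stats import norm

            rng = np.random.default_rng(0)
            n, reps = 25, 200_000

            x = rng.normal(0.0, 1.0, size=(reps, n))   # alternative realizations of the data, H true
            z = x.mean(axis=1) * np.sqrt(n)            # the statistic S(X)
            p = norm.sf(z)                             # one-sided p-value, a function of S(X)

            for q in (0.05, 0.25, 0.5, 0.9):
                print(q, (p < q).mean())               # empirical cdf(q) is close to q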

        • As I explained, they want “P-value” to mean the random variable P, which is not a probability. You say “It’s quite reasonable to say that alpha-hat, the critical value of alpha for the observed value of the statistic S(x), is a probability.”
          That’s true and continues to miss completely what they are saying: The alpha-hat they define from Lehmann-Romano 2005 is a function of the random variable S(X), not a realization S(x), which makes it the random variable P, not your probability p.
          I admit KW’s wording and notation could have been much clearer, e.g. using S1 and S2 for values (realizations) of S(X) – perhaps the only reason I can decipher KW while you can’t is that I first learned stat at UC Berkeley when Lehmann was chair and his book a sacred text, so their foundation is not alien to me.
          But I’ve clarified what I think is the core problem: You keep insisting a P-value is p and they insist it should be P.

          There are systems of statistics and terminology other than the one you keep repeating, which is common and extraordinarily narrow as well (especially in the utility-negating ISR requirement). All such belief systems are confining in their own way and have plenty of abuse potential as well as some limited utility.
          Statisticians have been fighting and users complaining about significance tests and related concepts for a good century, at least. The inability of most writers of authoritative comments, tutorials and texts to recognize let alone embrace and teach the full conflicts and divergences (even within frequentism or within Bayesianism) is a reason why I would not wager settlement of current disputes will precede peace in the Middle East.
          That’s why I continue to recommend those writers who teach the diversity even when they prefer one approach over others (e.g., Cox, Efron).

        • Had there been a catalogue, in one place, of all the definitions of p-values & significance levels, statistics might have been further along. One could then have seen clearly why there has been confusion. It probably should have happened 30 years ago. Certainly one has to read an awful lot of articles to discern the changes in connotations and emphases within and across different fields.

        • Thanks for your insistence; now the source of the misunderstanding is apparently clear, and in the rest of this comment I will properly distinguish the “random variable p-value” from the “observed p-value”, which is a realization of that RV, to avoid ambiguity.

          It seems that we agree that the “observed p-value” (a function of the “observed x”) is a probability. Each realization of the “random variable p-value” (a function of the “random variable X”) is a probability.

          Hopefully saying that “the *observed* p-value is the probability of obtaining the observed t-statistic or some value more extreme in alternative realizations of the study assuming that the null hypothesis is true” is less “wrong”.

          The random variable is not strictly a probability even if its realizations are defined as probabilities; I can agree with that.

          But if you have a group of people and define a “random variable height” as the height of a random individual, it seems reasonable to say that because the “observed height” is a length, the “random variable height” is a length.

          And if you define the data “random variable five heights” as a sample of five random heights and the “random variable mean height”, which is a statistic of the data, it seems reasonable to say that the mean of heights (which are lengths) is a length and that the “random variable mean height” is a length.

          As I’ve said before, I don’t see the definition of the (observed!) p-value as a probability to be narrow or limiting. I have no problem in defining it as probability (on the sampling space of identical study repetitions) given the observed data *and* understand its use as a random variable for meta-analysis or whatever.

          The probability distribution of the random variable p-value (derived from the distribution of the statistic over identical study repetitions) is of course uniform by construction, so p-values from different studies can be put together (as long as they are “true”!). Fair coins can be mixed and exchanged, but saying that a coin is fair is a statement about the probability distribution for “identical coin repetitions”.

        • I forgot to qualify the last “p-value” in the previous comment, but I guess that reference works for both the random variable and the realizations.

        • OK Carlos, we finally agree on one of my two main points: keep P and p distinct.
          Moving on, I still think you are confused in a way I find very common among stat texts and instructors, which then confuses students:
          We now agree down to where you wrote “the ‘random variable height’ is a length” but no, the RV “height” is not a length any more than a ruler is a length; the RV is an abstraction of a measurement protocol which outputs heights when given an input person. So “height” is a variable whose specific observed outputs are heights, just like P is a variable whose specific observed outputs are probabilities p. Many gloss over this distinction as nitpicking, and I hold doing so is a lazy educational mistake responsible for endless confusion among poor students trying to connect the math to something concrete in the world.

          You said “As I’ve said before, I don’t see the definition of the (observed!) p-value as a probability to be narrow or limiting.” As long as you insert “observed” without parentheses then I don’t disagree. I simply maintain it is an educational mistake to drop “observed,” and a companion mistake to not discuss the random analog P which is the abstract device outputting those observed values p. Why? Because it is P which is the focus of all those error-rate concepts that are the core of frequentist decision theory. When we discuss the inaccuracy of using a ruler as a method to measure (say) finger lengths in a population study, we are not talking about the error in one measurement of one finger, we are talking about how the errors distribute over multiple measurements from the ruler over many fingers on many people. Likewise when we discuss the concept of decision errors from using an NP hypothesis test, we are not talking about the error in using one observed p compared against a fixed alpha, we are talking about how (Bernoulli) errors from such comparisons distribute over multiple measurements from P over multiple nonidentical studies.

          The narrowness I was complaining about is in phantasmagorical “identical study repetitions” (ISR) assumption: That is mistakenly assuming the correct frequency analogy in Neymanian decision theory is multiple measurements on one object. As Neyman took pains to explain, it isn’t the correct analogy. As with failing to distinguish p from P, it’s an incorrect carryover from the less structured Fisherian theory. Sure, multiple measurements of one finger can be done with one ruler to evaluate its “measurement reliability” and that is what the observed p is supposed to reflect; but try getting ISR from real human studies: The thought experiment is tremendously ill-defined physically – what exactly defines “identical”? What is to be held constant and what is to be left to “vary randomly”? Do we have to get down to every quantum state of every particle at the ill-defined moment of treatment administration? The ISR condition is just a way of papering over a gaping hole in the logical connection of sample spaces and sampling distributions to reality. That can be remedied by elaborate discussions of the physical context (as was the tradition in Fisher’s time) but that requires a lot more example details than typical stat teaching offers. ISR caused Fisherians some grief as seen in the debates about conditionality and the likelihood principle. The Neymanian solution was a mathematically elegant one (as usual for Neyman as opposed to Fisher): Cut the assumptions for defining error rates down to a minimum, in particular allowing more realistically that studies are varying in many ways, and we are simply trying to find a random variable P that satisfies the uniformity requirement under the hypothesis we are testing, while concentrating downward maximally under a specified alternative of concern. None of this frees us from the context but at least it lessens the dependence of the connecting story on something as absurd as ISR.

          None of this is to deny the utility of the Fisherian view, only to point out its deficiencies from the Neymanian perspective and get unlocked from it as the only way to formalize frequentism. The deficiencies of the Neymanian view are legion and I would tell KW to get unlocked from it just like I’ve been berating you to unlock from the Fisherian one. To say nothing of the varying deficiencies of each of the 46,000 kinds of Bayesian views…

        • Regarding the “identical study repetitions”, I think this concept is consubstantial with the definition of p-values. Whether you want to directly define P( S(X)>s | H ) or derive it tortuously from error probabilities, you need a sampling distribution from hypothetical alternative realizations of the data conditional on the full model (including H).

          My understanding is that you don’t want these “hypothetical alternative realizations of the data conditional on the model” being referred to as “identical study repetitions” because you think that makes people unable to grasp the frequentist error rates that can be expected from a sequence of unrelated studies.

          These “repetitions” are conceptually different, in a different sampling space. I don’t think one has to distort the definition of p-values to explain how they are useful for reasoning about a sequence of experiments if (and this is a very big if) the p-values are “true” (every assumption is correct, or they are u-values, or whatever).

          This is what my coin example was intended to illustrate. “This coin is fair” is a statement about this particular coin and the random variable modeling flips of this coin and the sampling probability of hypothetical toss repetitions for this coin. This is not a “too restrictive” definition of coin fairness, it’s the only reasonable definition of coin fairness. You cannot define the fairness of a coin from the sampling distribution of other coins.

          If you *assume* that a coin is fair, you can say many things about a sequence of tosses. You’re no longer talking about what makes the coin fair, you’re talking about what a fair coin does. You’re looking at a random variable consisting of fair tosses. And everything works just the same if instead of a fair coin you have many fair coins. You can reason about the “Platonic ideal” of a fair coin. This is what a p-value gets you, but this is not how a p-value is defined.

        • “Yet some statisticians have the temerity to blame users for being confused. So when statistics presents itself as a mature, settled mathematical field (as in KW and typical textbooks) I think it’s more like a child pretending to be an adult (except not the least bit cute). My view is that the “toolkit” approach to teaching is part of the solution to this problem.”
          +1

  13. Carlos: We can agree to disagree on some of these fine points and agree on the rest. I’ll just repeat that you are arguing for a point of view which I once taught uncritically (not the only one I did that with – for a year or so in the early 90s when I introduced Bayesian logic in my courses a few colleagues mistakenly thought I was Bayesian), but which is treated as wrong or misguided from other points of view. Through the conduct of meta-analyses I became convinced that those other viewpoints can be useful, including a more Neymanian one that can be seen as implicit in some of the meta-analysis literature (even though I find Fisherian and Bayesian viewpoints more useful in single study analyses). That is why I was happy to play Devil’s Advocate for Kuffner & Walker.

    Anyway, you hold a very critical view of Kuffner & Walker’s new TAS paper so I seriously urge you to write a letter to the TAS editor presenting the objections you’ve given here. (I occasionally write letters to TAS – unlike med journals they seem eager to publish statistical criticisms, albeit their turnaround time is an order of magnitude longer.)

  14. Shravan:

    My poor scholarship comment was referring counter-factually to myself – if I were to write for a psychology audience, be aware of originating sources by psychologists, and not cite them.

    And I was mostly going with Steve G’s comment and the few pages you asked everyone to try to read. Having looked at the references, there are more than ample citations to the statistical literature.

    Additionally, I was mostly paraphrasing CS Peirce who I recall argued for the value of having all (non-redundant) originating sources cited to make science more communal and connected. Now, Peirce often did not have access to many journals for extended periods.

    By the way, Driver, M. J., and Streufert, S. (1969). Integrative complexity: an approach to individuals and groups as information-processing systems. Administrative Science Quarterly 14, 272–285. http://www.stat.columbia.edu/~gelman/research/published/authorship2.pdf

    • I’m reading Peirce’s biography by Brent. Does this mean that misfits in academia are the originals? What a complicated life, given what I have read and from spending so much time as a kid with academics. Such eclectic thinkers are the fount of expertise per se.

      • Peirce argued the opposite – that it disconnected him from others and allowed him to get too far along in his work making it difficult to get his ideas across. He argued more frequent interactions (mostly publishing papers) would have eased the communication burden.

        • Keith: Thanks for keeping us aware of Peirce. From what you and others have written, I have the impression that (but for a quirk of history) Peirce might have been a founder of modern academic stats – and that it’s worth pondering what his theory might have looked like compared to what became the received theories. I wonder if you’ve done that speculation or read about it and can share any thoughts about it?

        • Certainly some speculation.

          I received this comment from someone who (privately) reviewed Andrew’s and my draft of a philosophy paper [1].

          “It might well provide me with a project as a Peirce scholar, since you provide several Peircean reasons for a Peircean to be open to some Bayesian innovations to which the man himself was inveterately hostile.”

          I think it would be critical to look for material by Peirce written after 1900, and I’m not sure how much there is on stats-relevant topics available yet. Now a “project as a Peirce scholar” probably means (many?) months studying original materials. It would likely be interesting, but is probably best left for when I retire ;-)

          One thing I did check out somewhat was whether Peirce was aware of/thought about priors other than non-informative ones. Now, Stigler argues that Francis Galton had a very different variation on Bayes (i.e. a prior which represents something instead of nothing) which was mostly overlooked until quite recently.

          I have looked a bit for any Peirce mention of that work by Galton – specifically the two-stage quincunx representation of Bayes – without much success. (Peirce did write a paper on the logic piano a couple of years before Galton’s quincunx which covered the machine-reasoning aspect of the quincunx, and that might have made it less interesting for him to write about it.)

          [1] http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

        • The amalgamating draft was dated 2017.

          The earlier link in this thread was Convincing Evidence 2013 – but that has been published.

          Planning to work on the amalgamating draft this summer as soon as the renovations on my house are done.

        • Hi all. I am the Peirce scholar to whom Keith referred. I am very interested in these issues as they arise in my field, though I have let my math skills atrophy too much to be an efficient consumer of this material. But, as Sander suggests, I do think that Peirce provides a nice example of someone mathematically and philosophically sophisticated wrestling with the strengths and weaknesses of both Bayesian and classical approaches before they became more or less standardized. For that reason, he offers a nice opportunity to look at some of these issues afresh.

        • Thanks Jeff – on this topic I’m strictly in the student role and hope to learn more…
          I was especially curious as to whether at some point Peirce anticipated the sort of perspectival and fusion philosophies combining frequentist and Bayes ideas that emerged in the later 20th century (key articles I know on that by Good, Box, and Cox started coming out when I was still a student in the 1970s, just in time to save me from my own dogmatic inclinations).

          Keith said:
          “The earlier link in this thread was Convincing Evidence 2013 – but that has been published.”
          where? what’s the citation?

        • How cool. I was born into a family that was thick with the Cambridge Apostles. So I’m quite interested in what Frank P. Ramsey, purportedly one of the Apostles, drew from Charles S. Peirce. I also think it would be a fun effort, given that approaches have become so narrow and stridently handled. I am surprised that cliques have now formed as a consequence. Good that Sander thought to ask Keith.

    • Great; thanks for this note Keith. I’ve been reading Bayesian clinical trials books (e.g., the Spiegelhalter et al book) and realized there is a lot of stuff I was unaware of (e.g., Kass’s papers on sequential trials and eliciting expert opinion, and the use of ROPE in the 1970s by frequentists). I think I have read the paper you linked to.

      I find this eliciting expert opinion stuff fascinating, but this may be because I was exposed to it at Sheffield and am now biased in its favor.

  15. Thank you to everyone for this discussion! I am learning a lot…

    My favorite paper is also Greenland, S. (2017), “The need for cognitive science in methodology,” American Journal of
    Epidemiology, 186, 639–645, and in that spirit, I’d like to offer an idea. It is speculation, but it somehow concerns the psycholinguistics of null hypothesis statistical testing (NHST).

    On pages 3-7, the authors describe NHST in terms of setting up a null hypothesis and rejecting it to adopt a very specific alternative hypothesis (p. 6):

    “At this point, the NHST procedure switches to an informal reasoning process: we
    assume, post-hoc, that the maximum likelihood estimate ȳ that we happened to get from our
    data can now legitimately replace the infinity of possible values that we posited when we
    stated our alternative hypothesis.”

    Now, there was an older literature in psycholinguistics about how readers understand negation in sentences, much of the work done by Herb Clark, and it had connections to many aspects of reasoning and inference. The model of negation-processing was that when readers understand a negative sentence, they at least briefly entertain a version of that sentence without the negation. In linguistics we might call that a presupposition.

    I wonder whether the negation of a null hypothesis is leading to a cognitive bias to entertain a presupposition. In the case of NHST, the presupposition is something like a contrast set: The null difference is associated with the negation, and this leads us to think there is a relevant set of alternatives. In psycho/linguistics we call this “association with focus”. I wonder whether we consider these alternatives *rather* than questioning whether the data and associated assumptions of the test are appropriate, as outlined by Sander Greenland in some of his comments above about the embedding model Ak.

    As the authors write (p. 6):
    “In other words, NHST does answer a question, but it answers a question that we don’t really want an answer to.”

    It might be that NHST puts the focus on the point null, and in the ensuing negation of that point null, we are led to consider alternatives. Since our ML estimate is available as the alternative, it seems to follow as a counter-assertion to the recently-denied null hypothesis. If we don’t put the auxiliary assumptions in focus, then they are grouped in with the accommodated presupposition of the test.

    I don’t know of any direct evidence for this idea. But I would be fascinated if something like this was going on. I can recommend Horn L. (1989). “A natural history of negation” and Clark’s early work, if you are interested in the linguistics of this.

    • Hey Doug, good to hear from you! This is the first time I saw an analysis of NHST from the pragmatics point of view.

      So the analogy is when we put contrastive focus on Mary in:

      Only *Mary* bought a book.

      This evokes a contrast set, {John, Harry, Dick, Tom,…}. So when we respond with:

      It wasn’t Mary who bought a book

      one could continue with

      Some subset of John or Harry or Dick or Tom or … did too

      but instead we continue with

      Therefore it was Tom who bought a book.

      Sounds about right :)

        • That’s what Andrew means when he says that accepting a specific alternative doesn’t make any sense, the way NHST is used. One thinks that one has some information (the sample mean) on who actually bought the book, but that information can be (as Andrew would say) “super-duper biased.”

        • Shravan

          Re: That’s what Andrew means when he says that accepting a specific alternative doesn’t make any sense, the way NHST is used. One thinks that one has some information (the sample mean) on who actually bought the book, but that information can be (as Andrew would say) “super-duper biased.”

          Binaryness is a habit that is hard to abandon when necessary.

      • Yes, and maybe more importantly, we also accept the presupposition that there *is* a book at all.

        Another example (with apologies to xkcd):

        (1) Jelly beans do not cause acne

        Maybe the appropriate response should be:

        (2) Why would you even think that?

        • Well, there is a stronger presumption of a book in tow, even without more info. Why then even raise it? But yes, we can’t necessarily assume there is one.

  16. (I’m trying this again because my first attempt didn’t show up..)

    Thank you to everyone for this discussion! I am learning a lot.

    My favorite paper is also Greenland, S. (2017), “The need for cognitive science in methodology,” American Journal of
    Epidemiology, 186, 639–645, and in that spirit I would like to offer an idea. It is speculation, but somehow it connects psycholinguistics to null hypothesis statistical testing (NHST).

    The authors describe NHST (p. 3-7) as a procedure of setting up the null hypothesis, possibly rejecting the null, and if so considering an alternative (p. 6):

    “At this point, the NHST procedure switches to an informal reasoning process: we assume, post-hoc, that the maximum likelihood estimate ȳ that we happened to get from our data can now legitimately replace the infinity of possible values that we posited when we stated our alternative hypothesis.”

    Now, there was a literature on negation in psycholinguistics, much of it by Herb Clark, and it has connections to many aspects of reasoning and inference. There was a model for how people understand negation, and it proposed that readers briefly entertain a positive version of the negated sentence. We might call that positive version a presupposition.

    I wonder whether NHST, the way we currently do it, leads to a cognitive bias to entertain a presupposition, *rather* than to question other aspects of the data analysis.

    It seems that the NHST puts an emphasis (a “focus”) on the point null. We entertain a contrast set of alternatives when that null is rejected (“association with focus” of the negation), and the available ML estimate seems to fill the role of completing the presupposition. We seem to do this instead of questioning the auxiliary assumptions discussed above by Sander Greenland in his observations about the embedding model Ak.

    I don’t know of any direct evidence that this is going on. But it would be fascinating if so. If you’re interested in the linguistics of negation, I would recommend Horn, L. (1989) “A natural history of negation”.

    • Horn’s book sounds interesting.

      I think that a basic course in logic, for the sheer exercise in syllogisms, symbolic logic, causal modeling and inference, sharpens one’s discernment of content. Then one should go on to wider critical thinking curricula as well.

      I am thumbing through the two statistics books I mentioned earlier. It seems to me that there needs to be a much more thorough grounding in logic, bottom line. These books don’t really include that component, for I do think it would require special treatment to make sense across the fields that apply statistics.

      And I too requested Sander’s paper a year ago, and to this day I am thinking about how to achieve its true potential practically.

        • Sander, LOL. You have been so smashingly prolific. I recall how I was wondering this due to your giving that lawyer a taste of his own medicine.

          You could easily write 670 pages within a year or two. And you should produce a book, as I urge you to present your work to the general public.

          So in short I think you are a very impressive debater. How’s that?

        • As Andrew would say, no, no, no, no, no, no, no (I know Sander was kidding). This is a very famous, classic text on the linguistic aspects of negation, by a famous linguist, Larry Horn! Negation is really amazing. For example, when you say

          I don’t think that John meant what he said.

          you are saying “I think that John didn’t mean what he said.” How does that meaning get assembled?

          Another cool example is negative polarity items: you can say “I don’t have a red cent” to mean you have no money, but if you remove the negation, you only get a literal meaning: “I have a red cent”.

          Then there’s meta-linguistic negation. It just goes on and on. There are some astonishing puzzles surrounding negation there.

        • Shravan:

          And, in French, << T'inquiète >> means its opposite! << T'inquiète >> is short for << T'inquiète pas >> which is short for << Ne t'inquiète pas >>. I just think that one’s hilarious.

          And, yes, I know there are lots of these in English too. These things are just particularly fascinating to me when they occur in a foreign language.

          Paradoxical negations also occur in science. For example, a non-statistically significant p-value going in the wrong direction is interpreted as meaningful positive evidence! There’s actually a connection to linguistics and psychology here, I think! Remember Grice’s principle that utterances are intended to convey meaning? Similarly, there’s an attitude that all data, no matter how noisy, can lead to strong conclusions. If you start with the conviction that there is important and generalizable meaning in your data, you will find it, one way or another.

        • > “T’inquiète” is short for “T’inquiète pas” which is short for “Te n’inquiète pas”.

          The last one should be “Ne t’inquiète pas”.

        • “there’s an attitude that all data, no matter how noisy, can lead to strong conclusions. If you start with the conviction that there is important and generalizable meaning in your data, you will find it, one way or another.”
          +100

        • Shravan

          I just read a few pages so far. It does look interesting & well written. However, I must say that, for the purposes of statistics, a command of basic and intermediate logic would be more than sufficient. I was just surprised that it was that long. I think Sander is a punster. So I got his joke THIS TIME.

        • From Bob Carpenter’s ‘Type-Logical Semantics’ (p. 13, section 1.2.2 ‘Presupposition’):

          “Often presuppositions are quite subtle to address; it is common to find presuppositions in political debates, advertisements, and other manipulative language.”

        • “Often presuppositions are quite subtle to address; it is common to find presuppositions in political debates, advertisements, and other manipulative language.”
          – That’s one of the biggest understatements I have ever seen!
          “Statistical inference” is always full of presuppositions and highly manipulative language (“significant,” “confidence level,” “unbiased,” “uniformly most powerful,” “optimal,” “coherent,” “probable,” on and on).
          You could say most of what I’ve written to Carlos Ungil on this page is ragging on about that largely unrecognized problem, which is the 8,000-lb Tyrannosaur behind the curtain at the ASA/RSS statistical party.

        • Doug

          Re:From Bob Carpenter’s ‘Type-Logical Semantics’ (p. 13, section 1.2.2 ‘Presupposition’):

          “Often presuppositions are quite subtle to address; it is common to find presuppositions in political debates, advertisements, and other manipulative language.”
          ————-
          Is that the highlight of the book? And the book is going for about $150 on Amazon.

        • Sameera, you wrote “Is that the highlight of the book? And the book is going for about $150 on Amazon.” No, by no means. It’s an amazing book, well worth reading if you are into formal semantics/categorial grammar. You can get it from MIT Press for 53 USD: https://mitpress.mit.edu/books/type-logical-semantics

          But I can scan it for you if you want, I have a copy.

        • Shravan,

          That is very kind of you to offer to scan. Thank you much. No need. I can purchase it when ready to read it. I have a reading list to get through now. I’ll be sure to look at it before I buy it. What did you like about it specifically?

        • Sameera, my memory is that the book is a very good introduction to a version of a theory of syntax and semantics called categorial grammar. How does one assemble meaning from a string of words that are arranged in a particular order? How is the meaning of the string „John loves Mary“ assembled? Loves can be seen as a two-place function, like the + symbol, and assembling the meaning is then just function application. Lambda calculus from math logic. There is some cool stuff there, like how do you build the two meanings of „every man loves a woman“.
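
          A toy sketch of that idea in code (my own illustration, not from the book): "loves" as a curried two-place function, meaning assembly as function application, and the two readings of "every man loves a woman" as two different orders of applying the quantifier functions over a tiny model:

            # "loves" as a curried two-place function: first the object, then the subject
            loves = lambda obj: lambda subj: (subj, "loves", obj)
            print(loves("Mary")("John"))        # ('John', 'loves', 'Mary')

            # toy generalized quantifiers over a small domain
            men, women = {"john", "harry"}, {"mary", "sue"}
            facts = {("john", "mary"), ("harry", "sue")}     # who loves whom in this model

            every_man = lambda scope: all(scope(m) for m in men)
            a_woman = lambda scope: any(scope(w) for w in women)

            # surface reading: every man loves some (possibly different) woman
            print(every_man(lambda m: a_woman(lambda w: (m, w) in facts)))   # True
            # inverse reading: there is one woman whom every man loves
            print(a_woman(lambda w: every_man(lambda m: (m, w) in facts)))   # False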

        • Shravan,

          Now that’s interesting subject matter to cover in a book. I will place it on my reading list. MIT has a very cool catalogue of books. Thanks again for reinforcing its value Shravan.

        • Sander:

          I guess I was echoing Tukey’s famous line, “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

        • Andrew:
          We owe you yet another debt of thanks for the reminder and the strong rephrasing!

          Now if only the blog software weren’t so hostile to my math…(see my last post on this page).

  17. Sander, I think we mostly agree, except that you say I’m confused and I don’t think so. But if I’m confused, what do I know? Maybe some day I’ll be enlightened…

    Let me just split one last hair:

    > Likewise when we discuss the concept of decision errors from using an NP hypothesis test, we are not talking about the error in using one observed p compared against a fixed alpha, we are talking about how (Bernoulli) errors from such comparisons distribute over multiple measurements from P over multiple nonidentical studies.

    I would say that in decision theory the relevant random variable is R, a binary output representing the rejection/non-rejection of each experiment. In the same way that P is a function of X, and P is by construction uniform on [0,1] under the model (hypothetical ISRs assuming the null hypothesis is true), R is a function of P, and R is by construction distributed as p(reject)=alpha, p(non-reject)=1-alpha under the model. But of course p-values have other uses, apart from N-P hypothesis testing. In that case I think that the frequentist properties of p-values from different experiments are conveniently given by their common uniform distribution (which provides a useful abstraction, but nevertheless it’s the result of a calculation based on the hypothetical ISRs for each model).

    I’ve never taught p-values, but I’m not sure that a definition based on decision theory arguments, mapping from values of the statistic to rejection regions and back to critical alpha values, makes the concept easier to understand. If the objective is to do N-P hypothesis testing, I find it easier to show that when the p-value is calculated in the usual way the rejection rule p<alpha has the required type I error probability. If the objective is not to do N-P hypothesis testing, the usual definition is much easier than the rejection-regions based definition. And one can proceed from there, with the observed p-values being realizations of a uniformly-distributed random variable if all the assumptions for each experiment hold.
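
    A quick sketch of that last point under assumed toy setups (a two-sided z-test and two one-sample t-tests of different sizes, all with true nulls and all assumptions correct): whatever the experiment, the rule “reject when p < alpha” fires with probability close to alpha, so the indicator R behaves as Bernoulli(alpha) across quite different studies.

      import numpy as np
      from scipy.stats import norm, ttest_1samp

      rng = np.random.default_rng(2)
      alpha, reps = 0.05, 50_000
      pvals = []

      # experiment 1: z-test of mean=0 with n=10 and known sd=1 (H true)
      x = rng.normal(0, 1, size=(reps, 10))
      pvals.append(2 * norm.sf(np.abs(x.mean(axis=1)) * np.sqrt(10)))

      # experiments 2 and 3: one-sample t-tests with different n (H true)
      for n in (5, 40):
          y = rng.normal(0, 1, size=(reps, n))
          pvals.append(ttest_1samp(y, 0.0, axis=1).pvalue)

      for p in pvals:
          print((p < alpha).mean())                    # each close to 0.05

      print((np.concatenate(pvals) < alpha).mean())    # pooled across the three designs: ~0.05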

  18. Carlos: We’re getting closer, but as far as I am concerned this is still wrong in the strict sense of mathematical logic, and a misleading description of typical practice (despite being common teaching):
    “it’s the result of a calculation based on the hypothetical ISRs for each model”
    – No it isn’t. Statistics calculations are straight from a probability model that requires no frequency interpretation as claimed by ISR. Again, as far as the math is concerned, ISR is a completely unnecessary reification.
    Now, ISR encourages visualization of a perfectly repeatable experiment, which is OK for idealized physics labs. But even in quantum-diffraction experiments ISR is not necessary for the statistics, as the distributions come directly from the physical theory, and the theory does not say you actually have to repeat any observation; it only tells you what patterns will emerge when you do.
    For the reasons I wrote before, by the time we get to human medical research, ISR is usually hopelessly muddled and misleading. Occasionally, the actual probability model may have been deduced from some simultaneous causal (structural) model for treatment assignment and response (justifying ISR as the hypothetical thought sequence you envision); but more often the model is just an off-the-shelf conditional-prior probability specification for the data given parameters. In the latter (prevailing) cases, repetitions of the actual study that generated the data play no role in the math assumptions or deductions leading to the observed statistics. Thus I say ISR is crippling when as here it hides the actual sources of sample spaces and distributions used to compute observed p, confidence limits, etc.: Conventionally assumed distributions (which are selected from whatever journals are willing to accept without close scrutiny, including most of what is in the commercial software manuals). The only way I see the ISR vision helping in this case is by making us contrast its visualized ideal conditions against the messy reality of our study, and thus question the model we’re using.

    • We were getting closer. Or we were realising how close we were, because I don’t think anybody has moved. But now I don’t really know where we stand anymore.

      > Statistics calculation are straight from a probability model that requires no frequency interpretation as claimed by ISR. Again, as far as the math is concerned, ISR is a completely unnecessary reification.

      I declare myself innocent of the reification charges. I was using ISR as a shorthand for the (hypothetical) sampling over the probability model with the assumption of the model being correct, and just in the context of the p-value calculation.

      Anyway, if the probability model doesn’t “require” a frequency interpretation what interpretation of probability should be used?

      > repetitions of the actual study that generated the data play no role in the math assumptions or deductions leading to the observed statistics.

      The math can happily work with a model which bears no resemblance to reality, I agree with that. In that case, the p-value calculation is based on “repetitions of the study that the model describes, which bears no resemblance to the actual study” and I completely agree that everything can be muddy and misleading.

      • > the theory does not say you actually have to repeat any observation; it only tells you what patterns will emerge when you do.

        That’s precisely what the “hypothetical alternative realisations” underlying p-values are about, as far as I’m concerned. You don’t actually have to do them, you just have to consider what would happen if you did.

  19. Carlos: OK, maybe we’re just down to the way we word (no not the old Barbra Streisand song).
    You asked:
    “Anyway, if the probability model doesn’t “require” a frequency interpretation what interpretation of probability should be used?”
    “Should”? Isn’t that at least half of what stat foundation arguments have been about? Which makes it all the more shocking when some write as if there is only one frequency theory or objective theory or subjective theory.
    There are far more theories than even most arguing for one or the other interpretation seem to recognize.
    And contrary to some dismissals as “academic”, the differences matter profoundly for teaching and practice
    – I hold that failure to teach the diversity is a contributing cause of lamented disasters in stat education and practice.
    A given statistical inference or decision theory is built on one or another of the interpretations, and when other interpretations are used to describe the theory content it can be quite confusing unless one is very clear that is what’s being done. We’ve witnessed the consequences of that confusion in the deformed chimera of Fisherian and Neymanian statistics (with outputs misinterpreted as Bayesian) that has been standard in basic teaching and practice since the middle of the last century.
    I can’t see any way out of that problem but to tell the messy truth to everyone (all students and users, not just PhD philosophy of statistics students): There are a lot of competing (and potentially complementary) probability interpretations and statistical theories, here’s a few major ones, here’s how they often get confused, here’s how to tell them apart, here’s how to keep them apart to prevent mistakes, etc.
    For example, in our exchange the sampling model was described as
    a) a math function that doesn’t care what the “probabilities” (unit measures) in it mean (math stat, founded on math probability);
    b) the consequence of physical laws or mechanisms (as in physics), sometimes described as “propensity”;
    c) the consequence or representation of some hypothetical sequence of studies (frequentist, including but not necessarily identical repetitions as in ISR); or
    d) a conditional-prior distribution family indexed by its free parameters (Bayesian).
    That might be a minimal list (knowing that we can get into even finer divisions, e.g., in (d) is the distribution only a personal betting function? if so, do we enforce countable additivity?).
    Even with only this basic classification, I could now roughly characterize my position as that ISR (a frequentist subtype) apparently originated in and was justified by appeal to known physical mechanisms (b), but in typical medical and social research is an ambiguous and misleading fantasy; for those applications, the Bayes interpretation (d) usually seems more appropriate, even though like (c) it could only appeal to mechanisms (b) or observed frequencies for empirical justification.
    (A more radical view I have often heard from Bayesians is that data-frequency models are never anything more than reifications of the data-prior interpretation (d), and now they point to QBism for corroboration. They may be correct, but as Good argued for the physical case (b) it is practical to pretend as if physical laws are about real frequencies.)
    I want to keep all these views distinct and in mind – both as tools for applications and as means of decoding and criticizing monolithic descriptions of statistics and methods (which talk as if only one of these views exists or is always correct). So to the theme of “abandon statistical significance” I would add “abandon monolithic statistics.”

    • Sander, to be sure we are not losing the context: I understand that you’re telling me that “from the standpoint of Neyman-Lehmann frequentist theory” it is “wrong” to say that the p-value and its uniform distribution are “the result of a calculation based on the hypothetical ISRs for each model” because “the calculations are straight from a probability model that requires no frequency interpretation as claimed by ISR”.

      Hence my question: what interpretation do you suggest for the probability that is used to calculate the p-value?

      I didn’t say “should” to suggest no other interpretation existed; it’s an open question, but I don’t see how an epistemic interpretation is going to result in a properly defined p-value random variable which makes sense within the frequentist decision theory framework. At some point one would have to jump from the non-frequency interpretation to the frequency interpretation. To be clear, I think that any physical interpretation (either strictly frequentist or propensity based) allows for a frequentist interpretation where it makes sense to refer to hypothetical replications.

      If we continue the conversation much longer, to “abandon statistical significance” and “abandon monolithic statistics” we will eventually add “abandon all pretense of knowledge”. Maybe we should write “lasciate ogne speranza, voi ch’intrate” on the door of every Statistics 101 class so students know what to expect…

      • Re: ‘If we continue the conversation much longer, to “abandon statistical significance” and “abandon monolithic statistics” we will eventually add “abandon all pretense of knowledge”. Maybe we should write “lasciate ogne speranza, voi ch’intrate” on the door of every Statistics 101 class so students know what to expect…’
        —-

        After one year of reading articles in statistics, I concur with the ‘abandon all pretense of knowledge’. At least abandon what are patently absurd prognostications that are billed as empirically supported statistically.

  20. Carlos:
    You said “I don’t see how an epistemic interpretation is going to result in a properly defined p-value random variable which makes sense within the frequentist decision theory framework. At some point one would have to jump from the non-frequency interpretation to the frequency interpretation.”
    – I agree, I just disagreed with further automatically imposing the ISR (identical study repetition) condition as you did at the start of this thread. The random P variable can be a function of the entire probability model and data, not just of a sufficient statistic under a fixed model (for robustness it need not even use sufficient statistics, a point missed by KW).

    Then you said:
    “To be clear, I think that any physical interpretation (either strictly frequentist or propensity based) allows for a frequentist interpretation where it makes sense to refer to hypothetical replications.”
    – I agree, I just disagreed with the idea that requiring those replications to be ISRs makes enough sense in human medical and social science studies to supply a sound interpretation.

    You asked: “what interpretation do you suggest for the probability that is used to calculate the p-value?”
    I’d reply: For what context and purpose?
    – For teaching, give all we’ve listed and maybe more, accompanied by explanations and illustrations of contexts in which each does or does not make sense or appear useful. Lessons taught would include why “ISR” is no substitute for thinking about the study idiosyncrasies and unmodeled uncertainties that every competent soft-science analysis must face. “Hypothetical study repetitions” not only miss all that but are unnecessary for some interpretations:
    p can be viewed as nothing more or less than a comparison of a prediction to an observation, one that ranges from 0=observations impossible under the model to 1=exactly as predicted, with ample preparation in logic to understand that “exactly as predicted” does not come close to proving the model is correct. And of course I would also transform p to Shannon information s = -log(p) base 2 so students could appreciate the “weakness” of p=.05 without dragging in contextually unsupportable spiked priors and the deceptive Bayes factors those produce (again see the 2009 commentary I cited last week).
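
    As a concrete illustration, here is a minimal Python sketch of that transform (the particular p-values are just illustrative, and the coin-toss analogy is only approximate):

      import math

      for p in (0.25, 0.05, 0.005):
          s = -math.log2(p)  # Shannon information (S-value) in bits against the test model
          print(f"p = {p:<5} ->  s = {s:.2f} bits")
      # p = 0.05 yields only about 4.3 bits -- roughly the surprise of
      # four fair coin tosses all landing heads.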

    “Abandon all pretense of knowledge” strikes me as a superb starting motto for statisticians who want to do competent work in a new topic area. You don’t have to be Bayesian to realize that’s the case. I’ve seen mediocre and even awful medical-study analyses by celebrated statisticians who obviously couldn’t be bothered to glance at the background literature and develop a sensible model specification, and whose subject-matter coauthors lacked the statistics expertise to see the model their famed expert used was contextual garbage (ignoring vital external information and imposing nonsensical constraints).

    My view comes straight from the Boxian recognition that data models are prior distributions, even when their justification rests only on established physical models (never the case in medicine outside of randomization tests in the simplest short-term trials); such firm justification only makes them socially “objective” priors (not to be confused with “objective” reference priors, which are far removed). And Box was in engineering stats, which is a heck of a lot more grounded in established physical laws than is medical or social science. Here’s a related view:
    https://plausibilitytheory.wordpress.com/2015/07/10/what-is-a-statistical-model/

    As for abandoning hope: I hope statisticians will abandon all hype for oracular “inference” philosophies and canned interpretations that deceive instructors as well as users into thinking one can get sensible inferences and decisions out of software with no need to actually understand the context or even the meanings of the variable names, the parameters, or the tested hypotheses.

    • > p can be viewed as nothing more or less than a comparison of a prediction to an observation, one that ranges from 0=observations impossible under the model to 1=exactly as predicted,

      This seems a very strange statement to me. Under the model all the values of p are equally possible, aren’t they? p>0.99 is not more “as predicted” than p<0.01. What are you calling "prediction" and "observation"?

      I can make some sense of the "0=observations impossible under the model" as "observations [more extreme than the one we actually observed] [over hypothetical replications of the experiment or alternative realizations of the data or whatever formulation is acceptable] [are] impossible under the model". But surely this is not what you mean.

      I have no idea about how to interpret the "1=exactly as predicted".

  21. Carlos: Apologies for my unclear wording, I should have said “p can be viewed as a summary of a comparison of a prediction to an observation, one that ranges from 0=observations impossible under the model to 1=observations exactly as predicted.” p is then just a mapping from a distance between prediction and observation into the unit interval [0,1].

    Perhaps an example will help:
    Think of a primordial P-value like that from Karl Pearson’s chi-squared test of fit of a model yielding an expectation vector E for a vector of Poisson counts Y, with standardized-residual vector R = (Y-E)/sqrt(E) and / denoting element-wise division. The random test statistic is then D^2 = R’R.
    Furthermore, the square root of its observed realization is d = sqrt(r’r), none other than the Euclidean distance of the observed-count vector y from the model-predicted E using the standard-deviation units sqrt(E).
    The observed p is 1 when the distance d is 0 (y=E, which requires E to be all integers), whereas p approaches 0 as we approach nearly impossible observations under the model (as when d is over 10 times the square root of the test degrees of freedom) and is zero if some element of E is zero but the corresponding y element is positive.
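
    Here is a minimal numeric sketch of this construction in Python (the counts and expectations are hypothetical, numpy/scipy are assumed, and E is treated as fully specified, so the degrees of freedom equal the number of cells):

      import numpy as np
      from scipy.stats import chi2

      y = np.array([12, 7, 31, 50])            # hypothetical observed Poisson counts
      E = np.array([10.0, 10.0, 30.0, 50.0])   # hypothetical model-predicted expectations

      r = (y - E) / np.sqrt(E)                 # standardized residuals
      d2 = r @ r                               # observed test statistic d^2 = r'r
      d = np.sqrt(d2)                          # Euclidean distance of y from E in sqrt(E) units
      df = len(y)                              # E fully specified, so df = number of cells

      p = chi2.sf(d2, df)                      # observed p: 1 when d = 0, toward 0 as d grows
      s = -np.log2(p)                          # Shannon information in bits
      print(f"d = {d:.3f}, p = {p:.4f}, s = {s:.2f} bits")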

    When the test is against an explicit embedding model, Y is replaced by its fitted value under that model so the distance d is now between the fitted counts from the embedding and test models. (Y is of course the fitted value for the saturated model.)

    Parallel approximate interpretations can be made via the usual expansions for Wald and likelihood-ratio statistics, although the latter also have an interesting direct interpretation in terms of information divergence between the embedding and test models.

    A key point to note in all these cases is that the geometric and information interpretations hold regardless of the probability interpretation, e.g., the fits and distances could all be from betting schedules on Y given E rather than frequencies. I know this sounds weird relative to standard teaching, but the math doesn’t care and neither do I, as I prefer the data-prior interpretation when the physical data generator leading to y is as uncertain as it is in my applications (so that the test becomes an unconditional model check – see p. 641-642 of my 2017 AJE article, The need for cognitive science in methodology, American Journal of Epidemiology 186: 639–645, available free at https://doi.org/10.1093/aje/kwx259).

    • Sander: It seems to me that “p can be viewed as a summary of a comparison of a prediction to an observation, one that ranges from 0=observations impossible under the model to 1=observations exactly as predicted” is still “wrong” because it’s unnecessarily restrictive.

      Your interpretation works in your example because the statistic underlying the p-value is a “distance” to the expected value of something. Then by construction the “far” observations result in low values of p. The value of p will be 1 when the observed value of that something happens to be equal to its expected value. The expected value for each element in the vector R is zero, which is also the most probable value, but it’s highly misleading to say that getting a vector of zeros is “exactly as predicted” when it’s in fact a very unlikely outcome.

      If you think this interpretation is valid in general, how can p be viewed as nothing more or less than “a summary of a comparison of a prediction to an observation, one that ranges from 0=observations impossible under the model to 1=observations exactly as predicted” in the following, much simpler example?

      The model is x~Normal(mu,1), the sufficient statistic is x, the p-value is calculated for the upper-tailed test of the null hypothesis mu=0

      Two scenarios:

      A) x=5, p=0.0000. “Observations exactly as predicted” What was observed? What was predicted?

      B) x=-5, p=1.0000. “Observations impossible under the model” What was observed? What was predicted?
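
      For concreteness, a quick numeric sketch of these two scenarios (scipy assumed), computing the upper-tailed p-value as the probability of X at or above the observed x when mu=0:

        from scipy.stats import norm

        for x in (5.0, -5.0):
            p = norm.sf(x)  # P(X >= x) under the null mu = 0
            print(f"x = {x:+.0f}  ->  p = {p:.7f}")
        # x = +5  ->  p = 0.0000003   (scenario A)
        # x = -5  ->  p = 0.9999997   (scenario B, essentially 1)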

  22. Carlos: Thanks for the interesting example, which shows that even the most basic-sounding question can touch on complex issues (or at least, that it doesn’t take much to provoke a book-length answer from me).

    You wrote:
    “The model is x~Normal(mu,1), the sufficient statistic is x, the p-value is calculated for the upper-tailed test of the null hypothesis mu=0.”
    – Your description here reflects a common usage and is even standard in some quarters…
    and it’s utterly inconsistent with any rigorous logic or geometry of P-values that I know of.

    As I will try and explain, I think your description displays yet another confusion perpetrated by standard stat teaching, especially in the way it becomes utterly unhinged from statistical geometry and thus generates verbal conundrums. And you may be relieved to know that some formal theories of P-values for composite hypotheses are logically inconsistent for exactly the same reason (e.g., see Schervish, TAS 1996).

    First, as with P and p we need to keep random variables and their realizations distinct, here X and x:
    The probability model is for the sufficient statistic X (like “average height”), not some observed-data summary x (like “1.6 meters”).

    Second, for reasons I’ll now attempt to explain, I regard “the upper-tailed test of mu=0” as a misnomer, potentially oxymoronic and definitely confusing as your example will serve to illustrate:
    There is a test which uses the upper tail of the statistic X and it’s of the one-sided hypothesis mu = 21 bits) in the event “D ge 5” against the full model M.

    You then wrote:
    “B) x=-5, p=1.0000. ‘Observations impossible under the model’ ”
    Again it looks like you accidentally switched statements in A and B, because X=-5 is practically impossible if H is false, as reflected by the fact that the point nearest x outside of H is 5 standard errors away.
    “What was observed?”
    X=-5 and D=0 are the observed events.
    “What was predicted?”
    As before, H predicts D=0 given A. That is fixed by the tested model M=H+A.
    Note that P=1 may only reflect violations of H and A that cancel each other enough to make X land in H despite H being false; that is why it is a logical error to claim that P=1 supports H or M.
    All we can say is that s = -log(p) = 0, so the event “D ge 0” supplies no information against H given A, or no information against M (this is Popper, day 1: you can’t logically prove hypotheses just by passing tests, no matter how severe the tests). To claim support requires accepting the auxiliary assumptions encoded in A; and even then to measure relative support requires a likelihood function over the model space defined by A, which is not always available (yet there are applications in which P-values can be constructed even though useable likelihood functions can’t, which is one practical reason why Bayesian methods have not taken over statistics).

    • Blog ate something here – “There is a test which uses the upper tail of the statistic X and it’s of the one-sided hypothesis mu = 21 bits) in the event “D ge 5” against the full model M.”

    • I don’t think the first point is very interesting. Writing each time that X is the random variable, x the observed value, S the statistic (a function of the random variable X), s the observed value of the statistic corresponding to the observed value x, etc., is tiring, and I think we already agree on that.

      I know composite hypotheses, higher dimensions, free parameters, etc. make things much more complicated. But trying to understand the simple cases is hard enough.

      Maybe one-tailed tests for location parameters are inconsistent with your geometric view of P-values. That may be a limitation of your interpretation, unless you want to say these are not valid P-values. In that case the issue is not just that the usual definition and interpretation of P-values is “wrong”; the bigger issue is that you’re talking about a more restricted concept. Reusing the name may not be the best idea.

      P-values are just a summary of where the observed value of a statistic stands in the context of the sampling distribution of that statistic. If the statistic is some kind of “distance” from something, the low p-values are “close” to it and the high p-values are “far” from it. But this is not always the case.

      I guess a similar issue will arise in non-parametric tests, like in a randomization test where the distribution of the statistic is over permutations of the data labels.

      By the way, the “hypothetical realisations of the data from the model” interpretation of p-values doesn’t really work for non-parametric tests. It does in some sense, even if the model is undefined, but we would have to ignore all the hypothetical replications where the data is not a permutation of the observed data. Even for me this seems unacceptable :-)
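
      For what it’s worth, here is a minimal sketch of such a randomization test (the data and group labels are made up): the reference distribution of the statistic comes from permuting the labels of the observed data, and the two-sided p-value is the fraction of label permutations whose statistic is at least as extreme as the observed one.

        import itertools
        from statistics import mean

        data   = [1.2, 0.4, 2.1, 0.3, -0.5, 0.9]  # hypothetical outcomes
        labels = [1, 1, 1, 0, 0, 0]               # hypothetical group labels

        def diff_in_means(lab):
            a = [x for x, g in zip(data, lab) if g == 1]
            b = [x for x, g in zip(data, lab) if g == 0]
            return mean(a) - mean(b)

        observed = diff_in_means(labels)
        # reference distribution: the statistic over all distinct label permutations
        perms = [diff_in_means(p) for p in set(itertools.permutations(labels))]
        pval = sum(abs(t) >= abs(observed) for t in perms) / len(perms)
        print(f"observed difference = {observed:.2f}, permutation p = {pval:.3f}")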

      > There is a test which uses the upper tail of the statistic X and it’s of the one-sided hypothesis mu = 21 bits) in the event “D ge 5” against the full model M.

      I don’t understand what you wrote, I think that several paragraphs may be missing… I can’t really comment on the rest of what you wrote.

      • Carlos:
        Yes the blog software ate several paragraphs. I reposted it a second and then a third time, the last one successfully by identifying the offending symbols.

        First off, I’m sure most people won’t find anything you or I write “interesting” so to say you “don’t find it interesting” means what? You are tired of avoiding category errors like that which characterized your initial objections to Kuffner & Walker? Speaking a foreign language acquired as an adult is tiring, and is not an excuse to butcher the language.

        Next: You said “Maybe one-tailed tests for location parameters are inconsistent with your geometric view of P-values.” No, that’s wrong and I don’t think you’ve fully grasped what I’ve written. Maybe the corrected version of my comments will help: The problem is that what you learned as “one-tailed tests for location parameters” is wrong, based on a faulty logical foundation that treats X=-5 as consistent with X being normal(0,1).

        You also said: “P-values are just a summary of where the observed value of a statistic stands in the context of the sampling distribution of that statistic.” Yes that’s a good description, so good that I already said that elsewhere on this page and cited a published discussion of the idea (Perezgonzalez 2015).

        Then you said: “If the statistic is some kind of “distance” from something, the low p-values are “close” to it and the high p-values are “far” from it.”
        Re-read your sentence: It’s completely backwards, just as in your previous example description.

        “But this is not always the case.” – So what? Without a real example that’s just playing the math-stat game of “I can contrive an exception” (never mind if it involves Cantor-type sets in Banach space). It’s the case in every single real application I’ve seen in 45 years in my field. If you have a real and sensible soft-science example of an exception I’d love to see it.

        As for permutation tests, I’m glad you can finally see my original point in that case, but you apparently are unaware of the causal-structural derivation of such tests. For a discussion of a paradigmatic example of that (Fisher’s exact test) see Greenland S, 1991: On the logical justification of conditional tests for two-by-two contingency tables, The American Statistician, 45, 248-251.

        For others who might still be following our debate, your example serves to illustrate why, after a lifetime of damaging whole literatures and confusing users with deceptive terminology and illustrations, statistics trundles on repeating these mistakes and then blaming users for all the confusion the field created. This goes along with a hidden presumption prevalent among textbooks and instructors that our authorities could not and did not make basic logic and terminologic errors in setting up and operationalizing their systems, and that even if they did we wouldn’t repeat any such errors. Well, sometimes they did make basic errors, and we repeat them and continue to do so; then, those who learned those mistakes as facts continue to promulgate them as truths along the lines of the laws of physics.

        The psychological commitment to these errors then runs deeper than the reach of logic and stronger than the motivation for reform: Witness the big F-you the RSS gave to reformers when it titled its glossy magazine “Significance,” which was then seconded by ASA when it became a co-sponsor. The consequences of denying the importance of these details are not pretty, yet could have been anticipated: As any programmer knows, sometimes a program with one little nitpicky semantic or syntactic error will work fine for a good while … until it crashes a 200-million dollar spacecraft or kills a driverless car passenger or allows 100 million names with social security numbers to be downloaded from a “secure” server. The big difference is that no one in statistics will be held liable for the damages.

        • By “not very interesting” I meant that we had already made that distinction clear before, and I didn’t think this point was relevant for the rest of your argument. Note that after telling me that “as with P and p we need to keep random variables and their realizations distinct, here X and x” you repeatedly wrote things like “P=.0000003 (from D=5)”. It was clear that you meant “p=.0000003 (from d=5)” and no harm was done.

          You’re right that I switched the labels for A) and B) in my example and confused “far” and “close” here. At least I was consistently wrong…

          You’re also right that I’m unaware about many things, thanks for the pointer.

          I can completely agree with your analysis of my example. It was very clear, thank you. The only “problem” is that you used your null hypothesis and your statistic, not the null hypothesis and the statistic that I gave in the example.

          Even if that statistic was not the best one to test that hypothesis (to put it mildly), it can be used to calculate a p-value. I could be mistaken again, but I think that any statistic can be used to calculate a p-value given its distribution under the model, and questions about optimality come later.

          I’m not saying that you should care about idiotic p-values, of course. And I have no interest in defending them either. Anyway we’ve diverged far away from the point I originally wanted to make. Thanks for all your book-length answers, I’ve learned a thing or two.

        • Thanks Carlos for the wrap up…
          I’m really not trying to get in the last word (or so I claim); but, having been trained in logic, probability, & stats at UC Berkeley with a strap across my back and a ruler to my knuckles, and then teaching a course on that for a quarter-century at UCLA (last numbered Statistics M243),
          “P=.0000003” is exactly what I meant:
          The event to which that refers is that “the random variable P took on the value .0000003.” In contrast, the statement you gave is p=.0000003, which says “the previously unspecified number p is in fact .0000003,” which is not an event in our sigma-field.
          Analogy: “P=.0000003” is like saying “the height we observed is 1.6 meters,” while “p=.0000003” is like saying “the number of meters is 1.6.” As programmers (I started in 1970 with Fortran IV on punch cards, so show some respect sonny) we ought to be alert to these type distinctions and their semantic implications. And we should all sleep easier knowing of the innumerable lives saved by obeying this particular distinction.

        • Sander

          Re: ‘I’m really not trying to get in the last word (or so I claim); but, having been trained in logic, probability, & stats at UC Berkeley with a strap across my back and a ruler to my knuckles, and then teaching a course on that for a quarter-century at UCLA (last numbered Statistics M243),’
          =====
          I hope no strap across your back. Oh man that’s TERRIBLE. Hope you have a good masseuse to relieve all that abuse.

        • I managed to convince myself that your consistent use in that message of capital letters to refer to the realized values of the random variables was not intentional… I should have known better.

          If the random variable is X and its realization is x, I expected you to say simply that the realization is x=5. Which seems to be what you did in previous comments, where you wrote the “observed realization is d = sqrt(r’r)” or “the observed p is 1”.

          But it’s true that in this case you wrote that “X=-5 and D=0 are the observed events.” I’m not sure if that means that the realization x is the event X=-5, or that the observed event is X=x where x=-5 is the realized value, or something else.

        • I’m so sorry you had to endure that… I promise to stop now and think about innocent bystanders next time :-)

        • Carlos

          How sweet of you. lol. Actually we were waiting for the Memorial Day weather to let up for a cookout. No, please continue. I have to compliment you on your clear writing style. I’m enjoying the interchange.

          I love reading stuff that I don’t always understand. But sometimes it is very valuable to not be encrusted in a subject.

  24. Well that didn’t work. It appears to be jumping ahead due to a concatenation of symbols.
    Here’s another attempt to get it all replacing the offending symbols with words:

    Carlos: Thanks for the interesting example, which shows that even the most basic-sounding question can touch on complex issues (or at least, that it doesn’t take much to provoke a book-length answer from me).

    You wrote:
    “The model is x~Normal(mu,1), the sufficient statistic is x, the p-value is calculated for the upper-tailed test of the null hypothesis mu=0.”
    – Your description here reflects a common usage and is even standard in some quarters…
    and it’s utterly inconsistent with any rigorous logic or geometry of P-values that I know of.

    As I will try and explain, I think your description displays yet another confusion perpetrated by standard stat teaching, especially in the way it becomes utterly unhinged from statistical geometry and thus generates verbal conundrums. And you may be relieved to know that some formal theories of P-values for composite hypotheses are logically inconsistent for exactly the same reason (e.g., see Schervish, TAS 1996).

    First, as with P and p we need to keep random variables and their realizations distinct, here X and x:
    The probability model is for the sufficient statistic X (like “average height”), not some observed-data summary x (like “1.6 meters”).

    Second, for reasons I’ll now attempt to explain, I regard “the upper-tailed test of mu=0” as a misnomer, potentially oxymoronic and definitely confusing as your example will serve to illustrate:
    There is a test which uses the upper tail of the statistic X and it’s of the one-sided hypothesis “mu less than or equal to 0” (mu le 0), not of mu=0.
    mu=0 is a point hypothesis and thus leads to a “two-tail” test in order to use all the information available against the tested normal(0,1) model in the observation of the event X=x; that test is actually using the single upper tail of the absolute distance variable D=|X| or the squared distance D^2.

    To instead consider composite hypotheses (like mu le 0) we need more details from statistical geometry:
    For continuous data, an embedding model A defines the subset of the sample space that obeys the constraints imposed by A. (In regular extensions allowing discrete data, these embedding models are more accurately described as manifolds in the full-data expectation space that encode prior information about the data generator.)
    The tested model M defines a subset of the embedding model A, which is characterized by the additional constraints imposed by the test hypothesis H.
    In your example, M=H+A where H = {mu le 0}, A = {normal(mu,1) data generator} is the embedding model, and + is set union.
    The observed x is a signed distance of the observed sample mean from the boundary of H, with negative values in the interior of H.
    Intuitively, the information against H is zero when x is negative (falls in H), increasing without bound as the distance x increases into positive values. This intuition is formalized by the S-value S = -log(P) when using D = max(X,0) = pospart(X) as the test statistic, since that yields d as the Euclidean distance of “the data” (as summarized by x) from the test-hypothesis set H within the embedding-model space A of unit-variance normal generators.
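
    To make the numbers in the two scenarios below concrete, here is a minimal sketch (scipy assumed) of this D = pospart(X) construction, with the P-value evaluated at the boundary mu = 0 (the supremum over H) and the S-value in bits:

      from math import log2
      from scipy.stats import norm

      for x in (5.0, -5.0):
          d = max(x, 0.0)                    # Euclidean distance of x from H = {mu <= 0}
          p = norm.sf(d) if d > 0 else 1.0   # P(D >= d) at the boundary mu = 0
          s = -log2(p)                       # information against M = H + A, in bits
          print(f"x = {x:+.0f}: d = {d:.0f}, p = {p:.7f}, s = {s:.1f} bits")
      # x = +5: d = 5, p ~ 0.0000003, about 21.7 bits against M
      # x = -5: d = 0, p = 1, s = 0 bits (no information against H given A)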

    With that essential background, let’s reconsider your “Two scenarios”…
    You wrote:
    “A) x=5, p=0.0000. ‘Observations exactly as predicted’ ”
    No! Did you accidentally switch statements in A and B?
    Because X ge 5 is practically impossible if H is correct, as reflected by the fact that even the point in H closest to x is still 5 standard errors away from x.
    “What was observed?”
    X=5 and D=5 standard errors are the observed events.
    “What was predicted?”
    Since X is the estimator of mu, we could say H={mu le 0} predicts X le 0 given A, or that H predicts D=0 given A
    (you could add “apart from random error” or “stochastically” but that doesn’t clarify the basic geometry).
    Note that P=.0000003 (from D=5) may only reflect violations of A, not H; that is why it is a logical error to claim even P=0 refutes H unconditionally: Refutation is only conditional on auxiliary assumptions encoded in A (this is Popper, day 2).
    All we can say is that, given A, P=0.0000003 almost contradicts H, or that unconditionally there is a large amount of information (s = -log(0.0000003), over 21 bits) in the event “D ge 5” against the full model M.

    You then wrote:
    “B) x=-5, p=1.0000. ‘Observations impossible under the model’ ”
    Again it looks like you accidentally switched statements in A and B, because X=-5 is practically impossible if H is false, as reflected by the fact that the point nearest x outside of H is 5 standard errors away.
    “What was observed?”
    X=-5 and D=0 are the observed events.
    “What was predicted?”
    As before, H predicts D=0 given A. That is fixed by the tested model M=H+A.
    Note that P=1 may only reflect violations of H and A that cancel each other enough to make X land in H despite H being false; that is why it is a logical error to claim that P=1 supports H or M.
    All we can say is that s = -log(p) = 0, so the event “D ge 0” supplies no information against H given A, or no information against M (this is Popper, day 1: you can’t logically prove hypotheses just by passing tests, no matter how severe the tests). To claim support requires accepting the auxiliary assumptions encoded in A; and even then to measure relative support requires a likelihood function over the model space defined by A, which is not always available (yet there are applications in which P-values can be constructed even though useable likelihood functions can’t, which is one practical reason why Bayesian methods have not taken over statistics).

  25. OK that worked. Admin (Andrew?) can you delete the two butchered post attempts?
    at May 31, 2018 at 12:53 pm and May 31, 2018 at 1:54 pm

    – The last (3rd one at 2:02pm) looks OK.

  26. (nesting is too deep to reply directly)

    Sameera
    “Is that the highlight of the book? And the book is going for about $150 on Amazon.”

    No, not at all — It is mainly about categorical grammar. The quote was from the introduction (although it seemed appropriate here), mostly it was a shout-out to Bob.
