Ron Kenett shares a summary of a recent online seminar on statistical significance and p-values. I wasn’t there myself, but I’ve inserted a few references here and there in the discussion below:
The slides and video recording of the event are available in the ENBIS media center. To access them, you need to register with ENBIS (it is free) at www.enbis.org.
https://enbis.org/media-centre/enbis-webinar-statistical-significance-and-p-values/
The video recording is also available at https://www.youtube.com/watch?v=2mWYbcVflyE&t=10s
The event consisted of 3 talks followed by a round table discussion. Abstracts are listed in https://conferences.enbis.org/event/22/. Below are the talk titles and some points made in each talk.
Talks:
Talk 1: Daniel Lakens: P-values in a Neyman-Pearson Framework
- The process of claim making based on p < alpha does not depend on an individual’s personal beliefs.
- When we “accept” or “reject” a hypothesis in a Neyman-Pearson approach, we do not communicate any belief or conclusion about the substantive hypothesis.
Talk 2: Bernard Francq and Ron Kenett: P-value, s-value, B-value, D-value,… what else? Beyond the t-test and p-values
- In contrast to p-values, Individual Success Probabilities (ISP) remain constant with growing sample sizes. Tolerance intervals are relatively unknown in pharma statistics.
- The communication dimension in information quality is key to effective statistical analysis.
Talk 3: Stephen Senn: Trends towards significance
- One does not see reports of “trends towards non-significance”. Why only report “trends towards significance”? Neither makes sense.
- In the Neyman-Pearson approach, the alternative hypothesis dictates the choice of test statistic (H1 → Test statistic) and the justification of likelihood lies in power. For Fisher, the test statistic is chosen on the basis of experience (Test statistic → hypotheses). Likelihood is fundamental and power is irrelevant. [I don't like talking about "power," as it is tied too closely to "statistical significance," but I do think that design analysis is relevant to understanding how much we can learn from an experiment. — AG]
Discussion:
Øystein:
The ASA statement on p-values has not discussed the topic “Recognize the difference between practical and statistical significance”. I think that is a very important topic since many statistically uneducated researchers, possibly more than 50% of them, are not aware of the difference, which often results in poor research. I guess that far more than 50% of the Norwegian physicians who know that the concept of p-values exists believe that a small p-value resulting from a clinical trial implies that a clinically important finding has been made, and that the smaller the p-value is, the more clinically important the finding is. [Regarding “statistical significance and practical significance,” see here — AG]
Elena:
I am a social scientist who recently started working in healthcare research. I see studies based on convenience sampling (e.g. online surveys or patients in treatment centers that were not selected at random) and they use significance testing (p-values, confidence intervals, even regressions). Is it correct to use such statistics on non-probability samples? [To the extent it’s ok to use these methods at all, yes, it’s fine to use them on non-probability samples. Which is a good thing, considering that in real life almost all we ever see are non-probability samples! — AG]
Mircea:
It may be worthwhile to have a discussion on the *different* types of p-values that one can report (some ok, like s-values, others less so, like p-rep); for instance the recent paper by Gibson (2021), 10.1080/19466315.2020.1724560, which suggests a transformation to estimate the replicability of a study (RP hat); not to mention others like harmonic mean p-values, calibrated p-values, and exact p-values.
Ahmadou:
https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch13/nonprob/5214898-eng.htm
Sammeera:
I like Ian Harris’ views. Australian orthopedist. He has an interesting new book for general audiences.
Yoav:
First, the 43 papers and editorial mentioned do not represent an official ASA position. For a later balanced point of view see: The ASA president’s task force statement on statistical significance and replicability.
Yoav Benjamini, Richard D. De Veaux, Bradley Efron, Scott Evans, Mark Glickman, Barry I. Graubard, Xuming He, Xiao-Li Meng, Nancy Reid, Stephen M. Stigler, Stephen B. Vardeman, Christopher K. Wikle, Tommy Wright, Linda J. Young, Karen Kafadar. Ann. Appl. Stat. 15 (3), 1084-1085 (September 2021). DOI: 10.1214/21-AOAS1501. [I’m not sure what it means to say that this later statement is “balanced,” but in any case I think it has serious problems; see the comments of Megan Higgs and also my comments. I guess it’s good they didn’t ask Megan or me to be on that committee or maybe a report would never have been written! — AG]
Second, there are two statistical pillars to replicability: Addressing selective inference even when evident in the paper, and hunting out the real uncertainty, such as the lab by intervention interaction in animal preclinical trials.
Senn:
Yoav raises a very interesting point about lab by treatment interaction. This is important when one moves from local causal analysis to prediction. A paper I like is Youden J. Enduring values. Technometrics. 1972;14:1-11.
Senn:
I think that there is a danger in concentrating on individual patient measurements and assuming that these are true long-term values. Nothing in Bernard’s original presentation proved that there was any long-term difference between patients apparently above or below 140 mm Hg. To prove that, you have to measure patients more than once. Suppose that everything is within-patient variation? I disagree that the D-value can be interpreted in this way. See the comment by Greenland and Robins on Demidenko’s paper.
Elena:
Interesting talks! Thank you. In financial statistics (or banking) we “suffer” from the same use, and perhaps misuse or misinterpretation, of p-values that the talks illustrated. Oftentimes people are very happy with p-values approximately equal to zero (as in the first example from the talks).
Zippi:
If you ask medical doctors what minimal difference would cause them to change patient treatment, they usually understand the question and can give an answer. My two cents.
Eugeniu:
Hello everyone! Great talks! I have just a comment: The problem is that people try to summarize a lot of information about the statistical analysis in one parameter. A statistical analysis cannot be reduced to one parameter or a very small set of parameters. The p-value is a useful parameter, but it is one of many parameters that can be used in a statistical analysis. That’s why statistical analysis is not a task, it is a job.
Nicholas:
I don’t like ‘adjusted’ or ‘corrected’ p-values that aim to conserve the family-wise error rate, not only because they result in large Type-II error rates but also because I think it’s rarely possible to identify a natural, appropriate family of tests on which to base the adjustment. Moreover, such p-values are often conservative even on their own terms, due to positive correlations among the tests. Another good reason not to use them. Instead, I advise people just to report single-test p-values, but to be aware of, and open about, the multiple testing that they are engaged in. I *do* like the false discovery rate. However, I don’t regard this as a p-value of any kind, as it is not calculated conditional upon a null hypothesis. It is a fundamentally different concept, and this should be recognised.
I think the distinction between hypothesis-generating and hypothesis-validating studies is useful in deciding how to deal with multiplicity: in the former, p-values can be of use for ranking hypotheses (as per David Cox’s point), and one need not focus too much on Type-I error rates; in the latter, there should be much more focus on the Type-I error rate. See Leonhard Held’s work on what constitutes replication of a finding: https://www.crs.uzh.ch/en.html .
The point made by Cori about analyses of large data sets, e.g. in finance, from which all the p-values turn out highly significant, is an important one. This usually happens not because the effect apparently detected is important but because the model is slightly wrong: there is slight unmeasured confounding, or a slight departure from linearity, which makes some other effect significant because the sample is so large. In this situation David Cox’s argument that the p-value provides a calibrated measure of something worthwhile becomes harder to sustain. I don’t know what’s the right thing to do about this.
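The contrast drawn earlier in this comment, between adjustments that control the family-wise error rate and the false discovery rate, can be made concrete with a small sketch. It assumes Bonferroni as the family-wise adjustment and the Benjamini-Hochberg step-up procedure as the false-discovery-rate method; neither is named in the comment, and the p-values are made up for illustration.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (targets the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Made-up single-test p-values, for illustration only.
pvals = [0.001, 0.008, 0.012, 0.019, 0.030, 0.045, 0.20, 0.45, 0.70, 0.90]

unadjusted = [p <= 0.05 for p in pvals]
bonf = bonferroni_reject(pvals)
bh = benjamini_hochberg_reject(pvals)

print("unadjusted rejections:", sum(unadjusted))
print("Bonferroni rejections:", sum(bonf))
print("Benjamini-Hochberg rejections:", sum(bh))
```

With these made-up numbers the family-wise adjustment rejects far fewer hypotheses than either the single-test threshold or the false-discovery-rate procedure, which is the Type-II cost the comment describes.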
I like shrunk estimates from mixed models, which take into account both the estimated magnitude of an effect and the uncertainty about the estimate. But I’m not sure how applicable these are to the high-significance-caused-by-very-large-n situation. [You might like our paper, Why we (usually) don’t have to worry about multiple comparisons — AG]
Kenett:
David Cox presents four applications of p-values summarized below. See https://www.annualreviews.org/doi/abs/10.1146/annurev-statistics-031219-041051?journalCode=statistics and https://www.youtube.com/watch?v=txLj_P9UlCQ
“The first, corresponds closely to the considerations underlying the Neyman–Pearson theory of testing hypotheses. On the basis of data y, we have to decide either to accept or to reject H. The test statistic is chosen to maximize power. The endpoint of use is not the assessment of uncertainty in a conclusion but a decision to accept or reject. Here, a routine repetitive decision between acceptance or rejection is considered. In a second application, the hypothesis H is of specific subject-matter interest. In this context, the p-value is an objective measure of uncertainty, calibrated against performance under hypothetical repetition. In a third application, the hypothesis H divides the possible parameter values into two sets, for example, A giving larger mean response than B versus B giving larger mean. Have the data clearly established the direction of an effect? A two-sided test of significance may be used. In effect, the outcome of the test specifies the level at which a confidence interval for the difference contains only points of the same sign. A fourth application is mostly informal. Significance tests signal which apparently anomalous features of data are beyond those expected under the inevitably oversimplified initial model.” (edited to be self contained).
The directional aspect of the third application of p-values described by Cox leads to a verbal description of claims. In general, Statistics has not addressed this type of inference. For a proposal invoking alternative representations of claims and Sign type errors see https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070 [and here! — AG]
Kenett:
Three comments
1. Generalizability: When making a claim, ask the question: is it generalizable? One states that a claim is generalizable on the basis of first principles, intuition or transportability analysis. This perspective encourages a constructive discussion on interpretation of findings. See for example https://pubmed.ncbi.nlm.nih.gov/34893185/ [This relates to hierarchical models. — AG]
2. Presentation of findings: Many findings are presented verbally. How do you do that? The statistics community has mostly ignored this question. A proposal based on alternative representations is presented in https://link.springer.com/article/10.1007%2Fs11192-021-03914-1 [There’s been lots of recent work on displaying inferences graphically! — AG]
3. Clarification of terms: There should be a distinction between “Reproducibility”, “Repeatability” and “Replicability”. Without semantic clarity, the discussion is cacophonic. https://pubmed.ncbi.nlm.nih.gov/26226358/ . Reproducibility is referred to by R. A. Fisher as the ability to design an experiment which would produce similar claims. This is different from repeatability which aims at getting the same results in repeated measurements under identical conditions.
On this one:
“I see studies based on convenience sampling (e.g. online surveys or patients in treatment centers that were not selected at random) and they use significance testing (p-values, confidence intervals, even regressions). Is it correct to use such statistics on non-probability samples? [To the extent it’s ok to use these methods at all, yes, it’s fine to use them on non-probability samples. Which is a good thing, considering that in real life almost all we ever see are non-probability samples! — AG]”
The thing is that a significant rejection of the null hypothesis rejects the null model with all its implications. In fact testing does *not* assume the model to be true (otherwise how informative would it be to reject it?), and whatever the data are, randomly sampled or not, it may be of interest whether they look consistent with a certain model (note that all our models are “wrong”, so it may be informative to use a certain model even if we know it is violated in a certain respect).
It is however important when interpreting the result of the test that rejection can have reasons other than the alternative being true, so that in a situation like this, convenience sample and significant rejection of H0, it is very relevant to ask how the specific way the convenience sample was chosen may have violated the null model and led to rejection. (Note also that not all violations of the null model will lead to rejection, so here we may learn something about the specific effect of convenience sampling in the study of interest.)
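As a toy illustration of this point, here is a sketch of a setting where the null hypothesis (population mean zero) is true, yet a test on a convenience sample rejects much more often than the nominal 5%. Everything in it is a made-up assumption: the normal population, the logistic selection mechanism with its `tilt` parameter, and the large-sample z-test.

```python
import math
import random
import statistics


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def one_sample_p(sample):
    """Large-sample z-test of H0: population mean = 0."""
    n = len(sample)
    z = statistics.fmean(sample) / (statistics.stdev(sample) / math.sqrt(n))
    return two_sided_p_from_z(z)


def random_sample(rng, n=200):
    # Population outcome is N(0, 1), so the null hypothesis is true.
    return [rng.gauss(0, 1) for _ in range(n)]


def convenience_sample(rng, n=200, tilt=0.5):
    # Hypothetical selection mechanism: units with larger outcomes are
    # more likely to end up in the sample (logistic inclusion probability).
    sample = []
    while len(sample) < n:
        y = rng.gauss(0, 1)
        if rng.random() < 1 / (1 + math.exp(-tilt * y)):
            sample.append(y)
    return sample


def rejection_rate(draw_sample, n_sims=500, alpha=0.05, seed=1):
    """Share of simulated samples in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = sum(one_sample_p(draw_sample(rng)) < alpha for _ in range(n_sims))
    return rejections / n_sims


rate_random = rejection_rate(random_sample)
rate_convenience = rejection_rate(convenience_sample)
print("rejection rate, random sample:     ", rate_random)
print("rejection rate, convenience sample:", rate_convenience)
```

In this sketch the rejections on the convenience sample say something about how the sample was selected, not about the population mean. Setting `tilt=0` removes the selection effect, and the two rates should then be similar.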
In line with Christian, I think we need to be much more upfront about the strength/credibility of an analysis based on how wrong the assumptions are (the likely degree of mismatch between a possible world represented by the assumptions versus our best grasp of the world).
This issue is hard to make precise but XL Meng has done it for random sample versus convenience sample surveys – a small random sample survey has less risk of misleading than a huge convenience sample survey.
Advice like “To the extent it’s ok to use these methods at all, yes, it’s fine to use them on non-probability samples” obscures such critically important differences.
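The comparison of a small random sample with a huge convenience sample can be sketched with a toy simulation. This is not Meng's derivation, only an illustration under made-up assumptions: a finite normal population, a hypothetical response mechanism in which units with larger outcomes are more likely to respond, and the sample mean as the estimator.

```python
import math
import random
import statistics

rng = random.Random(2024)

# Hypothetical finite population; the target is its mean.
N = 100_000
population = [rng.gauss(0, 1) for _ in range(N)]
true_mean = statistics.fmean(population)

# Small simple random sample of 100 units, repeated 200 times to estimate
# the root-mean-square error of its sample mean.
n_small = 100
srs_errors = [
    statistics.fmean(rng.sample(population, n_small)) - true_mean
    for _ in range(200)
]
srs_rmse = math.sqrt(statistics.fmean([e * e for e in srs_errors]))


def responds(y, tilt=0.5):
    # Hypothetical response mechanism: about a quarter of the population
    # responds, and units with larger outcomes are more likely to respond.
    return rng.random() < 0.5 / (1 + math.exp(-tilt * y))


# Huge convenience sample: everyone who happens to respond.
convenience = [y for y in population if responds(y)]
convenience_error = statistics.fmean(convenience) - true_mean

print("convenience sample size:     ", len(convenience))
print("RMSE of small random sample: ", round(srs_rmse, 3))
print("error of convenience sample: ", round(convenience_error, 3))
```

The convenience sample here is hundreds of times larger than the random sample, but its error is dominated by the selection effect and does not shrink as the sample grows.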
We are not quite there yet for other forms of analysis, such as Bayes with flat priors versus weakly informative priors versus informative priors, or large/informative samples versus small/uninformative samples. Bayesian Workflow is just now progressing towards dealing with these.