Thinking about “Abandon statistical significance,” p-values, etc.

We had some good discussion the other day following up on the article, “Retire Statistical Significance,” by Valentin Amrhein, Sander Greenland, and Blake McShane.

I have a lot to say, and it’s hard to put it all together, in part because my collaborators and I have said much of it already, in various forms.

For now I thought I’d start by listing my different thoughts in a short post while I figure out how best to organize all of this.

Goals

There’s also the problem that these discussions can easily transform into debates. After proposing an idea and seeing objections, it’s natural to then want to respond to those objections, then the responders respond, etc., and the original goals are lost.

So, before going on, some goals:

– Better statistical analyses. Learning from data in a particular study.

– Improving the flow of science. More prominence to reproducible findings, less time wasted chasing noise.

– Improving scientific practice. Changing incentives to motivate good science and demotivate junk science.

Null hypothesis testing, p-values, and statistical significance represent one approach toward attaining the above goals. I don’t think this approach works so well anymore (whether it did in the past is another question), but the point is to keep these goals in mind.

Some topics to address

1. Is this all a waste of time?

The first question to ask is, why am I writing about this at all? Paul Meehl said it all fifty years ago, and people have been rediscovering the problems with statistical-significance reasoning every decade since, for example this still-readable paper from 1985, The Religion of Statistics as Practiced in Medical Journals, by David Salsburg, which Richard Juster sent me the other day. And, even accepting the argument that the battle is still worth fighting, why don’t I just leave this in the capable hands of Amrhein, Greenland, McShane, and various others who are evidently willing to put in the effort?

The short answer is I think I have something extra to contribute. So far, my colleagues and I have come up with some new methods and new conceptualizations—I’m thinking of type M and type S errors, the garden of forking paths, the backpack fallacy, the secret weapon, “the difference between . . .,” the use of multilevel models to resolve the multiple comparisons problem, etc. We haven’t been just standing on the street corner the past twenty years, screaming “Down with p-values; we’ve been reframing the problem in interesting and useful ways.

How did we make these contributions? Not out of nowhere, but as a byproduct of working on applied problems, trying to work things out from first principles, and, yes, reading blog comments and answering questions from randos on the internet. When John Carlin and I write an article like this or this, for example, we’re not just expressing our views clearly and spreading the good word. We’re also figuring out much of it as we go along. So, when I see misunderstanding about statistics and try to clean it up, I’m learning too.

2. Paradigmatic examples

It could be a good idea to list the different sorts of examples that are used in these discussions. Here are a few that keep coming up:
The clinical trial comparing a new drug to the standard treatment. “Psychological Science” or “PNAS”-style headline-grabbing unreplicable noise mining. Gene-association studies. Regressions for causal inference from observational data. Studies with multiple outcomes. Descriptive studies such as in Red State Blue State.

I think we can come up with more of these. My point here is that different methods can work for different examples, so I think it makes sense to put a bunch of these cases in one place so the argument doesn’t jump around so much. We can also include some examples where p-values and statistical significance don’t seem to come up at all. For instance, MRP to estimate state-level opinion from national surveys: nobody’s out there testing which states are statistically significantly different from others. Another example is item-response or ideal-point modeling in psychometrics or political science: again, these are typically framed as problems of estimation, not testing.

3. Statistics and computer science as social sciences

We’re used to statistical methods being controversial, with leading statisticians throwing polemics at each other regarding issues that are both theoretically fundamental and also core practical concerns. The fighting’s been going on, in different ways, for about a hundred years!

But here’s a question. Why is it that statistics is so controversial? The math is just math, no controversy there. And the issues aren’t political, at least not in a left-right sense. Statistical controversies don’t link up in any natural way to political disputes about business and labor, or racism, or war, or whatever.

In its deep and persistent controversies, statistics looks less like the hard sciences and more like the social sciences. Which, again, seems strange to me, given that statistics is a form of engineering, or applied math.

Maybe the appropriate point of comparison here is not economics or sociology, which have deep conflicts based on human values, but rather computer science. Computer scientists can get pretty worked up about technical issues which to me seem unresolvable: the best way to structure a programming language, for example. I don’t like to label these disputes as “religious wars,” but the point is that the level of passion often seems pretty high, in comparison to the dry nature of the subject matter.

I’m not saying that passion is wrong! Existing statistical methods have done their part to slow down medical research: lives are at stake. Still, stepping back, the passion in statistical debates about p-values seems a bit more distanced from the ultimate human object of concern, compared to, say the passion in debates about economic redistribution or racism.

To return to the point about statistics and computer science: These two fields fundamentally are about how they are used. A statistical method or a computer ultimately connects to a human: someone has to decide what to do. So they both are social sciences, in a way that physics, chemistry, or biology are not, or not as much.

4. Different levels of argument

The direct argument in favor of the use of statistical significance and p-values is that it’s desirable to use statistical procedures with so-called type 1 error control. I don’t buy that argument because I think that selecting on statistical significance yields noisy conclusions. To continue the discussion further, I think it makes sense to consider particular examples, or classes of examples (see item 2 above). They talk about error control, I talk about noise, but both these concepts are abstractions, and ultimately it has to come down to reality.

There are also indirect arguments. For example: 100 million p-value users can’t be wrong. Or: Abandoning statistical significance might be a great idea, but nobody will do it. I’d prefer to have the discussion at the more direct level of what’s a better procedure to use, with the understanding that it might take awhile for better options to become common practice.

5. “Statistical significance” as a lexicographic decision rule

This is discussed in detail in my article with Blake McShane, David Gal, Christian Robert, and Jennifer Tackett:

[In much of current scientific practice], statistical significance serves as a lexicographic decision rule whereby any result is first required to have a p-value that attains the 0.05 threshold and only then is consideration—often scant—given to such factors as related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain.

Traditionally, the p < 0.05 rule has been considered a safeguard against noise-chasing and thus a guarantor of replicability. However, in recent years, a series of well-publicized examples (e.g., Carney, Cuddy, and Yap 2010; Bem 2011) coupled with theoretical work has made it clear that statistical significance can easily be obtained from pure noise . . . We propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures.

6. Confirmationist and falsificationist paradigms of science

I wrote about this a few years ago:

In confirmationist reasoning, a researcher starts with hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A.

In falsificationist reasoning, it is the researcher’s actual hypothesis A that is put to the test.

It is my impression that in the vast majority of cases, “statistical significance” is used in confirmationist way. To put it another way: the problem is not just with the p-value, it’s with the mistaken idea that falsifying a straw-man null hypothesis is evidence in favor of someone’s pet theory.

7. But what if we need to make an up-or-down decision?

This comes up a lot. I recommend accepting uncertainty, but what if it’s decision time—what to do?

How can the world function if the millions of scientific decisions currently made using statistical significance somehow have to be done another way? From that perspective, the suggestion to abandon statistical significance is like a recommendation that we all switch to eating organically-fed, free-range chicken. This might be a good idea for any of us individually or with small groups, but it would just be too expensive to do on a national scale. (I don’t know if that’s true when it comes to chicken farming; I’m just making a general analogy here.)

Regarding the economics, the point that we made in section 4.4 of our paper is that decisions are not currently made in an automatic way. Papers are reviewed by hand, one at a time.

As Peter Dorman puts it:

The most important determinants of the dispositive power of statistical evidence should be its quality (research design, aptness of measurement) and diversity. “Significance” addresses neither of these. Its worst effect is that, like a magician, it distracts us from what we should be paying most attention to.

To put it another way, there are two issues here: (a) the potential benefits of an automatic screening or decision rule, and (b) using a p-value (null-hypothesis tail area probability) for such a rule. We argue against using screening rules (or, to use them much less often). But in the cases where screening rules are desired, we see no reason to use p-values for this.

8. What should we do instead?

To start with, I think many research papers would be improved if all inferences were replaced by simple estimates and standard errors, with these standard errors not used to decide whether effects should be declared real, but just to give a sense of baseline uncertainty.

As Eric Loken and I put it:

Without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology.

For a couple more examples, consider the two studies discussed in section 2 of this article. For both of them, nothing is gained and much is lost by passing results through the statistical significance filter.

Again, the use of standard errors and uncertainty intervals is not just significance testing in another form. The point is to use these uncertainties as a way of contextualizing estimates, not to declare things as real or not.

The next step is to recognize multiplicity in your problem. Consider this paper, which contains many analyses but not a single p-value or even a confidence interval. We are able to assess uncertainty by displaying results from multiple polls. Yes, it is possible to have data with no structure at all—a simple comparison with no replications—and for these, I’d just display averages, variation, and some averages and uncertainties—but this is rare, as such simple comparisons are typically part of a stream of results in a larger research project.

One can and should continue with multilevel models and other statistical methods that allow more systematic partial pooling of information from different sources, but the secret weapon is a good start.

Plan

My current plan to write this all up as a long article, Unpacking the Statistical Significance Debate and the Replication Crisis, and put it on Arxiv. That could reach people who don’t feel like engaging with blogs.

In the meantime, I’d appreciate your comments and suggestions.

89 thoughts on “Thinking about “Abandon statistical significance,” p-values, etc.

  1. I absolutely think you should continue to boost David Salsburg’s article (says his nephew).

    One of the challenges is how to teach in a broad variety of fields when we know so many of the articles that grad students read use a p-value framework. If I had a dollar for every doc students without a hypothesis in research questions slip into null-hypothesis-testing language, … so part of this has to help new folks read the vast bulk framed in hypothesis-testing and p-value worship.

    • Binary mode of thinking undergirds debates on the Internet regardless of which statistical tools are used. There is even a predictable pattern to how debates go. Cass Sunstein has done a good job of elaborating on this topic. We are an argumentative happy culture. Then in my view the quality suffers.

    • David S. Salsburg was “Research Advisor, Department of Clinical Research, Pfizer Central Research” when he wrote that piece.

      As someone who is involved in bringing new pharmaceutical products to the market, he (somewhat unsurprisingly) finds that hypothesis tests invoke “too much conservaivism” and favours 75% intervals instead.

      • Michel:

        I think it’s ridiculous to make decisions about drug development and approval based on tail-area probabilities relative to straw-man null hypotheses. It’s just nuts. Given that the system exists, I can believe that people will adapt to it as best they can. But, stepping back a bit, the whole thing is ridiculous to me. I think a decision-analytic approach would make much more sense. Don Berry and others have written a lot about this.

  2. I would like nothing better than to read and collaborate in writing papers which adopt Andrew’s suggested approach of “…all inferences [were] replaced by simple estimates and standard errors, with these standard errors not used to decide whether effects should be declared real, but just to give a sense of baseline uncertainty.” I personally can’t even conceive that that becoming normative in the medium term, say the next 5-10 years. But it would be great.

    In the mean time, I think the best many of us can do is this. Write papers that would stand on their own as something much like the above-quoted paradigm and then add in the minimum amount of gratuitous “NHST” language and statistics necessary to get past the reviewers and gatekeepers and make it into print. My idea is a manuscript where you could take a marker and black out every p-value and every phrase referring to “statistical significance” and have a valid contribution to whatever field the research is in.

    • This is a useful exercise for everyone! Black out all p-values and references for p-values, then ask the reader if the conclusions still seem justified.

    • Brent:

      Yeah, like this paper where the reviewers forced us to count statistically significant results and so we were led, kicking and screaming, to include statistical significance in the first sentence of our Findings:

      Averaged across the ten project sites, we found that impact estimates for 30 of 40 outcomes were significant (95% uncertainty intervals [UIs] for these outcomes excluded zero) and favoured the project villages.

      I guess if we were really tough we could’ve refused and just withdrawn the paper from the journal, but we couldn’t quite bring ourselves to do that.

      • That’s a cool paper, it seems to meet the criteria I had in mind of having everything a reader could want or need (if she/he just ignores the statistical significance stuff).

        With respect, if that list of authors is for all intents forced into answering the “how many were significant” count right up front then there is indeed a long way to go to the promised land! I may be retired or dead before ever getting away with submitting a no-pvalue paper.

    • I’m worried that much of the keep-it-super-simple statistical advice being offered recently will make the field of Statistics appear to young people as being even more boring than it is already perceived to be. Of course data analysis, when labeled as computer or data science, is perceived to be super exciting. Nowadays, there is so much press about using mathematical methods to handle “Big Data” in an attempt to tease out complicated gene/environment interactions, mine through vast economic/financial data to make big money, understand complex neurological pathways to cure diseases, etc. etc. Sophisticated methods are all the rage such as “Deep Learning”, Bayesian stochastic partial differential equation modeling (SPDE), sophisticated artificial intelligence (AI) algorithms, and on and on. Young people, the really smart ones who are full of energy and eager to innovate, are drawn to academic programs in computer science, engineering, and data science with the hopes of making big impact. Grants in these fields abound, as this is where the future seems to be.

      Statistics, to some extent, seems to be going in the opposite direction. In the statistics realm, advice is given to analyze data with nothing more than “simple estimates and standard errors”. Innovation is squashed, as people much have vast leeway to create new algorithms, yet in Statistics such freedom to innovate is described as “Researcher degrees of freedom”, and dismissed as little more than an opportunity for scientists to make up whatever answers they want and effectively cheat. There are greater and greater pressures for statistical analyses to be mainly prescriptive, with preregistration of everything and with analyses conducted following predefined recipes using simple approaches, again as a way to limit researcher freedom which will hopefully result in findings that are “objective” and credible.

      Of course it’s exciting that such “mathematical analysis of data” (which was once, to some, the very definition of statistics) is growing and growing in popularity. It’s just that I find, in my work, that young people who are eager to enter such a field are more and more interested in the above mentioned comp sci-based degree programs, and less and less interested in anything labeled as “Statistics”.

  3. “Without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students.”

    You know what’s funny? If you ask the investigators if their study generalizes to all women, they would immediately back off and say “No, no, no! This is just what WE found in OUR sample!”

    Their p-values aren’t about generalizing or even really testing any hypotheses, they are thresholds for “legitimized discovery”. Ugh.

  4. I think p value is a symptom, not really a cause, of low quality research being conducted in many fields. If all scientists are properly trained in reasoning, I feel that this problem would go away naturally. Otherwise, getting rid of p values will just lead to a different malpractice.

  5. This made sound silly, but I think that catchy titles and memorable phrases are important in advancing this.

    I use your (or at least I think you came up with it?) turn of phrase “the difference between significant and not significant is not itself statistically significant” often, and I think people respond to it almost immediately or after a brief explanation. It also sticks in the mind. You seem to be pretty good at coming up with these — garden of forking paths is also useful.

    I’d love to see pithy shorthand for some of your other arguments too. I think it really is helpful in communicating.

  6. What I like most about this is how it’s structured with the way forward. I’ve learned from your papers on the best approach moving folks into Bayesian models. It reads like being invited into the conversation/way forward instead of being yelled at.

  7. I like the suggestion under 8: “To start with, I think many research papers would be improved if all inferences were replaced by simple estimates and standard errors, with these standard errors not used to decide whether effects should be declared real, but just to give a sense of baseline uncertainty.” From my perspective as a non-statistician, the real problem is not scientists who can be better trained but the general public that ultimately need to make decisions based on studies that are being touted often by interested parties. How do we give the public better information that it can actually understand about the weakness of certain results. I think that not giving them a bunch of statistical mumbo-gumbo is a good start, but they are going to be feed that nonsense anyway by industry or political interest groups. I think data simulations always struck me as a way to show that the results that look compelling can be nothing but random error in a form that is easy to understand. Also, why can’t more scatter plots be shown. I think that educated people don’t know what to do with a correlation coefficient or a p-value. They just assume that the expert must know what she is talking about, but they can see from the scatter plot that the study contains a lot of noise, and begin to get skeptical.

    • Also, why can’t more scatter plots be shown.

      This is one of my heuristics. If they claim x causes y in the title or abstract, there better be a scatter plot showing the relationship between x and y. If missing, throw it in the bin.

      I think usually they are not shown because they do not look at all impressive, or even nonsensical like this one:
      https://statmodeling.stat.columbia.edu/2019/03/20/retire-statistical-significance-the-discussion/#comment-998067

        • Why is this such a rare view? Or is it. I also wonder who is the source of this problem. You may be right that the researchers are hiding the nonsense that they are peddling, but I was involved in a litigation of behalf of a drug company where the New England Journal of Medicine wouldn’t let the researchers put in a graph that would have give readers more context about a potential side effect that the company was later accused of hiding. I think that there must be some bias against graphs. It is bizarre.

        • Why is this such a rare view?

          I don’t think whether or not to use a scatter plot even qualifies as a “view” for most researchers. They are just mechanically plugging numbers into stats software and putting the resulting plots/tables in their papers.

          That is what makes it such a great heuristic, anyone who fails to do that has also failed to do all sorts of other common sense things.

        • Yes, medical journals have stringent word limits and numbers of figures and tables. Page space is expensive and editors are quite miserly with it. In my experience, it is rarely the case that you can meet those requirements without leaving out something that is material and important.

          Nowadays, you can put all of the omitted material in an appendix that is available online, but does not appear in the printed journal. On the one hand this is fair. On the other hand, I think it is uncommon for readers to make the extra effort to seek out the supplementary information unless they have an extraordinary level of interest in the topic. The result is that the “article” is reduced to being a “teaser,” sometimes presenting a misleadingly abbreviated view of the study, and only the ultra-diligent get to find out what really happened.

  8. For pithiness, one of my favorites is quoted in Tufte’s Beautiful Evidence (attributed to a paper by Bernard Berelson), summarizing 1045 grand summary findings of human behavior:
    “1. Some do, some don’t
    2. The differences aren’t very great.
    3. It’s more complicated than that.”
    I think that is a good point for beginning with the right frame of mind.

  9. It’s really hard to embrace uncertainty when your editors and readers are asking for certainty…. especially when uncertainty is all you have to sell. p-magic generates statistical significance, which is taken as the statistical surrogate of certainty. Maybe if we were just required to call it p-magic or “a sprinkling of p-dust” we could keep using p-values.

    “The experiment showed a mean effect of 3 with a sd of 1.1. After the sprinkling of p-dust, one can be confident in my result.”

  10. I don’t think the Amrhein et al. want to get rid of p values, which do provide some useful information. Furthermore, one of their issues is treating a p=.04 rejection the same as a p=.00004 rejection of the NH. I guess the information p values provides is related to the degree of noise for effects.

  11. I am confused by your comparisons of stats to CS and engineering.

    It is true that there are arguments about whether program statements should be terminated by semicolons, or to have static or dynamic types, etc. These are questions about “how to synthesize productively as a collective of humans”, which is a human endeavor with subjective opinions and tastes. But when it comes to analysis, every computer scientist can look at any program and tell whether it’s correct or not.

    The same applies to engineering or applied math in general. Given our knowledge of physics, every competent engineer can analyze what happened to those Boeing planes. You have a (man-made) model and a bit of data (plane crash) and you try to determine what in the model lead to that crash (or whether the model matches the actual plane, e.g. due to a manufacturing flaw, etc.)

    Stats is different. You can have different background assumptions and use different data in statistical analysis. There’s no One Way(TM) to do it.

    • I think Andrew is referring to the difference in design philosophies of programming languages. There’s “worse is better” (C, C++), Pure Functional Programming (Haskell, ML), The Code is Data (LISP), Computer Programs Should Look Like Line-Noise (Perl) etc etc

    • I like the comparisons of stats to CS and engineering.

      In engineering (by the way I did product engineering in my early days) there are always trade offs being made (the one a head engineer of a petroleum producer challenged me on was possibly putting in too many redundancies in air compressor stations to light flare stacks.)

      In CS the programs will be used by others, programmed by others, require training for those doing that and or evaluated by others.

      I think that brings up the following issues

      The normative facet (N) specifies what statistical techniques can correctly be applied in a given situation; it is potentially informed by the full body of knowledge that is mathematical statistics.
      The descriptive facet (D) comprises knowledge of how people think about statistical concepts, what messages they receive when inspecting a statistical presentation, and their statistical misconceptions and biases.
      The prescriptive facet (P) comprises knowledge about how to achieve successful statistical communication and education.

      Descriptive: (1) Decisions people make; (2) How people decide. Normative: (1) Logically consistent decision procedures; (2) How people should decide. Prescriptive: (1) How to help people to make good decisions; (2) How to train people to make better decisions.

      https://pdfs.semanticscholar.org/dab3/f6246beb6e42e29dab81a0428e2b058d905d.pdf

    • Koray:

      My point is that statistics, like CS, has regular flamewars, among leaders in the field, regarding fundamental philosophical issues that affect practical decisions. Such flamewars also are part of the social sciences (psychology, econ, etc.) but I don’t see them in physics, chemistry, biology, and so forth. One might think that stat and CS, being close to pure math, would not have these flamewars, but they do—which suggests to me that there is a social science nature to stat and CS, having to do with the interface between humans and technology, in a way that isn’t so much the case in most of the natural sciences and engineering. (Yes, there are flamewars regarding climate change, but I don’t see these flamewars as being about the science; rather, these are political arguments that have been dragged into the scientific discourse, in the same way that there used to be flamewars in biology regarding creationism, which was again not central to the field in the way that the stat and CS flamewars are.)

      • Andrew,

        You are correct that CS has a lot of math as well as a lot of conventions, human concerns, etc. And in those non-math, subjective areas, there are a lot of debates (up to flamewars). But, these flamewars don’t really hurt anybody but the engineers themselves. You can choose the least popular technology and spend an absurd amount of time to develop your product with it. But, thanks to the math part of CS, we can all analyze the product, conclude that it works and use it happily. There are no more flamewars here.

        When there’s a flamewar in statistics, it’s about what is a justified inference, and we’re all affected, not just the statisticians or the scientists. It’s a philosophical (and sometimes PR) crisis.

  12. “To start with, I think many research papers would be improved if all inferences were replaced by simple estimates and standard errors, with these standard errors not used to decide whether effects should be declared real, but just to give a sense of baseline uncertainty.”

    We did this with the default output for rstanarm models a few years back, and I think it has been a good decision. Although I like the default plots better because you can’t get caught up on whether the MAD_SD is half the posterior median (but you could get caught up on whether zero is outside the 90% credible interval).

    But I think the framing of this recommendation lends too much credibility to what other software is doing. In all those studies from PNAS or similar, the estimated standard error that someone’s computer software printed out was an estimate of the standard deviation of the distribution that would result if the estimator were applied to different randomly sampled datasets *conditional on the point estimate to the immediate left being the true parameter*, which shouldn’t be conditioned on. The point estimate to the left isn’t the true parameter and can be the result of chasing noise, so the estimated standard error isn’t even a good estimator of the standard deviation of the sampling distribution. Even if it were, I wouldn’t characterize it as establishing a baseline level of uncertainty about the parameter being estimated.

    A posterior standard deviation is conceptually closer to providing this baseline level of uncertainty, but it is still conditional on the rest of the model for the generative process as well as the particular data values being used. So, it is really more of a lower bound to the baseline uncertainty you are getting at.

    • This is an amazing comment, but also very depressing. Everything feels kind of hopeless when stated in this way. But maybe we have to communicate the sheer hopelessness of what we are trying to do when we compute an SE.

  13. Number 6 leaps out at me because, to me, it formulates the essential difficulty that people have with statistics: the weird concept of proving or disproving a null. To me, it will not be possible to reform statistics unless that knot of ideas is better unraveled and better put. It isn’t easy.

  14. The American Economic Association has taken a baby-step towards retiring p-values: “In tables, please report standard errors in parentheses but do not use *s to report significance levels.” (https://www.aeaweb.org/journals/aer/submissions/guidelines) With the effect likely being that people calculate in their heads whether estimate > 2*s.e. … I would also expect this if research follows Andrew’s suggestion that “all inferences were replaced by simple estimates and standard errors, with these standard errors not used to decide whether effects should be declared real”.

    Generally, our ability to cognitively deal with bodies of research findings is a real constraint. As researchers, we need to know literally hundreds if not thousands of empirical results. If there’s some (imperfect) categorization of these results into true / false which takes n bits, say, then adding an associated uncertainty level will take something like 2n bits. If you can only store N bits, you’ll need to decide whether to store the uncertainty level for a result you’ve already stored as true / false, or whether you’d rather store a next result that is (imperfectly) classified as true / false. I don’t think the trade-off will always come out in favor of storing the uncertainty level.

    Hence, people will likely always look for ways to map whatever statistical data they have into a true / false categorization, whether that is done in the research paper or not. (Of course, because paper doesn’t have the same memory constraints as brains do, they should include more detailed information about the uncertainty etc.)

    • I agree here (and I think many others).

      The way forward seems to be to acknowledge what you raise but try to stay continuous as long as possible with any categorization only happening at the latest stage possible. Also to never block access to the continuous stuff, especially permanently (as with a study reporting only p > .05).

  15. I think you need to consider the reactions of researchers to whom statistical significance is currently a gateway through which they have to pass to publish. If statistical significance is replaced by some other test – informal or formal, yes/no or a continuous spectrum – that test will be under the same pressure. If the test is informal, researchers with strong personal ties to the gatekeepers will be tempted to use those ties to get their research through. If the test is formal, researchers will be tempted to game it. This is not just a question about mathematics or statistics: it is also a question about how people will react to a change in the rules of the game they are currently playing.

    In some cases, the pressure is exerted in the other direction. If the effective functional replacement for statistical significance demands a much higher level of mathematical and statistical knowledge, or expertise with specialized software, it will be easier for a large organization to brush off evidence that should suggest that it is guilty of polluting the environment, increasing climate change, endangering people’s health, or discriminating against some classes of people.

  16. > 3. Statistics and computer science as social sciences

    Yes!

    I was thinking about the same thing, but is a slightly different way.

    Statistics is not just mathematics, it is applied mathematics. We have to map our pure mathematical concepts into the real world, and we do it with philosophy. So when we argue about statistic we don’t argue about math, we argue about philosophy (math -> reality map).

  17. I asked this in the previous thread but didn’t get an answer, so I’ll ask again: How do you conduct GWAS research without p-values? What are the alternative approaches and what are their benefits compared to p-values?

    • Your question was answered here:

      1) I don’t assume research like that should be conducted.
      2) If it should be conducted, I gave an example: sort your p-values and pick the top 10 to investigate further.

      *I note since your original question now you have added a layer of confusion by thinking people are complaining about p-values when they are actually talking about statistical significance.

      Anyway, the way p-values are being used in those studies is no different than ranking by effect size. WIth sufficient sample size every single gene will be “significant”, so they just adjust the cutoff to whatever is necessary to get about the “right number” of “significant” genes. The “right number” is apparently a couple hundred.

      I even found a quote from the GWAS literature saying exactly that:

      Sequencing studies lead to an increased number of low-frequency (0.5% < MAF<5%) and rare (MAF < 0.5%) variants, arguing for a more stringent statistical threshold for association testing in studies utilizing sequence data.

      https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4970684/#bib1

      That said, perhaps no one should run a study like that to begin with. The example of a successful replication provided to me turned out to be very strange and not at all convincing. Personally, I would just plug everything into a CART or neural network and use all the information instead of arbitrarily dropping some. This makes more sense since the effect of one allele will depend on the presence of another, etc.

      The skill of the model developed that way would be assessed by checking it on new/other data.

      • I explained why your answer was inadequate. I don’t see what “layer of confusion” I’ve added. When I speak of p-values in the context of GWAS, I’m of course referring to the use of a particular p-value-related decision rule, that is, the significance threshold of 5 * 10^-8.

        I’m not going to redo the same debate with you, but I’ll note that a) your quote refers to WGS data, not SNP data; b) the omnigenic model is nonsense; c) the “right number” depends on the phenotype and can be thousands; and d) GWAS focuses on additive effects which capture the vast majority of genetic variance–gene interactions contribute little variance and are basically impossible to study anyway.

        I genuinely want someone to describe a realistic alternative to current practices, not just some handwaving about neural networks.

        • I’m sure I’ve helped some biologist friends with GWAS data, but I forget exactly how this technique works. Describe the basics of the technique and I’ll offer you an alternative analysis.

          But note, of all the things you could do with p values, filtering data of type X from a large pool of data mixing type X and type Y together is one of the legit uses. Anoneuoid’s point about setting the threshold to whatever in order to get the “right number” of genes, is just really a half-assed way to express your prior information: there should be a few, a couple tens, a couple hundreds, or whatever of genes involved in this process. Saying “the omnigenic model is nonsense” is just another way of saying “I have prior information about how many genes I’m interested in”

        • Never mind, wikipedia to the rescue. Let’s suppose you have some trait, like say addiction to alcohol, and you have some large number of people who you’ve analyzed for their SNPs, like say half a million people who did 23-and-me and reported alcohol addiction… (meh)

          First off, I gotta say, the neural network approach here is actually probably a good idea, because even if you don’t think “all genes are involved” it seems entirely likely something like: “there are 112 specific combinations of 35 different polymorphisms that will substantially increase the risk of alcohol addiction” and you have *zero* chance of understanding the 35000 dimensional space well enough to pull out these 112 combinations from your data using any kind of manual specification. You might get away with organizing your data using principle component analysis… but I digress, i’m just saying that Anoneuoid’s “handwaving about neural networks” isn’t handwaving, it’s good intuition that things are likely to be high dimensional and impossible to make progress by excessive reduction, and impossible to find combinations by hand, and obviously he hasn’t sat down to do this work for other people because, hey they should be doing it themselves.

          But that aside, perhaps there are a few SNPs that are very often involved in each of the 112 combinations… Like out of the 112 combinations 85 of them include the SNP named alr1 or something (I’m making this up, alr1 = alcohol risk 1 so if there’s something called that already ignore it.) So you’re interested in a small number of SNPs that by themselves substantially add risk.

          So, you run a logistic regression: alcadd ~ inverse_logit(f(snps)+c) and you need to specify the function f(snps). You start with a linear combination a[i] * snp[i] where snp[i] is either 0 if the person doesn’t have it and 1 if they do. Now you need to specify a prior for the a[i]… Your basic premise is you’re interested in sparse solutions, so you go read Aki’s paper on the modified horseshoe prior: https://arxiv.org/pdf/1707.01694.pdf and you express a prior over the sparsity (ie. that there are probably only around N SNPs involved, which sets the shape of the horseshoe). Run your regression, and come up with coefficients, now rank your coefficients in descending order and investigate the biological processes that each one is involved in…

        • Thanks, but that’s not what I asked. I asked for an alternative to the GWAS NHST paradigm, which exists to find genetic main effects. The genetic variance of just about all human traits can be mostly accounted for by a simple additive model involving numerous variants of small effect. Gene-gene interactions or large-effect variants, even if they exist, are not of interest here.

          Perhaps it’s just that despite frequent claims to the contrary, NHST works just fine if it’s used in a scrupulous manner. Finding genetic main effect seems to be one of those problems where NHST works well.

        • > I asked for an alternative to the GWAS NHST paradigm, which exists to find genetic main effects.

          And clearly you didn’t understand what I told you, because that’s exactly what you got.

        • Well, perhaps you could elaborate on your method, because this seems like a gene-gene interaction model to me:

          there are 112 specific combinations of 35 different polymorphisms that will substantially increase the risk of alcohol addiction

          The point of the additive model is that there are no “specific combinations” of any variants that are particularly important. Rather, the presence or absence of a given variant is what matters. This is why genetic relatedness is linearly related to phenotypic resemblance. How would your method perform compared to GWAS if the true causal model is say 10,000 small additive effects, using say, 1,000,0000 variable loci in 1,000,0000 people?

        • Matt, Daniel’s arrived at the same “main effects” model as is used in the usual p-value approach. He’s saying that instead of computing p-values you can use a multi-level model that treats the main effects as coming from a common distribution. Which distribution? One that induces sparsity, which is to say, its prior prediction is that the vast majority of the effects are close to zero and perhaps a few are large. The specific prior he recommends is the so-called Finnish horseshoe prior. After you’ve done the estimation, pick the effect that are large enough to be interesting or the top N or what have you.

        • you took my digression defending Anoneuoid’s idea as if it were my own actual idea.

          My own actual idea, how to replace your GWAS with something else, starts at “But that aside,…” and describes how to fit a Bayesian model of the kind of interaction you are explicitly asking for (a linear sum) where you have some idea about the number of “true causal effects” (your 10,000 effects number) compared to all the effects…. using a sparse horseshoe prior to express your knowledge about the order of magnitude of the number of “nonzero” coefficients in that model.

        • you took my digression defending Anoneuoid’s idea as if it were my own actual idea.

          Well, you said it was “probably a good idea” and the genetic architecture you use when discussing your preferred method is again highly unrealistic. If there are large effects, they have already been found, so they’re not of interest in GWAS. What is of interest is finding numerous small effects. The objective of GWAS is to find all true effects, i.e. those that replicate in independent samples. You know that you’ve found all of them when the variance that they collectively explain equals the narrow-sense heritability of the trait. The point, for me at least, is not to find large or “biologically interesting” effects but rather to statistically account for full heritability in molecular genetic terms.

          I don’t quite grasp your idea, but I would suggest you grab some data from the UK Biobank and show that your method significantly outperforms traditional GWAS and is not computationally too expensive. The resulting paper would be your most cited ever ;) (There may be some similar methods circulating about in genetics already, but they’re not popular.)

        • Suppose I have a friend Ernest “Magic” Jackson, he is a really useful person to have in your research group, because while he can’t tell you true model that describes the universe, if you tell him the model you’re interested in using, he will gaze into his crystal ball and write down the value of all the coefficients in your model to full 64 bit floating point precision. These will be the coefficients that minimize whatever your error measure is when measured across every single currently living human. There is no out of sample error for him.

          now, you tell him your model involving all 650,000 SNPs, and he writes down all the coefficients. Do you think even *one* of those 650,000 coefficients will be 0.0 ?

        • grab some data from the UK Biobank and show that your method significantly outperforms traditional GWAS and is not computationally too expensive

          Can you clarify this:

          1) Pick an exact dataset you would want this done on
          2) How is performance measured?
          3) What does “significantly outperform” mean?
          4) Can you link to some examples of traditional GWAS performance?
          5) What is “computationally too expensive”?

        • If your answer is “no” then, if it’s true that “The objective of GWAS is to find all true effects, i.e. those that replicate in independent samples” then we can stop here. just list all the SNPs.

          if the answer is yes, can you explain why there should be a whole bunch of 0.0 values to 64 bit precision (smallest 64 bit machine float bigger than 0 is about 2.2×10^-308, Magic Jackson is amazing, he’ll give you all that, because he calculates your error quantity in 4096 bit ultra high precision floats, so he’s not limited by machine epsilon…

        • When I speak of p-values in the context of GWAS, I’m of course referring to the use of a particular p-value-related decision rule, that is, the significance threshold of 5 * 10^-8.

          People use p-values for other purposes (eg, ranking) so there is no way to know this.

          the omnigenic model is nonsense;

          Did you explain this in the previous thread?

          I genuinely want someone to describe a realistic alternative to current practices, not just some handwaving about neural networks.

          What is unrealistic about it? What I was thinking of is very simple to do.

          I think you just want people to suggest another statistic to compare to an arbitrary threshold to tell you which correlations are “real” or not. It really doesn’t matter what you do then, there is no “right way” to do that.

        • Regarding the omnigenic model, see this paper: https://www.ncbi.nlm.nih.gov/pubmed/30150289 They recover just about all of the “chip heritability” for height using 20,000/650,000 = 3% of the available variants. Omnigenic shomnigenic! Note also that the model sums the SNP effects, indicating that you can forget about gene-gene interactions. (Their method is, of course, not an alternative to GWAS because it’s about polygenic prediction and does not give the effects of individual loci.)

        • I don’t have access to this paper, but are you trying to argue that because you can create a 20,000 dimensional model of height and 20,000 is a small fraction of 650,000 dimensions available to you, that somehow it disproves something Anoneuoid said?

          also:

          Note also that the model sums the SNP effects, indicating that you can forget about gene-gene interactions.

          not if you care about mechanism. It’s been known for years and years that prediction from a simple “improper” (ie. equal weighted) linear model is robustly successful at predicting lots of things across the board: https://www.cmu.edu/dietrich/sds/docs/dawes/the-robust-beauty-of-improper-linear-models-in-decision-making.pdf that doesn’t mean criminal recidivism works by criminals summing up their age, sex, race, income, previous criminal history, and relationship status and figuring out if it exceeds a threshold that they should go out and mug someone.

        • I don’t have access to this paper

          Ever heard of Sci-hub, old man?

          The omnigenic model claims that every gene in the genome affects every trait. Lello et al. show that you can recover the SNP heritability of height using just 3% of the available SNPs. That contradicts the omnigenic model.

          not if you care about mechanism

          I don’t care about mechanisms at all. If 20,000 variants affect a trait, only a fool will think that they will be able to build a mechanistic model of that. As I said in the previous thread, GWAS is about creating polygenic scores to be used for predictions and interventions.

        • >I don’t care about mechanisms at all.

          then you’re not doing science. problem solved!

          Look, there’s lots of good stuff to do in engineering, which is what you’re talking about, coming up with a way to arrange certain devices so you can get an economically useful outcome, like maybe a rule for who should get alcoholism counseling at age 18 or which people should get annual mammograms or whatever… but don’t confuse that with science, which is by definition trying to figure out the mechanism behind how things work.

          > only a fool will think that they will be able to build a mechanistic model of that

          Hell, a golf ball consists of 10^23 individual molecules each flying through 10^26th different air molecules, and each of these contributes equally and symmetrically to the outcome, no fool would touch that stuff.

        • “then you’re not doing science”

          That’s hardly fair; there’s a level of abstraction between mere statistical association/correlation/prediction and mechanistic modelling. It’s causal inference (whether done with Neyman-Rubin potential outcome or Pearl-style causal graphs) and it is useful for legitimate scientific investigations.

        • Corey, I was referring to “I don’t care about mechanisms at all.”

          There are plenty of ways to do investigation of scientific questions through abstraction other than reductionist molecular mechanism investigation, but if you literally don’t care about mechanism at all and just want a predictive tool for decision making… this is pure engineering, which is a valuable thing to do, it’s just not aimed at the same target.

        • “Ever heard of Sci-hub, old man?”

          1) Download/install: https://addons.mozilla.org/en-US/firefox/addon/redirector/

          2) Create new redirects:

          Description: scihub
          Example URL: https://doi.org/10.1086/288135
          Include pattern: https://doi.org/*
          Redirect to: https://sci-hub.tw/$1
          Pattern type: Wildcard

          Description: scihub2
          Example URL: https://dx.doi.org/10.1086/288135
          Include pattern: https://dx.doi.org/*
          Redirect to: https://sci-hub.tw/$1
          Pattern type: Wildcard

          Universal one click access is much better than institutional log in. The service saves dozens to hundreds of hours per year.

        • For those of us old enough to remember the good old 90’s and how the RIAA used to sue the parents of 8 year olds for their entire life savings after they illicitly downloaded a few tripey pop songs… we say keep your tripey pop sci.

        • You have a restrictive view of science, Daniel–much of medicine is not science per your definition, for example–but I don’t care if you call it science or engineering or whatever. When I say that I’m interested in using polygenic scores for prediction, I mean it in the sense of causal inference as described by Corey. A polygenic score that works only because of population stratification would not be of interest to me, for example.

        • They recover just about all of the “chip heritability”

          Without going into what exactly “chip heritability” means, you say “just about all”.

          Sounds like from the start you admit the remaining genes still have room to contribute then. No one said these correlations were not negligible… it is just that by raising the sample size you allow more and more negligible effects to pass into “significance” at the same threshold. With large enough sample size (or lax enough threshold) all the genes will eventually be included.

          So no, these results do not at all contradict “the omnigenic model” as you call it (to me it is just “everything correlates with everything else”).

          By skimming the paper I can see other problems with what you are saying as well (all these “effects” assume the model they use is correct, change the model and you’ll get different coefficients).

        • I’m interested in using polygenic scores for prediction

          If you were really only interested in this you would be using ML methods and the entire concern would be out-of-sample (not cross validated, but real unseen data) predictive skill. The method used is irrelevant if that is all you care about.

          I will easily beat any sort of statistical significance filter on that. Probably could do it in a couple hours if the data is clean.

  18. I’m currently reading uncertainty (Briggs 2016) and while he has many strong fundamental opinions that may be non-consensus, he does provide interesting solutions I haven’t seen in the discussion here, e.g. evaluation of conditional predictive probabilities of the model instead of international model parameter estimates. Can anyone comment on that?

    • Well, I guess no one else is going to speak up so I might as well. I don’t know that folks around here are as — hmm, need a label — let’s say, as “predictionist” as Briggs but it is certainly recommended to sanity-check priors via prior predictive simulations. I myself am pretty close to Briggs in philosophical outlook, a fact which distresses me on account of I don’t care for him personally.

  19. Next you will be telling us that juries should not be reduced to saying Guilty or Not Guilty. Or that colleges should not have to just admit or reject an applicant. Or that the FDA should not have to approve or reject a drug.

      • I disagree. With double-blind randomized controlled trials, (a main intended use for simple sig tests) proper uses of p-values manage to indicate promising drugs & block unwarranted claims about having ruled out expected variability. CIs are supplied as well. Primary endpoints are chosen, not in a context free manner, but to reflect the theory as to how a drug or factor works, and the most direct way it is thought the experiment can pick up on its effects. Hormone Replacement Therapy trials in 2002 showed HRT (Prempro) to have very small but genuine increased risks, rather than the supposed benefits,regarding heart disease, cancer, and much else in menopausal women–-the trials were stopped years in advance as a result. The data also revealed biases (healthy woman’s syndrome) overlooked by observational studies for many years, which led to improvements. If they had to wait until they had a distribution of effect sizes, women would not have been alert to these risks. (In 2013 the dramatic drop in breast cancer rates was attributed to the drop in HRT prescriptions among women of the relevant age group).

        The point null is mostly advocated by Bayes Factor advocates. One-sided tests are generally preferred (or 2 one-sided tests).

        While I advocate interpreting test results so as to indicate discrepancies (from a test hypothesis) that are well and poorly indicated, the simple significance test has its uses. You yourself use p-values to test models, do you not? Those test hypotheses must be of interest to you, enough to seek a revised model based on low p-values. Failures of replication are based on these simple tests, as are arguments about model violation and data too good to be true. The fact that biasing selection effects, multiple testing, data dredging and the like result in invalidating p-values is actually a major reason that the same tools serve in fraudbusting & uncovering QRPs. The fact that they’re a small part of a full error statistical methodology isn’t a good reason to banish the term or the tests

        • Deborah:

          There were three examples I was referring to and which I said should not be made based on a tail-area probability defined with respect to a straw-man null hypothesis of no interest. These three examples were jury decision, college acceptance, and drug approval. You discussed one of these decisions—drug approval—so I will respond to you on that one. Your discussion refers to an example of a study showing hormone replacement theory to have increased risks, rather than benefits, for the target population. I strongly agree with you that it’s a good idea to use data from controlled trials, where available, to inform decisions. But I don’t see what this has to do with p-values. I think this decision could be made much more directly in terms of expected outcomes.

          I do accept that a tool, when available, can be used for different purposes. In particular, I recognize that null hypothesis significance testing can in some settings be used for parameter estimation, and that uncertainty statements obtained from null hypothesis significance testing can be treated as probabilities and used in decision analysis. So, yes, the tool can work to solve real problems. But, to me, the reason why the tool works is not because of the hypothesis testing but because they are used to create estimates and uncertainties.

          It’s as if someone were needed a hammer and a chisel, and was using a couple of screwdrivers to serve both these functions. This can work (in problems where you don’t need a good hammer or a good chisel), but I wouldn’t take this as evidence that we need screwdrivers. This is not the best analogy because screwdrivers are useful in their own way, but maybe it will give some sense of my perspective here.

          P.S. As always, I hope that somebody is still reading this deep into the comment thread.

        • “So, yes, the tool can work to solve real problems. But, to me, the reason why the tool works is not because of the hypothesis testing but because they are used to create estimates and uncertainties.”
          Even this suffices to deny that the results of statistical significance testing are useless for scientific or practical purposes. That the testing result is built on to create other claims doesn’t mean the latter didn’t depend on the former, or that the latter goal alone should matter. Finding the genuine risk increase–with design-based probabilities–even restricted to the experimental population, & without estimating a distribution of effect sizes, is of value. It’s this type of information that gives an indication of what to do next to learn more.
          There are also uses of simple significance tests that don’t seem to go on to give estimates, e.g., discover an assumption of a fixed mean or variance is rejected. Would you describe your uses of p-values in testing assumptions as estimations?

        • Deborah:

          Again, there were three examples I was referring to and which I said should not be made based on a tail-area probability defined with respect to a straw-man null hypothesis of no interest. These three examples were jury decision, college acceptance, and drug approval. You discussed one of these decisions—drug approval—and in my comment just above I explained why I don’t think these decisions should be based on tail-area probabilities, while also saying why I understand that tail-area probabilities, if they are the only tools available, can be used to hack together a workable solution in particular problems. Again, I think p-values (or, for that matter, Bayes factors) are the wrong tool, and null hypothesis significance testing the wrong framework for making these decisions, but that they will be adapted, for better or worse, to problems at hand, if these are the only tools and frameworks readily available. Much of my career has been spent in an effort to make other, more flexible, tools and frameworks available in more settings.

    • I answered it, but I agree that a number of the questions were ambiguously worded. I always answer surveys, though, even those in which I can see that my answers will be misinterpreted; why should my responses be any less valuable than some unmotivated drone on Mechanical Turk?

  20. I’m not sure why you list “genetic association studies” in your parade of shame. Do you mean small sample candidate gene studies? Well-powered genome wide association studies (yes, using a Bonferroni-adjusted p value) have proven robust (at least in many cases) as evidenced by replicability and biological validation.

  21. From the end of the linked article:
    “Uniformity in statistical rules and processes makes it easier to compare like with like and avoid having some associations and effects be more privileged than others in unwarranted ways. Without clear rules for the analyses, science and policy may rely less on data and evidence and more on subjective opinions and interpretations.”

    I agree that “Uniformity in statistical rules and processes makes it easier to compare like with like and avoid having some associations and effects be more privileged than others in unwarranted way” — especially with the “Makes it easier” (if the rules indeed fit) — but I just don’t see how it would be possible to have “one size fits all” rules.

    “Without clear rules for the analyses, science and policy may rely less on data and evidence and more on subjective opinions and interpretations.” This view throws the baby out with the bath. What are needed are clear explanations/justifications of why the methods chosen are the best for the particular situation, combined with critical reading of these explanations/justifications. Rarely do papers include such explanations/justifications; instead, methods are often chosen just because “that’s the way we’ve always done it.”

  22. Aside.

    Computer science, CS, is not a social science. It’s an engineering discipline. This is confused because there’s a related discipline: ‘software engineering’, SE, too. Yet SE doesn’t really cover every aspect of software construction. For example, programming language disputes over functional vs. object oriented and strong vs. weak types don’t happen in SE. They’re found in computer language design circles (firmly part of CS, not SE). Such disputes can be resolved by gathering metrics (which a software engineer does). These are not ‘religious disputes’. In contrast, within climate science, disputes over models are disputes over the scientific method. These appear religious in nature because careers are at stake. Dare I suggest that careers are probably at stake in those ‘religious’ disputes you see in computer science?

Leave a Reply

Your email address will not be published. Required fields are marked *