“Inferential statistics as descriptive statistics”

Valentin Amrhein, David Trafimow, and Sander Greenland write:

Statistical inference often fails to replicate. One reason is that many results may be selected for drawing inference because some threshold of a statistic like the P-value was crossed, leading to biased reported effect sizes. Nonetheless, considerable non-replication is to be expected even without selective reporting, and generalizations from single studies are rarely if ever warranted. Honestly reported results must vary from replication to replication because of varying assumption violations and random variation; excessive agreement itself would suggest deeper problems, such as failure to publish results in conflict with group expectations or desires. A general perception of a “replication crisis” may thus reflect failure to recognize that statistical tests not only test hypotheses, but countless assumptions and the entire environment in which research takes place. Because of all the uncertain and unknown assumptions that underpin statistical inferences, we should treat inferential statistics as highly unstable local descriptions of relations between assumptions and data, rather than as generalizable inferences about hypotheses or models. And that means we should treat statistical results as being much more incomplete and uncertain than is currently the norm. Acknowledging this uncertainty could help reduce the allure of selective reporting: Since a small P-value could be large in a replication study, and a large P-value could be small, there is simply no need to selectively report studies based on statistical results. Rather than focusing our study reports on uncertain conclusions, we should thus focus on describing accurately how the study was conducted, what problems occurred, what data were obtained, what analysis methods were used and why, and what output those methods produced.

I think the title of their article, “Inferential statistics as descriptive statistics: there is no replication crisis if we don’t expect replication,” is too clever by half: Ultimately, we do want to be able to replicate our scientific findings. Yes, the “replication crisis” could be called an “overconfidence crisis” in that the expectation of high replication rates was itself a mistake—but that’s part of the point, that if findings are that hard to replicate, this is a problem for the world of science, for journals such as PNAS which routinely publish papers that make general claims on the basis of much less evidence than is claimed.

Anyway, I agree with just about all of this linked article except for my concern about the title.

27 thoughts on ““Inferential statistics as descriptive statistics””

  1. > Ultimately, we do want to be able to replicate our scientific findings.
    Yes, but replication in the group of studies conducted, not in the supposedly informed claims made in each study separately over time.

    I understand the abstract as mainly being about forgoing the dance of individual study claims on their own in favor of an unfolding joint analysis of all of them, one that avoids all the distracting and unnecessary drama.

  2. Having just reviewed somewhere between 50 and 100 epidemiological studies of nutritional elements as risk factors for prostate cancer, I agree with much but not all of this extracted text by Amrhein et al., except that I think replication is a reasonable objective. For me, after reading papers mostly reporting a 95% CI, wherein about half conclude a factor is not significant and thus should be ignored, and somewhere near half say it is significant, mostly with little or no interpretation or discussion of the underlying biology/pathology, and almost all based on a P<.05 (i.e., 95%) CI, and often the actual P value is not presented, and it is almost never considered that these papers make a plethora of multiple comparisons, it is difficult to find the utility of these efforts as applicable to PC patients. These authors seem to think that P<.05 vs. P<.06 is an important distinction to a cancer patient. My bigger issue is, given the importance of the issues examined in these epi papers, how should these results be described, articulated, and presented? The last sentence of the above extracted text is fine, except it begs the question of how the interpretation of such results should be accomplished.
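
    As a concrete illustration of the multiple-comparisons point above, a minimal simulation sketch in Python; the 40 comparisons and sample sizes are invented for illustration, not taken from any of the reviewed studies:

```python
# Illustrative simulation (invented numbers, not from the reviewed studies):
# if a paper reports 40 risk-factor comparisons and every true effect is null,
# roughly 40 * 0.05 = 2 of them will still cross P < .05 by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_comparisons, n_per_group = 40, 200

false_positives = 0
for _ in range(n_comparisons):
    # two groups drawn from the same distribution, i.e. no real effect
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_comparisons} null comparisons were 'significant' at P < .05")
```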

    • “often the actual P value is not presented, and it is almost never considered that these papers make a plethora of multiple comparisons….”

      This problem has nothing to do with the method; it’s at best poor publishing practice and possibly sloppy work. If NHST is the method for determining significance, then the data generated for that determination should be published, along with all the other comparisons generated during the research.

    • Re “how the interpretation of such results should be accomplished”:
      Did you read the entire published article at http://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1543137 ? Because I think we did attempt to answer that question, as did several other papers in that special issue, including McShane et al.,
      https://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1527253
      and, more narrowly, mine,
      http://www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625
      Then I and others discussed more recommendations in a recent NISS webinar; the talks can be played back and the slides are downloadable at
      https://www.niss.org/news/digging-deeper-radical-reasoned-p-value-alternatives-offered-experts-niss-webinar

      – I’d be curious whether, after reading all that, you still have questions about how the authors think we should interpret individual study results. Until then, my general loose guideline is “if you can at all avoid it, don’t interpret individual study results”. Instead, interpret the entire literature providing information on the general research question you are pursuing (which is largely composed of individual studies in multiple fields).

      Focusing on “interpretation” of single studies (beyond the extraction of information relevant to the motivating question) is in my view toxic to getting valid general answers. Yet it is demanded at every step of modern research from proposal to publication. Meeting that demand is the cause of “the replication crisis”, and statistics long catered to it by pushing automated decision rules (e.g., P<0.05, a fallacy denounced by both Karl Pearson and R.A. Fisher).

      Today at least some of the statistics profession is backing off those destructive teachings. But those teachings are still being pushed by those who have imposed them on hapless authors (e.g., JAMA), and who in doing so warped reporting to serve ends other than impartial evaluation and transmission of evidence.

      • Sander said,
        “Focusing on “interpretation” of single studies (beyond the extraction of information relevant to the motivating question) is in my view toxic to getting valid general answers. Yet it is demanded at every step of modern research from proposal to publication. Meeting that demand is the cause of “the replication crisis”, and statistics long catered to it by pushing automated decision rules (e.g., P<0.05, a fallacy denounced by both Karl Pearson and R.A. Fisher).

        Today at least some of the statistics profession is backing off those destructive teachings. But those teachings are still being pushed by those who have imposed them on hapless authors (e.g., JAMA), and who in doing so warped reporting to serve ends other than impartial evaluation and transmission of evidence.”

        +1

        • Martha:

          It seems to be an area where, even among most statisticians, there is a reluctance to acknowledge the usual lack of scientific value of a single study on its own. It’s almost like an aversion to admitting a type of impotence.

          Now, when I taught intro stat courses at Duke in 2007/8, I included a couple of lectures on _investigating_ and combining studies (meta-analysis).

          It seemed to be one of the topics the students found not too hard. In one example, the first p-value was .03, followed by studies whose combined p-value was .30; in another, the first p-value was .30, followed by studies whose combined p-value was .03 (a toy version of this kind of combining is sketched below).

          I have not heard of anyone else doing this (yes, there is often a meta-analysis example in the hierarchical-modelling topic, but all the studies seem to “arrive” at once). Do you know if anyone has put such stuff in an intro stats book?
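
          A toy version of the combining exercise described above, using Fisher’s method for pooling p-values across studies; all the p-values below are invented for illustration and are not from any real studies:

```python
# Toy version of the classroom examples above, combining study p-values with
# Fisher's method (all p-values below are invented for illustration).
import numpy as np
from scipy import stats

def fisher_combined_p(pvalues):
    """Fisher's method: -2 * sum(log p) ~ chi-square with 2k df under the nulls."""
    pvalues = np.asarray(pvalues, dtype=float)
    statistic = -2.0 * np.sum(np.log(pvalues))
    return stats.chi2.sf(statistic, df=2 * len(pvalues))

# An exciting first study whose follow-ups don't hold up:
print(fisher_combined_p([0.03]))                    # 0.03 on its own
print(fisher_combined_p([0.03, 0.60, 0.70, 0.80]))  # combined: ~0.3, much weaker

# A dull first study whose follow-ups accumulate into real evidence:
print(fisher_combined_p([0.30]))                    # 0.30 on its own
print(fisher_combined_p([0.30, 0.08, 0.10, 0.12]))  # combined: ~0.04, much stronger
```

          (scipy.stats.combine_pvalues implements this and several related methods directly.)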

        • I think it also depends a lot on how studies are done and what lab did them. In general I think you can bank on studies done by certain people. They find a phenomenon, then they replicate it, then they teach the technique to someone else in the lab and that person replicates it, then they work out whether it’s a measurement artifact, then they ask what factors might moderate it, like mouse strain, and they replicate it in different strains, then they ask what underlying causes might be there and test four or five hypotheses, then after one of them pans out they replicate that under different conditions… And then after two to five years of work… They publish a paper laying out the case.

          This is actually the way science is supposed to work. It’s explicitly there in Feynman’s Cargo Cult Science speech. But the publish-or-perish and grant-funding system we have today incentivizes people to publish a lot of sloppy, poorly thought out, flashy work. You can argue that the solution is meta-analysis, but I don’t think that’s a solution; I think that’s a useful countermeasure to what is essentially scientific pollution. The solution is to stop the pollution, and that’s not easy. It’s as hard as the general reboot needed to undermine the prevalence of endless copyright, submarine patents, tariffs, subsidies, war-making for profit, for-profit prisons, financial and other regulatory capture, government earmarks for government contractors, stockpiling ineffective drugs, hard-selling injurious and addictive drugs, and other kinds of rent-seeking in the broad market. It’s become terrible.

          When we realize that the purpose of most of these publications is to extract money from the government to further careers without regard to scientific truth, and that they are unreliable on their own because there is very little desire by the authors to do what is necessary to make them reliable… things become a lot clearer. Meta-analysis is like distilling your water because the water utility is actively putting lead in it: they own a water-filter manufacturing plant and can sell a lot more filters. It works, but it’s not a good situation.

        • > Meta analysis is like distilling your water because the water utility is actively putting lead
          Actually, more like they sometimes put unknown stuff in it and might be hiding that, or at least obscuring it, in the paper.

          But I guess the single-study fetish that most statisticians seem to have plugs comfortably into that process.

        • Sure, the point is they’re putting some crap in specifically because they know they benefit from doing it (sell more filters)…

          I don’t like the single-study fetish either; what I think we need is all of these things:

          1) Studies that are done to the kind of standards I mentioned above, where replication is already done as a routine thing within the lab and across multiple populations, with careful measurements, and with possible confounding effects tested scientifically, etc.

          2) Open data should be as much a requirement as open access to journal articles is for US govt-funded health research today, and meta-analysis of open data sets should be a whole industry.

          3) Random audits of grants by third-party labs, at, say, about a 10% probability. The Wansinks of the world would never survive that for long.

        • Keith said, “Do you know if anyone has put such stuff in an intro stats book?”

          I don’t recall seeing an intro stats book that mentions meta-analysis (but I haven’t looked at all intro stats books; there are oodles of them, and a lot give reason to be rejected after a short glance).

        • Well at least there’s this (hopefully honorable) mention:
          Greenland S, O’Rourke K (2008). Meta-analysis. Ch. 33 in Rothman, Greenland, Lash (eds). Modern Epidemiology, 3rd ed.
          …but I would add that for serious risk analyses, bias analysis (Ch. 19 of the same book) should play a bigger role than was indicated there. A problem, however, is that bias analysis for meta-analysis was and still is not a fully developed, settled topic, and the abuse potential is enormous.

        • P.S. We published some bias analysis guidelines here:
          Lash, T.L., Fox, M.P., Maclehose, R.F., Maldonado, G.M., McCandless, L.C., Greenland, S. (2014). Good practices for quantitative bias analysis. International Journal of Epidemiology, 43, 1969-1985.

      • For my part, I’m stuck on the sidelines of this epic battle to change the entire research industry. It’s true I dream of the day when I can work on analyses meant to be published under an enlightened view as espoused by Amrhein, Trafimow and Greenland. So that’s the team I’m “rooting for”, so to speak.

        But then again, my career has less than a decade remaining, and I’ve got to say that seems much too short a period in which to expect that kind of paradigm shift. Here’s my dilemma: when every publication is forced to “interpret” its own results using “decision rules” like p<0.05, I simply can’t see much room for me to push back on the prevailing practices.

        What I can’t see is any principled way to push back while still getting papers past reviewers who feel duty-bound to enforce the p-value orthodoxy. There’s just not a lot of room to meet in the middle, from my perspective. The paper we’re discussing does attempt concrete suggestions about wording and so forth, but if we’re forced to go along with culling the results of our Forking Paths based on p<0.05 in order to reach publication, no improvement in wording is going to change the nature of the process.

        • Brent: Just because you may be required to publish p values and so forth doesn’t mean you are required to fool *yourself* into thinking your research means something on the basis of those p values.

          In my opinion, those who want to do good research should do the analyses that good research practices require. If you do, for example, a variety of Bayesian analyses involving custom-built models that include effects like measurement error and variation in effect sizes across subpopulations, and you provide an argument for your choice of priors, and then you publish posterior probability estimates instead of p values, lots of people probably won’t even know that what you’re doing is different, but YOU will (a toy sketch of this kind of model follows below).

          If you are somehow forced to provide p values in addition to that stuff, I don’t think it has nearly as harmful consequences as if you had just done some simple default modeling and taken small p values to mean “an effect exists and is important”. That’s where the real damage is done: people fooling themselves.
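
          A minimal sketch, in PyMC, of the kind of model described above (subgroup-varying effects plus measurement error, summarized by a posterior probability instead of a p-value). The tool choice, priors, simulated data, and the 0.5 measurement-error SD are all illustrative assumptions, not anything from the thread:

```python
# A sketch (not anyone's actual workflow) of a Bayesian model with
# subgroup-varying effects and measurement error, reported as a posterior
# probability rather than a p-value. All numbers here are made up.
import numpy as np
import pymc as pm

rng = np.random.default_rng(7)
J = 4                                    # hypothetical subpopulations
group = np.repeat(np.arange(J), 30)
true_effects = rng.normal(0.3, 0.2, J)
latent = rng.normal(true_effects[group], 1.0)       # true outcomes
y_obs = latent + rng.normal(0.0, 0.5, group.size)   # plus measurement error

with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0)                      # average effect
    tau = pm.HalfNormal("tau", 0.5)                     # between-subgroup spread
    theta = pm.Normal("theta", mu, tau, shape=J)        # subgroup effects
    y_true = pm.Normal("y_true", theta[group], 1.0, shape=group.size)
    pm.Normal("y", y_true, 0.5, observed=y_obs)         # assumed measurement SD 0.5
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=7)

# Report a posterior probability instead of a significance verdict:
mu_draws = idata.posterior["mu"].values.ravel()
print("P(average effect > 0 | data, model) =", (mu_draws > 0).mean())
```

          The point is only that the reported quantity is a posterior probability under an explicitly stated model, not a yes/no significance verdict.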

  3. The replication crisis exists in fields in which the practitioners perform experiments that are intentionally set up – or in some cases prosecuted – to be most likely to produce the desired result. In fields where no replication crisis has been noted, say spectroscopy or biochemistry, the scientists test their ideas in settings least likely to produce the desired result. Because if you don’t, your rival will.

    This article is yet another that attacks the way statistical inference is used, rather than being about what it can do when properly prosecuted. So, ironically enough, the generalizations are unwarranted, IMO.

    • +1!

      I like a lot of what’s in the text above, but I agree that it’s relevant mostly to the social sciences and not to problems in the medical sciences, where work is much more tightly controlled to begin with.

      So how does one characterize the boundary (between controlled experiments testing a single effect and wild experiments subject to gazillions of unrecognized assumptions) in a single bullet point?

  4. I’ve been meaning to post this for some time, a quote from a 1975 American Statistician paper by W. Edwards Deming (“On probability as a basis for action”): “Little advancement in the teaching of statistics is possible, and little hope for statistical methods to be useful in the frightful problems that face man today, until the literature and classroom be rid of terms so deadening to scientific enquiry as null hypothesis [and] level of significance for comparison of treatments.”

  5. Amrhein, Trafimow, and Greenland should devote much more detail to things beyond p-values (Bayes factors, posterior probabilities, and the many other quantities people can compute) instead of just giving them a few sentences; otherwise it lends itself to the misleading idea that this is just a p-value problem, which it is not. The ‘dance of the p-values’ and ‘dance of the confidence intervals’ idea applies to everything else as well.

    They recommend to:

    “Not use the words “significant” or “confidence” to describe
    scientific results, as they imply an inappropriate level of
    certainty based on an arbitrary criterion, and have produced
    far too much confusion between statistical, scientific,
    and policy meanings.”

    I read things like this a lot, but it is wrong. First, an interval, by virtue of being an interval rather than a point estimate, conveys an appropriate level of uncertainty (as do the Type I/II error rates that are often discussed), not an inappropriate level of certainty. Also, replication and meta-analysis nicely nip overcertainty concerns in the bud, for the most part, in my opinion.

    Second, alpha is not, or should not be, arbitrary, but should instead be based on the cost of making a Type I error, the sample size, and any adjustments for multiple testing. And even alpha=.05 is not totally, completely “arbitrary”. Fisher basically said it was convenient, resulted in a z-score of about 2, and made the tables in his books (in pre-computer times) easier. But, and this is the important part, Fisher knew and wrote that it roughly corresponded to earlier scientific conventions of using the probable error (PE) instead of the standard deviation (SD). The PE is the deviation on either side of the central tendency (say, a mean) such that 50% of the observations fall in that range. Galton wrote about Q, the semi-interquartile range, defined as (Q3-Q1)/2, which is the PE, where Q3 is the 75th percentile and Q1 is the 25th percentile. For a normal distribution, PE ~ (2/3)*SD; written another way, 3PE ~ 2SD (a z-score of about 2). The notion that observations 3PE or more away from the mean are very improbable, and hence “statistically significant”, was used in essence by De Moivre, Quetelet, Galton, Karl Pearson, Gosset, Fisher, and others, and represents the accumulated experience of statisticians and scientists, not just a plucking out of thin air. See “On the Origins of the .05 Level of Statistical Significance” by Cowles and Davis.
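
    A quick numeric check of the probable-error arithmetic above, for a standard normal distribution:

```python
# Quick check of the probable-error arithmetic for a standard normal.
from scipy import stats

PE = stats.norm.ppf(0.75)        # probable error in SD units: ~0.6745, i.e. PE ~ (2/3)*SD
print(PE)
print(3 * PE)                    # ~2.02, i.e. 3*PE ~ 2*SD (a z-score of about 2)
print(2 * stats.norm.sf(3 * PE)) # two-sided tail area beyond 3*PE: ~0.043, close to .05
```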

    Last, equivalence testing, for example, with its smallest effect size of interest (SESOI), can help make the difference between statistical and practical significance clearer. Again, the correct use of p-values and more hypothesis testing is a possible solution, not less of these things plus the use of unproven alternative methods with barely discussed drawbacks.

    For example, I don’t see any real advantage in reporting -log_2(p-value). We currently already go from raw data, to summaries like means and SDs, to standardized values like z-scores, and finally to p-values. Now we add another step and look at a transformation of the p-value? Probability is already a fairly natural scale, and small p-values already correspond to large values of your test statistic. If you want something intuitive and on an understandable scale, just use the observed data. I’m also not sure that reporting bits is so intuitive. I read that winning the lottery is about 24 bits of surprisal, but that is not as intuitive to me as a really, really small probability (“1 chance in X”). I read that writing 24 is more manageable than writing out a really, really small probability… but we can just write really, really small probabilities using scientific notation (the transform itself is a one-liner; see the snippet after this comment).

    Justin
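
    For reference, the surprisal transform discussed in the comment above is a one-liner; the lottery figure assumes the classic 6-of-49 jackpot odds of 1 in 13,983,816:

```python
# The surprisal ("S-value") transform discussed above: S = -log2(p).
import math

def surprisal_bits(p):
    return -math.log2(p)

print(surprisal_bits(0.05))            # ~4.3 bits, roughly 4 heads in a row from a fair coin
print(surprisal_bits(0.005))           # ~7.6 bits
print(surprisal_bits(1 / 13_983_816))  # ~23.7 bits for 6-of-49 lottery odds, the "24 bits" figure
```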

    • Also, consider that the journal Basic and Applied Social Psychology (BASP) banned the use of significance testing in all its submissions in 2015. Did science improve? Ricker et al., in “Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban”, write:

      “In this article, we assess the 31 articles published in Basic and Applied Social Psychology (BASP) in 2016, which is one full year after the BASP editors banned the use of inferential statistics…. We found multiple instances of authors overstating conclusions beyond what the data would support if statistical significance had been considered. Readers would be largely unable to recognize this because the necessary information to do so was not readily available.”

      They were OVER-stating conclusions, OVER-stating certainty, not being more uncertain, as it is claimed would happen when there is no formal inference using fixed alpha cutoffs, for example.

      Also see https://daniellakens.blogspot.com/2016/02/so-you-banned-p-values-hows-that.html by Lakens

      Justin
