“Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science”

As promised, let’s continue yesterday’s discussion of Christopher Tong’s article, “Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science.”

First, the title, which makes an excellent point. It can be valuable to think about measurement, comparison, and variation, even if commonly-used statistical methods can mislead.

This reminds me of the idea in decision analysis that the most important thing is not the solution of the decision tree but rather what you decide to put in the tree in the first place, or even, stepping back, what are your goals. The idea is that the threat of decision analysis is more powerful than its execution (as Chrissy Hesse might say): the decision-analytic thinking pushes you to think about costs and uncertainties and alternatives and opportunity costs, and that’s all valuable even if you never get around to performing the formal analysis. Similarly, I take Tong’s point that statistical thinking motivates you to consider design, data quality, bias, variance, conditioning, causal inference, and other concerns that will be relevant, whether or not they all go into a formal analysis.

That said, I have one concern, which is that “the threat is more powerful than the execution” only works if the threat is plausible. If you rule out the possibility of the execution, then the threat is empty. Similarly, while I understand the appeal of “Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science,” I think this might be good static advice, applicable right now, but not good dynamic advice: if we do away with statistical inference entirely (except in the very rare cases when no external assumptions are required to perform statistical modeling), then there may be less of a sense of the need for statistical thinking.

Overall, though, I agree with Tong’s message, and I think everybody should read his article.

Now let me go through some points where I disagree, or where I feel I can add something.

– Tong discusses “exploratory versus confirmatory analysis.” I prefer to think of exploratory and confirmatory analysis as two aspects of the same thing. (See also here.)

In short: exploratory data analysis is all about learning the unexpected. This is relative to “the expected,” that is, some existing model. So, exploratory data analysis is most effective when done in the context of sophisticated models. Conversely, exploratory data analysis is a sort of safety valve that can catch problems with your model, thus making confirmatory data analysis more effective.
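
To make this concrete, here is a minimal sketch of exploration relative to a model; the Poisson model, the simulated counts, and the variance/mean statistic are arbitrary illustrative choices, not anything from Tong’s paper:

```python
# Exploration relative to a fitted model: simulate replicated data from the
# model and ask whether a feature of the observed data looks surprising under
# it (here, overdispersion that a Poisson model cannot capture).
import numpy as np

rng = np.random.default_rng(0)
y = rng.negative_binomial(2, 0.2, size=200)        # hypothetical overdispersed counts

lam_hat = y.mean()                                  # fitted Poisson rate
y_rep = rng.poisson(lam_hat, size=(1000, y.size))   # replicated data under the model

def dispersion(d):
    d = np.asarray(d, dtype=float)
    return d.var(axis=-1) / d.mean(axis=-1)         # variance/mean ratio

print("observed dispersion:", dispersion(y))
print("replicated 95% interval:", np.percentile(dispersion(y_rep), [2.5, 97.5]))
# An observed value far outside the replicated interval is the "safety valve"
# flagging a problem with the model.
```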

Here, I think of “confirmatory data analysis” not as significance testing and the rejection of straw-man null hypotheses, but rather as inference conditional on models of substantive interest.

– Tong:

There is, of course, one arena of science where the exploratory/confirmatory distinction is clearly made, and attitudes toward statistical inferences are sound: the phased experimentation of medical clinical trials.

I think this is a bit optimistic, for two reasons. First, I doubt the uncertainty in exploratory, pre-clinical analyses is correctly handled when it comes time to make decisions in designing clinical trials. Second, I don’t see statistical significance thresholds in clinical trials as being appropriate for deciding drug approval.

– Tong:

Medicine is a conservative science and behavior usually does not change on the basis of one study.

Sure, but the flip side of formal conservatism is that lots of informal decisions will be made based on noisy data. Waiting for conclusive results from a series of studies . . . that’s fine, but in the meantime, decisions need to be made, and are being made, every day. This is related to the Chestertonian principle that extreme skepticism is a form of credulity.

– Tong quotes Freedman (1995):

I wish we could learn to look at the data more directly, without the fictional models and priors. On the same wish list: We should stop pretending to fix bad designs and inadequate measurements by modeling.

I have no problem with this statement as literally construed: it represents someone’s wish. But to the extent it is taken as a prescription or recommendation for action, I have problems with it. First, in many cases it’s essentially impossible to look at the data without “fictional models.” For example, suppose you are doing a psychiatric study of depression: “the data” will strongly depend on whatever “fictional models” are used to construct the depression instrument. Similarly for studies of economic statistics, climate reconstruction, etc. I strongly do believe that looking at the data is important—indeed, I’m on record as saying I don’t believe statistical claims when their connection to the data is unclear—but, rather than wishing we could look at the data without models (just about all of which are “fictional”), I’d prefer to look at the data alongside, and informed by, our models.

Regarding the second wish (“stop pretending to fix bad designs and inadequate measurements by modeling”), I guess I might agree with this sentiment, depending on what is meant by “pretend” and “fix”—but I do think it’s a good idea to adjust bad designs and inadequate measurements by modeling. Indeed, if you look carefully, all designs are bad and all measurements are inadequate, so we should adjust as well as we can.

To paraphrase Bill James, the alternative to “inference using adjustment” is not “no inference,” it’s “inference not using adjustment.” Or, to put it in specific terms, if people don’t use methods such as our survey adjustment here, they’ll just use something cruder. I wouldn’t want criticism of the real flaws of useful models to be taken as a motivation for using worse models.

– Tong quotes Feller (1969):

The purpose of statistics in laboratories should be to save labor, time, and expense by efficient experimental designs.

Design is one purpose of statistics in laboratories, but I wouldn’t say it’s the purpose of statistics in laboratories. In addition to design, there’s analysis. A good design can be made even more effective with a good analysis. And, conversely, the existence of a good analysis can motivate a more effective design. This is not a new point; it dates back at least to split-plot, fractional factorial, and other complex designs in classical statistics.
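
As a toy illustration of that interplay (the factors here are hypothetical and this is only a sketch), a 2^(3-1) fractional factorial gets by with four runs instead of eight precisely because of an assumption made in the planned analysis:

```python
import itertools
import numpy as np

# A 2^(3-1) fractional factorial: three two-level factors in four runs instead
# of eight. Factor C is aliased with the A*B interaction, which is acceptable
# only if the analysis model treats that interaction as negligible; the planned
# analysis is what justifies the smaller design.
ab = np.array(list(itertools.product([-1, 1], repeat=2)))   # full factorial in A, B
design = np.column_stack([ab, ab[:, 0] * ab[:, 1]])         # generator: C = A*B
print(design)
# [[-1 -1  1]
#  [-1  1 -1]
#  [ 1 -1 -1]
#  [ 1  1  1]]
```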

– Tong quotes Mallows (1983):

A good descriptive technique should be appropriate for its purpose; effective as a mode of communication, accurate, complete, and resistant.

I agree, except possibly for the word “complete.” In complex problems, it can be asking too much to expect any single technique to give the whole picture.

– Tong writes:

Formal statistical inference may only be used in a confirmatory setting where the study design and statistical analysis plan are specified prior to data collection, and adhered to during and after it.

I get what he’s saying, but this just pushes the problem back, no? Take a field such as survey sampling where formal statistical inference is useful: for obtaining standard errors (which give underestimates of total survey error, but an underestimate can still be useful as a starting point), for adjusting for nonresponse (this is a huge issue in any polling), and for small-area estimation (as here). It’s fair for Tong to say that all this is exploratory, not confirmatory. These formal tools are still useful, though. So I think it’s important to recognize that “exploratory statistics” is not just looking at raw data; it also can include all sorts of statistical analysis that is, in turn, relevant for real decision making.
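
As a deliberately oversimplified sketch of the nonresponse-adjustment point, here is a toy poststratification; the age cells, population shares, sample sizes, and outcome means are all made up, and this is much cruder than the regression-based adjustments in the linked papers:

```python
import numpy as np

# cells: 18-34, 35-64, 65+
pop_share = np.array([0.30, 0.50, 0.20])      # known population proportions
cell_mean = np.array([0.62, 0.55, 0.40])      # outcome mean within each sampled cell
cell_n = np.array([50, 120, 200])             # the sample over-represents older people

raw = np.average(cell_mean, weights=cell_n)   # unadjusted sample estimate
adjusted = np.sum(pop_share * cell_mean)      # poststratified estimate
print(f"raw: {raw:.3f}, adjusted: {adjusted:.3f}")
```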

– Tong writes:

A counterargument to our position is that inferential statistics (p-values, confidence intervals, Bayes factors, and so on) could still be used, but considered as just elaborate descriptive statistics, without inferential implications (e.g., Berry 2016, Lew 2016). We do not find this a compelling way to salvage the machinery of statistical inference. Divorced from the probability claims attached to such quantities (confidence levels, nominal Type I errors, and so on), there is no longer any reason to privilege such quantities over descriptive statistics that more directly characterize the data at hand.

I’ll just say, it depends on the context. Again, in survey research, there are good empirical and theoretical reasons for model-based adjustment as an alternative to just looking at the raw data. I do want to see the data, but if I want to learn about the population, I will do my best to adjust for known problems with the sample. I won’t just say that, because my models aren’t perfect, I shouldn’t use them at all.

To put it another way, I agree with Tong that there’s no reason to privilege such quantities as “p-values, confidence intervals, Bayes factors, . . . confidence levels, nominal Type I errors, and so on,” but I wouldn’t take this as a reason to throw away “the machinery of statistical inference.” Statistical inference gives us all sorts of useful estimates and data adjustments. Please don’t restrict “statistical inference” to the particular tools listed in the paragraph above!

– Tong writes:

A second counterargument is that, as George Box (1999) reminded us, “All models are wrong, but some are useful.” Statistical inferences may be biased per the Optimism Principle, but they are reasonably approximate (it might be claimed), and paraphrasing John Tukey (1962), we are concerned with approximate answers to the right questions, not exact answers to the wrong ones. This line of thinking also fails to be compelling, because we cannot safely estimate how large such approximation errors can be.

I think the secret weapon is helpful here. You can use inferences as they come up, but it’s hard to interpret them one at a time. Much better to see a series of estimates as they vary over space or time, as that’s the right “denominator” (as we used to say in the context of classical Anova) for comparison.
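
Here is a minimal sketch of the secret weapon in that spirit; the yearly estimates and standard errors are simulated, purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
years = np.arange(2000, 2020)
true_effect = 0.3 + 0.02 * (years - 2000)    # hypothetical slowly drifting effect
se = np.full(years.size, 0.15)               # standard error of each yearly estimate
estimates = rng.normal(true_effect, se)

# Plot the whole series of estimates with uncertainty, rather than
# interpreting each one in isolation.
plt.errorbar(years, estimates, yerr=2 * se, fmt="o", capsize=3)
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("year")
plt.ylabel("estimate +/- 2 s.e.")
plt.show()
```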

Summary

I like Tong’s article. The above discussion is intended to offer some modifications or clarifications of his good ideas.

Tomorrow’s post: “Superior: The Return of Race Science,” by Angela Saini

23 thoughts on ““Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science””

  1. Andrew wrote:
    This reminds me of the idea in decision analysis that the most important thing is not the solution of the decision tree but rather what you decide to put in the tree in the first place, or even, stepping back, what are your goals.

    I’ve made a similar point many times. When shopping it is more important to buy the right object than to get a good deal on what you buy. Thinking carefully about what you need is more important than sharp bargaining.*
    *Of course, both combined is even better.

    Bob

  2. “That said, I have one concern, which is that “the threat is more powerful than the execution” only works if the threat is plausible.”

    Andrew:

    If consistency with something like L.J. Savage’s axioms matters to me, then don’t I already have some reason to follow through on the decision analysis so long as I think it will help me to act in accordance with those standards of consistency? No external threat needed. I’m not trying to cloud the waters here, but I’m pretty sure that decision analysis pioneers like Ronald Howard (Stanford) and Howard Raiffa (Harvard) do talk about the role of the decision maker’s commitment to the underlying normative standard in motivating the decision maker to pursue the analysis. Your “plausible threat” seems to offer a similar account of that motivation. Maybe plausible threat is required when you don’t trust the decision maker?

  3. I agree with many of your points. Tong seems to be trying to get too much closure or be too definitive. His points about the flux and back-and-forth present in any real live analysis get somewhat set aside by that. But I think he covered the secret weapon quite thoroughly, starting with “A single set of data can rarely give definitive results” and repeatedly pointing out the futility of a single set of data.

    What struck me is the remarkable lack of interest and uptake by the statistical discipline in engaging with multiple sets of data until fairly recently. For instance, Nelder (1986) called this “The Cult of the Isolated Study,” but 30+ years later hardly anyone includes it as a topic in statistical service courses. I did in 2007/8 at Duke, but I am pretty sure that was unusual then and likely still is now.

  4. I really do not understand why medical clinical trials are held up as some kind of ideal. They can result in a very crude rote understanding (if you see this, do that) at best. At worst the results have little to do with any individual situation since they only apply to some average person sampled from a group with properties not altogether like you.

    I’ve said it before, but my attempts to read clinical trial results to help make a personal medical decision have left me without any usable insight.

    • Agreed. The experience really brought home to me Andrew’s oft-repeated complaint that when you compare averages you throw away a lot of possibly interesting data.

  5. Full disclosure: Chris Tong is a good friend.

    There seems to be some confusion here. Tong does not disparage modeling per se; quite the contrary — AFAICS, he views it as a useful aspect of “enlightened description.” What he disparages are definitive statements of **uncertainty** (probability-based, sampling) about the model, such as p-values, CI’s, posterior distributions, etc. of model parameters, which are then often used to judge the scientific worthiness of the models. You may disagree with him about this, of course.

  6. “Medicine is a conservative science and behavior usually does not change on the basis of one study.”

    Actually, as a physician and epidemiologist, my view is that Medicine is an obtuse science. Even a raft of good studies often has great difficulty changing practice. At the same time, a strong marketing campaign will ensconce a new technology firmly in place long before there is even a shred of evidence that it is better than what it replaced (in terms of patient-centered clinical outcomes), or even that it isn’t worse!

  7. I am pleased and grateful to see this discussion of my TAS paper, and for Prof. Gelman’s generally positive reaction. It is true that he and I differ sharply on exploratory vs. confirmatory, but otherwise are not as far apart as it may seem, as I hope to explain below. (Caveat: I do not know the field of survey sampling, even superficially, and perhaps my perspective would be better informed if I did.)

    First, I agree with Bert’s comments above. Right after I quoted Freedman (1995) about fictional models (Sec. 7.1), my very next word is “However.” I then go on to describe how statistical models *can* be useful. Later in Sec. 7.3 I quote Harrell (2018), “a statistical model is often the best descriptive tool, even when it’s not used for inference,” and I give a short example using the loess smoother. Regarding making informal decisions based on noisy data, without waiting for conclusive results: a great deal of my own work falls under this category, and formal statistical inferences are usually inappropriate and should be avoided. A carefully prepared descriptive account of the data at hand is often sufficient for making such decisions (including the use of modeling tools such as the loess smoother); and when it isn’t, the answer is to get more and better data. Regarding making statistical adjustments, I gave one such example myself (batch effects, Sec. 7.4) but I do not consider such adjustments “inferential” in the sense given in Sec. 2: by expressing uncertainty through a probability claim. (I concede that point estimation, including for adjustments, is also part of statistical inference, but thus also subject to model selection bias. However, this is not the main focus of my argument, as Bert noted: it is the use of probability statements to express uncertainty.)

    I claimed that “There is, of course, one arena of science where…attitudes toward statistical inferences are sound: the phased experimentation of medical clinical trials.” Prof. Gelman responded that he doubts that “the uncertainty in exploratory, pre-clinical analyses is correctly handled when it comes time to make decisions in designing clinical trials.” This is a very puzzling remark, but anyone worried about it should remember that phase II (therapeutic exploratory) trials take place between pre-clinical research and phase III (confirmatory) trials, and for good reason. The greatest uncertainty between pre-clinical and clinical trials is simply the leap from animal models to humans, and this uncertainty *cannot* be fully quantified statistically. Thus, if there is a way to “correctly handle” it, many of us would like to know. Please bear in mind too that pre-clinical studies conducted within industry are incentivized to be “right” (ie, reproducible) rather than merely publishable, unlike academic pre-clinical research, as Steve Ruberg reminded me. It is no accident that some of the earliest reports warning of the crisis of non-reproducible research were given by pharmaceutical industry pre-clinical scientists (Prinz, et al, 2011; Begley & Ellis, 2012).

    Finally, with regard to the “secret weapon”, I recommend the paper by Michael Lavine (in the same TAS special issue), which gives excellent case studies.

    • Christopher:

      With all due respect, and acknowledging that I thought yours was one of the best papers in the TAS journal issue, you seem to be expressing too much certainty in your claims here, of the kind you argue against interpreters of statistical analyses making in your paper.

      “This is a very puzzling remark”, “actually from Piantadosi (2017), properly cited” and “greatest uncertainty between pre-clinical and clinical trials is simply the leap from animal models”.

      From my experience in academic clinical research and drug regulation, my colleagues and I were often not sure that “the uncertainty in exploratory, pre-clinical analyses is correctly handled”; the [adoption] “long before there is even a shred of evidence that it is better than what it replaced” was documented to be the case in academic cancer research, at least for serious cancers with few viable treatments; and animal studies in some cases provide less uncertainty than what can be ethically done in human studies.

      Also I would disagree with “industry incentivized to be “right” (ie, reproducible)” – they are incentivized to obtain approval with the least number of restrictions.

      • The rampant non-reproducibility of published pre-clinical research is also well-documented, of course, not only in cancer, but in several other therapeutic areas. A mature (industry) sponsor of a clinical development program is acutely aware of the substantial financial commitment involved, thus, would not be willing to base such a program on research that they have not attempted to replicate themselves. Hence my comment that pre-clinical scientists *in industry* were among the first to warn of the non-reproducible research crisis. In this setting, attempting to replicate the research is the “correct” way to handle *part* of the pre-clinical uncertainty.

      • Keith,

        As one employed in pharma for decades I can assure you that there is as much interest in killing drugs in early stage clinical trials as advancing them to Ph 3 and submission/approval. Pharma cannot afford to advance every candidate drug into Ph 3. Therefore, they are keen to select only the best treatments to go into VERY expensive Ph 3 trials and certainly want to avoid those that do not work.

        So, yes, pharma wants to get drug approvals, but only for drugs that work better than the competitors, are safe, and can therefore make a return on investment. As such, there is a VERY strong emphasis on and financial incentive for “getting the right answer” in Ph 2. The consequences for getting it wrong are very expensive and damaging to the research enterprise of a pharma company.

        PS The calculus is a bit different for a start-up or small company that only has one treatment in their pipeline. In that case, I believe there is a bias to get a positive answer in Ph 2 and progress to Ph 3. The counter-balance to this is the investors who do not want to throw good money after bad. So, even in that circumstance there is some incentive to “get it right.”

        • > some incentive to “get it right.”
          Of course, as getting it right is extremely helpful to obtain approval with the least number of restrictions. But my sense of the “bottom line” was approval.

          Also, this is pretty much the perspective of the product manager rather than, say, senior management. The product managers are more like, as you say, a start-up or small company that only has one treatment in their pipeline.

          Now, I am not saying that they don’t plan and try very hard (and spend tons of money) to get it right, but once “the tires hit the road” the target is on getting approved.

          I actually have a high regard for the quality and care taken by pharma in the research they do, but also when I reviewed their work I had to keep their (huge) incentives in mind. Now, some pharma were very candid and acknowledged they hire former regulators to run mock evaluations of the review process to optimize the likelihood of obtaining approval with the least number of restrictions in the actual review. Smart people.

          Also, Steve thanks again for those slides you gave me, very helpful.

    • The greatest uncertainty between pre-clinical and clinical trials is simply the leap from animal models to humans, and this uncertainty *cannot* be fully quantified statistically. Thus, if there is a way to “correctly handle” it, many of us would like to know.

      It is to actually collect data that helps you understand the processes at play. Doing that means dropping the idea that the practices in clinical research for the last 70 years are some kind of epitome of good science, so I doubt that will happen any time soon.

  8. The exploratory/confirmatory distinction, and the extent to which certain exploratory analyses, visualisation, and the like can be freer (if not totally free) from models and model assumptions, are very subtle and complex. Human beings tend to be prone to binary thinking and therefore a clear distinction seems attractive; however, here much if not almost all can be found between the extremes, and much discussion (of which I don’t know to what extent it can already be found in various places) can follow.

    For me the primary thing to remember is always that models are formal and reality is not – unless operations are carried out to formalise and therefore change it. Data can maybe be understood as having forced reality into formalised form, and models make us think about what’s behind the data as some kind of formal mechanism. Both a major advantage and disadvantage of this are immediate. a) It enables us to reason in formal ways about reality and use the tools of mathematics and statistics. b) It implies that we think about reality as something that it isn’t. Maybe the most helpful thought about models is that we can use them much better if we always remind ourselves that they are essentially different from reality!

    This should make us on one hand cautious about their use and overinterpretation and give us appreciation of more direct descriptive, exploratory and visual techniques (about which we at the same time should not forget that they imply their own – if milder – implicit model assumptions), which will hopefully hypnotise us less into taking a model for reality. But on the other hand it should also legitimize learning from models in ways that use them properly as tools for thinking and analysis, keeping in mind that they are not more than that. This could for example mean looking at the distribution of a certain statistic under a model and comparing it with the observed value, and then realising that the observations are clearly incompatible with the model (which may have formalised a certain idea that we may have had about reality), or at least in this respect compatible (meaning that they cannot be used as evidence for something essentially different) – or something in between. Which is actually the basic principle of a statistical test, but without “accepting” either the null model or any specific alternative as true.
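
    (A small Python sketch of this kind of comparison; the exponential data, the normal model, and the skewness statistic are arbitrary illustrative choices, not anything from the discussion above.)

        import numpy as np

        rng = np.random.default_rng(3)
        y = rng.exponential(1.0, size=100)               # hypothetical observed data

        def skewness(d):
            d = np.asarray(d, dtype=float)
            z = (d - d.mean(axis=-1, keepdims=True)) / d.std(axis=-1, keepdims=True)
            return np.mean(z ** 3, axis=-1)

        # distribution of the statistic under a normal model fitted to the data
        sims = rng.normal(y.mean(), y.std(), size=(2000, y.size))

        print("observed skewness:", skewness(y))
        print("compatible with the normal model:", np.percentile(skewness(sims), [2.5, 97.5]))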

    (Of course I agree with Andrew that much useful exploratory analysis can be done in connection with models.)
