Many perspectives on Deborah Mayo’s “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars”

Posted on September 30, 2019 5:42 PM by Andrew

This is not new—these reviews appeared in slightly rawer form several months ago on the blog.

After that, I reorganized the material slightly and sent it to Harvard Data Science Review (motto: “A Microscopic, Telescopic, and Kaleidoscopic View of Data Science”) but unfortunately reached a reviewer who (a) didn’t like Mayo’s book, and (b) felt that our article was unfocused. I can’t say much about (a), but (b) is fair enough.

Nonetheless it seemed to me that, unfocused or not, our article, this annotated collection of reactions to a controversial recent book on the philosophy of statistics (and how often can one say that???), could be of some interest within the statistical modeling, causal inference, and social science community.

So here I’m posting our article (by Andrew Gelman, Brian Haig, Christian Hennig, Art B. Owen, Robert Cousins, Stan Young, Christian Robert, Corey Yanofsky, E. J. Wagenmakers, Ron Kenett, and Daniel Lakeland). Enjoy.

17 thoughts on “Many perspectives on Deborah Mayo’s “Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars””

jrc on September 30, 2019 at 7:07 pm said:

I vote our own Daniel Lakeland for the win:

“So I see Andrew’s interest in frequency evaluation of Bayesian models as one manifestation of this broader concept of model checking as fitness for purpose. Our goal in Bayesian reasoning isn’t to get “subjective beliefs” or “the true value of the parameter” or frequently be correct, or whatever, it’s to extract useful information from reality to help us make better decisions, and we get to define “better” in whatever way we want. Mayo doesn’t get to dictate that frequency of making errors is All That Matters, and Ben Bernanke doesn’t get to tell us that Total Dollar Amounts are all that matter, and Feynman doesn’t get to tell us that the 36th decimal place in a QED prediction is all that matters.

And this is why “The Statistics Wars” will continue, because The Statistics Wars are secretly about human values, and different people value things differently.”

- Andrew on September 30, 2019 at 7:55 pm said:
  
  Jrc:
  
  Sure, but it’s not the best comment ever or even the comment of the year.
  
- Deborah G. Mayo on September 30, 2019 at 11:45 pm said:
  
  Regarding the Lakeland comment: Anyone who actually read even the first page of my book would know I reject the claim “that frequency of making errors is All That Matters”, so I couldn’t possibly be dictating it as he alleges. Take the thumbnail blurb alone:
  
  “This book pulls back the cover on disagreements between experts charged with restoring integrity to science. It denies two pervasive views of the role of probability in inference: to assign degrees of belief, and to control error rates in a long run.”
  
  In short, Gelman is agreeing with me on this.
  
  - jrc on October 1, 2019 at 12:59 am said:
    
    I thought the specific examples Daniel provided (you, Bernanke, Feynman) were meant to be caricatures/stand-ins for certain (maybe naive) candidate “highest values” in some meta-philosophy of science (or some weighting in a utility function or a decision analytic framework). So I was not trying to endorse that as a fair characterization of your position.
    
    Here are the two things I did like about the comment and do want to endorse:
    
    1. the phrase “fitness for purpose” as the goal of statistical methods, models and model checking; and as the ultimate aim of statistical analysis.
    
    2. the underlying pluralism and epistemic humility in his critique of taking sides in some “statistics wars”. Given (1), who would argue that there exists one right set of methods or models or, I think, epistemologies? What you need from statistics is going to depend on what you want to do in the world, and so there is lots of room for different kinds of approaches to, and perspectives on, learning from statistics.
    
    - Keith O'Rourke on October 1, 2019 at 8:22 am said:
      
      > “fitness for purpose” as the goal of statistical methods, models and model checking; and as the ultimate aim of statistical analysis.
      Believe that is the key point – we want purposeful statistics that are scientifically profitable for empirical inquiry.
      
      In fact in any reasoning process the logical purpose is likely key. For instance, in “recent Peirce scholarship, the problem of logical goodness, i.e., how ideas fulfill their logical purpose. … that goodness is attributed to ideas or statements which fulfill their purpose in the world …” https://www.commens.org/bibliography/journal_article/el-khachab-chihab-2013-logical-goodness-abduction-c-s-peirce%E2%80%99s-thought
      
      [CS Peirce] invented the name pragmatism. Some of his friends wished him to call it practicism or practicalism … but … along with nineteen out of every twenty experimentalists who have turned to philosophy … praktisch [practical] and pragmatisch [pragmatic] were as far apart as the two poles, the former belonging in a region of thought where no mind of the experimentalist type can ever make sure of solid ground under his feet, the latter expressing relation to some definite human purpose.
      WHAT PRAGMATISM IS The Monist, 15:2 (April 1905), pp. 161-181.

Deborah G. Mayo on September 30, 2019 at 10:11 pm said:

This is around the third time over the past ~6 months that I’ve wandered over to this blog and found something mentioning me (out of not so many times, as I’ve not had time even for my own blog). It’s terrifying. Clearly, some of these reviews are more careful and illuminating than others. I recall supplying over 100 responses and it took the better part of two days, so I’m sorry not to have my replies included, but maybe that’s too unwieldy. I hereby offer to weave comments around these reviews and make the whole thing hang together, if you’re interested. I’ve been hearing that very few psychology journals review books, which is too bad. It seems a review of SIST would fit there.

The reason I wound up over here is that I was going to email Andrew about my blogpost on the NAS report on reproducibility and replication erroneously defining P-values throughout the report. Then I thought, just in case he’s blogging about P-values, I can give the link there. So here it is. https://errorstatistics.com/2019/09/30/national-academies-of-science-please-correct-your-definitions-of-p-values/

- Andrew on September 30, 2019 at 10:25 pm said:
  
  Deborah:
  
  I’d love it if you could share your reactions to those reviews and put this all in one place. At the very least, I can append it to the article above. It’s on Arxiv, and between Arxiv and this blog, I expect it will get more readers than what would typically be received by a book review published in a journal.
  
  I do think the above review is publishable (even more so if it includes your responses); it’s just that most journals would want some sort of rewrite to make it look more like a journal article, whatever that means. Fair enough: their journal, their call. It’s rare when a journal will want me to publish an article as is, in my own style, but when that does happen, as here, I’m happy. Indeed, I point people to that article all the time. On the minus side, I just looked it up and that article has only 39 citations, so I guess that my style, readable though it may be, might not always be the best way to engage with the scientific literature.
  
  - Deborah G. Mayo on September 30, 2019 at 11:58 pm said:
    
    Andrew:
    Well if you had told me you were submitting it, I would have suggested this. One way–and I wouldn’t know until I’ve reread them all–would be to group by topic along with my responses. Anyway, I’m willing.
    
    I’m a bit hurt to hear you suggest that the reviewer at Harvard Data Science Review rejected your compilation in part because (a) he didn’t like my book. What a shock to meander over to your blog and see this.
    
    - Jake on October 1, 2019 at 9:30 am said:
      
      There’s jerks everywhere, even at HDSR.
    - Sameera Daniels on October 1, 2019 at 10:16 am said:
      
      Deborah,
      I’m sure any obstacle to publishing can be surmounted.

Justin on October 2, 2019 at 11:37 am said:

I’m a big fan of this book and recommend it whenever I can. I believe it ties philosophy, logic, probability, statistics, and science together very well. And kudos to Wagenmakers for showing the effect of having prior=0.

Justin
https://www.statisticool.com

Ron Kenett on October 3, 2019 at 9:29 am said:

Andres – thank you for sharing this. The randomness in the review process and the selection bias in how editors assign papers to reviewers is an interesting topic by itself.

My comment is about the statement that: “fitness for purpose” is the goal of statistical methods, models and model checking; and the ultimate aim of statistical analysis.

No, I do not think so. My statement would be that “statistics is about the generation of information quality”. This is a wide angle perspective which also considers proper data integration, chronology of data and goal, generalisation of findings, operationalisation of findings and proper communication. People want information, and the job of statistics is to generate information of quality.

Moreover, “fitness for purpose” is reminiscent of the quality movement in the 80s and the discussions then on the definition of quality. Juran nailed it by identifying two goals: 1) doing the right things and 2) doing things right. The first goal gets you to look at the “customers”, what they need, what they say they need and what they want. The second goal is about efficiencies, reduced rework, reduced mistakes etc…

Focusing on information quality means, for me, handling both objectives. Our analysis should be done right and severe testing is part of that. We should, however, also do the right analysis and provide the information sought by our “customers”, including an answer to the question: OK, nice analysis, what can I do with it?

I have been accused of being self-serving, so I will refrain from providing links to the various write-ups where I discuss all these issues.

- Keith O'Rourke on October 3, 2019 at 12:24 pm said:
  
  I think there is some difference in vocabulary, as well as a clash of business (practical science) purposes versus science (of discovery) purposes arising here.
  
  For instance here, https://statmodeling.stat.columbia.edu/2019/09/30/many-perspectives-on-deborah-mayos-statistical-inference-as-severe-testing-how-to-get-beyond-the-statistics-wars/#comment-1133343 , I used the phrase “purposeful statistics that are scientifically profitable for empirical inquiry” to place it in the (science of discovery) purpose of getting less wrong about reality.
  
  “goodness is attributed to ideas or statements which fulfill their purpose in the world” (Peirce). The question becomes, then, what purpose do ideas or statements have?
  
  - Martha (Smith) on October 3, 2019 at 3:35 pm said:
    
    Good points.
    
Ron Kenett on October 3, 2019 at 9:30 am said:

Andrew – Sorry for the typo

Peter Monnerjahn on October 6, 2019 at 6:15 pm said:

Given that in your paper you talk about how Mayo “expanded Karl Popper’s idea of falsification”, I was a bit disappointed to see that you hadn’t been able to find a Popper scholar to comment on what that idea was exactly and whether Mayo’s claimed expansion actually deserves to be called that. Both questions are actually crucial to a critical appraisal of Mayo’s book.

With your permission (and indulgence), I should like to add an examination of those questions to the discussion. (Apologies for cross-posting, but the other post’s comment thread has become quite unwieldy.)

Trying to keep this review relatively short, I should preface it by saying that I found SIST to be a very mixed bag of some welcome passages pointing out some context, especially in Fisher and Popper, that is usually ignored. Unfortunately, that is completely undermined by Mayo’s all-out inductivism—or, to give it its full name, “the whole discredited farrago of inductivism” (Medawar).

Mayo’s launching-off point for SIST is this question: “How do humans learn about the world despite threats of error due to incomplete and variable data?” (SIST, xi) Her answer, in short, is: by severely testing our claims. This is, at least in part, based on the philosophy of Karl Popper, arguably the 20th century’s most important philosopher of science: “The term ‘severity’ is Popper’s, though he never adequately defined it.” (SIST, 9) Mayo proposes that she has actually found a previously missing adequate definition, and it is this: “If [a claim] C passes a test that was highly capable of finding flaws or discrepancies from C, and yet none or few are found, then the passing result, x, is evidence for C.” (SIST, 14) This Mayo does not want to be misunderstood as saying that, using this method, we find true or even probable statements: “I say inference C may be detached as indicated or warranted, having passed a severe test” (SIST, 65). And this, per Mayo, is explicitly an example of “ampliative or inductive reasoning” (SIST, 64), for which a variety of statistical methods (depending on context) can be used.

About those statistical methods, it is to Mayo’s credit that she stresses, right out of the gate, that one of the most pernicious uses of statistical methods had been anticipated and explicitly denounced by Fisher himself as early as 1935: “Fisher…denied that an isolated statistically significant result counts” (SIST, 4), going on to quote him saying that “[i]n relation to the test of significance” we need to “know how to conduct an experiment which will rarely fail to give us a statistically significant result”. That admonishment alone should have been enough, one would have thought, to preclude basing any far-reaching conclusions on a single study’s outcome (e.g. its p-value), such as “claiming a discovery”.

Similarly, it is very welcome for Mayo to point out (SIST, 82-3) that Popperian falsification is not achieved by noting one observation that contradicts a theory but only with the help of something that Popper called a “falsifying hypothesis”. It would have been helpful, however, to actually quote the relevant passage from Popper’s Logic of Scientific Discovery, as Mayo (partially) did in her earlier book (EGEK, 14):

> We must clearly distinguish between falsifiability and falsification. …
>
> We say that a theory is falsified only if we have accepted basic statements which contradict it. This condition is necessary, but not sufficient; for we have seen that non-reproducible single occurrences are of no significance to science. Thus a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory. In other words, we only accept the falsification if a low-level empirical hypothesis which describes such an effect is proposed and corroborated. This kind of hypothesis may be called a falsifying hypothesis. The requirement that the falsifying hypothesis must be empirical, and so falsifiable, only means that it must stand in a certain logical relationship to possible basic statements….
>
> … If accepted basic statements contradict a theory, then we take them as providing sufficient grounds for its falsification only if they corroborate a falsifying hypothesis at the same time.

All well and good. But then, unfortunately, Mayo’s train of argument veers dangerously off course, eventually missing its intended target completely. One would certainly have to concur with her when she says, “The disagreements often grow out of hidden assumptions about the nature of scientific inference”. In Mayo’s case, at least, the assumptions aren’t all hidden. She is brazenly upfront about championing induction, for a start. “In getting new knowledge, in ampliative or inductive reasoning, the conclusion should go beyond the premises” (SIST, 64), she claims, and: “Statistical inference goes beyond the data – by definition that makes it an inductive inference.” (SIST, 7-8) Now, she is perfectly aware that induction has a bit of a problem: “It’s invalid, as is so for any inductive argument.” (SIST, 61) But that isn’t a problem, according to Mayo, on the contrary: “We must have strictly deductively invalid args to learn” (Twitter). Indeed, Mayo has confirmed in a separate conversation that she is “not talking of a logic of induction”. What, then, is she talking about when she talks about “inductive inference”? The answer: “error probabilities”, in line with her definition of severe tests. In fact, Mayo thinks that Popper foundered because “he never made the error probability turn” (SIST, 73).

One other assumption, however, is a little removed from plain view. This is Mayo’s assumption about the aim of science. This remains surprisingly vague, with “craving truth” (SIST, 7), “learn[ing] about the world” (xi), “getting new knowledge” (64), and “distinguish[ing] approximately correct and incorrect interpretations of data” (80) being our only hints as to what, in Mayo’s view, it is all about. Even more surprisingly for a professional philosopher, she never defines, or otherwise talks about, the terms ‘truth’, ‘learning’, and ‘knowledge’, as if they were self-explanatory and everybody was in agreement about their meaning—which they emphatically aren’t and obviously everybody isn’t. Other crucial terms, such as ‘inference’ and ‘hypothesis’, fare only a little better, getting a casually hand-wavy one-sentence definition each.

Not quite incidentally, these terms, and the concepts behind them, are crucial to any understanding of the philosophy of science—and especially Popper’s version of it. For all her professed sympathy for Popper’s philosophy, Mayo unfortunately either misunderstands or flat-out ignores some of the most central concepts in Popper’s philosophy. These, in fact, turn out to be the keys to a solution to the current crisis in the social sciences. (continued below)

- Peter Monnerjahn on October 6, 2019 at 6:20 pm said:
  
  (cont.)
  
  It all starts with Mayo’s ill-considered faith in induction. Popper emphatically denied that there was any kind of induction—not just that as a logical process it was invalid but also that any kind of inductive reasoning was used either for theory formation or in the production of knowledge. Mayo variously claims that Popper only rejected “enumerative induction”, that corroboration via falsifying hypotheses necessitates “an evidence-transcending (inductive) statistical inference” (SIST, 83), and even (without, I should add, being able to provide any evidence) that Popper actually “doesn’t object” to calling such an inference ‘inductive’—claims that range from the wildly mistaken to the outright preposterous. Compare, for example, Popper’s treatment of Baconian induction, which goes far beyond the enumerative kind (Logic, passim and 438); his footnote in § 22 of Logic explaining the concept of a ‘falsifying hypothesis’; and this almost derisive put-down: “It is clear that, if one uses the word ‘induction’ widely and vaguely enough, any tentative acceptance of the result of any investigation can be called ‘induction’.” This last reply could just as well have been directed at Mayo, who certainly uses ‘induction’ vaguely enough to warrant it.
  
  Just as weirdly, Mayo seems to be unaware that Popper had a subtly but completely different aim in mind with respect to science. For her, it is about how “humans learn about the world” and how we “get new knowledge”. For him, the “central problem” is “the problem of the growth of knowledge” (Logic, Preface 1959). Popper’s aim is not to find new knowledge but ever better knowledge; the difference should be obvious after a moment’s thought: “new knowledge” doesn’t even so much as imply any coherence, let alone improvement. Popper understood very well that it’s impossible to judge whether a theory is per se near or far from some absolute truth; that’s why everything in his methodology is about making it possible to judge whether some theory is at least better than some other(s). Popper’s view—entirely correct, in my estimation—is that induction is not just useless, it is not even needed.
  
  When she dismisses deductive logic, Mayo not-so-subtly shifts the goalposts from a critic’s observation that induction is not even valid to some variation of, ‘Oh but then deductive arguments don’t ensure soundness’ (i.e. truth). Well, that’s actually not what any argument does. What logic can do (iff we accept the principle of non-contradiction) is to let us force ourselves to make a choice—in the logician Mark Notturno’s phrase: “No argument can force us to accept the truth of any belief. But a valid deductive argument can force us to choose between the truth of its conclusion on the one hand and the falsity of its premises on the other.” In a methodology that is about deciding which of two ideas is better, that is in fact all you need; again, Notturno:
  
  > If the purpose of an argument is to prove its conclusion, then it is difficult to see the point of falsifiability. For deductive arguments cannot prove their conclusions any more than inductive ones can.
  >
  > But if the purpose of the argument is to force us to choose, then the point of falsifiability becomes clear.
  >
  > Deductive arguments force us to question, and to reexamine, and, ultimately, to deny their premises if we want to deny their conclusions. Inductive arguments simply do not.
  >
  > This is the real meaning of Popper’s Logic of Scientific Discovery—and it is the reason, perhaps, why so many readers have misunderstood its title and its intent. The logic of discovery is not the logic of discovering theories, and it is not the logic of discovering that they are true.
  >
  > Neither deduction nor induction can serve as a logic for that.
  >
  > The logic of discovery is the logic of discovering our errors. We simply cannot deny the conclusion of a deductive argument without discovering that we were in error about its premises. Modus tollens can help us to do this if we use it to set problems for our theories. But while inductive arguments may persuade or induce us to believe things, they cannot help us discover that we are in error about their premises.
  
  Consequently, Mayo is similarly off the mark when she thinks science is about marking out “approximately correct” ideas (SIST, 80). By what standard? We don’t know, because Mayo didn’t bother to say what she takes ‘truth’ to mean. She also doesn’t mention that Popper had a different idea. In Notturno’s words: “The primary task of science is not to differentiate the true from the false—it is to solve scientific problems.” For Popper, scientific theories (and hypotheses, which are substantially the same thing) are about explanation; if there is no explanatory theory, there are no hypotheses and there is no knowledge. Mayo effectively turns all that completely on its head (EGEK, 11-2):
  
  > I want to claim for my own account that through severely testing hypotheses we can learn about the (actual or hypothetical) future performance of experimental processes—that is, about outcomes that would occur with specified probability if certain experiments were carried out. This is experimental knowledge. In using this phrase, I mean to identify knowledge of experimental effects (that which would be reliably produced by carrying out an appropriate experiment)—whether or not they are part of any scientific theory.
  
  In this way, she empties all relevant terms of any possibly helpful meaning. “Inferences” are said to be “detached” by “induction”—but that is in no way meant to even imply any application of actual logic. As Notturno remarked: “Popper used to call a guess ‘a guess’. But inductivists prefer to call a guess ‘the conclusion of an inductive argument’. This, no doubt, adds an air of authority to it.” The same is, unfortunately, true for ‘hypothesis’, “or just ‘claim’”, which Mayo “will use…for any conjecture we wish to entertain” (SIST, 9)—explicitly, as she said earlier, “whether or not they are part of any scientific theory”. If you think that usage of ‘hypothesis’ carries rather strongly “the connotation of the wantonly fanciful”, Mayo specifically rules that in; Medawar, whose phrase that is, rather optimistically thought it was the bad old days when there was no “thought that a hypothesis need do more than explain the phenomena it was expressly formulated to explain. The element of responsibility that goes with the formulation of a hypothesis today was altogether lacking.” (Schilpp: The Philosophy of Karl Popper, 279) With respect to science’s being grounded in theories, Mayo is working mightily to resurrect an irresponsibility that was presumed happily dead long ago. Medawar quotes Claude Bernard with a prescient passage that seems all too fitting a description for what’s wrong with today’s social sciences: “A hypothesis is…the obligatory starting point of all experimental reasoning. Without it no investigation would be possible, and one would learn nothing: one could only pile up barren observations.” (Schilpp, 288)
  
  To Mayo, though, that isn’t worth a single word. At least she is in good (or rather: numerous) company. Anything and everything to do with ‘theory’ is a huge blind spot in current social science. Everybody seems to be focused on bad, misunderstood, and allegedly broken statistics—and boy, is there a lot of bad and misunderstood statistics around. There are precious few voices in the wilderness, among them Denny Borsboom, who bluntly states: “It is a sad but, in my view, inescapable conclusion: we don’t have much in the way of scientific theory in psychology.” This “Theoretical Amnesia”, as he calls it, is the elephant in the room of the replication crisis. It even explains why some of the methodological suggestions to stem the tide of bad science, like preregistration, spectacularly miss the point:
  
  > And that’s why psychology is so hyper-ultra-mega empirical. We never know how our interventions will pan out, because we have no theory that says how they will pan out (incidentally, that’s also why we need preregistration: in psychology, predictions are made by individual researchers rather than by standing theory, and you can’t trust people the way you can trust theory).
  
  Mind you, Borsboom himself thinks that some disciplines just “are low on theory”. Why that should be an inescapable fact, however, he does not say. And it goes without saying that any discipline we might care to think of today was in its history at a point where it was “low on theory”. Biology and chemistry had no theory to speak of as recently as roughly 150 years ago. But it takes not just lots and lots of work (which people are obviously willing to put in) but also the good fortune to be working on problems that are ripe for yielding answers—and an idea of what a unifying, explanatory theory actually looks like, which is no different in the social sciences than in any others. As it is, even Mayo’s book-length attempt to “get beyond the statistics wars” is hardly even a baby step. Even the “severe tester” of Mayo’s imagination remains condemned, in Bernard’s sadly apt phrase, “to wander aimlessly”.
  

Statistical Modeling, Causal Inference, and Social Science
