However, to use the models of probability to do calculations, one needs to be sure that the probabilities of various events are “coherent”. “Subjective probabilities” can make sense, as long as they are “coherent” as a system.

(See https://web.ma.utexas.edu/users/mks/statmistakes/probability.html for an attempt at a not-too-technical elaboration.)

The following question is irrelevant. The number of people who agree/disagree with a position is of no import; what matters is whether a) they understand the relevant problems and b) they have pertinent arguments to support their position. About *that* we can have a conversation.

“Degrees of certainty” are supported by the same argumentative structure as simple “certainty” and consequently fail for exactly the same reasons.

No, probabilities are not a valid way to express degrees of certainty either—unless you’re prepared to make “probability” into something entirely subjective, which would of course be completely incompatible with every other technical use of “probability”.

Nothing provides certain knowledge. There is no such thing. Not even about things that are *not* about the world. Not even anything in logic and mathematics can claim to be absolute, certain knowledge, as every knowledge claim rests on assumptions. And those, like anything else, are fallible.

What deductive arguments can do is provide us with *a logically valid way to learn*. That’s Notturno’s point I quoted above. And he doesn’t “accept” anything about induction, he just says that the only thing it can do is make us believe things—which is purely psychological, subjective, and has nothing to do with either logic or knowledge.

If you’d be interested in delving deeper into these questions, I’d seriously recommend you read Popper’s *Objective Knowledge*; it’s a very good book and will answer most questions you will have on this topic, I suspect. :)

> What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.

[Jeffreys, H. (1961). Theory of probability (3rd ed.), p.385]

Jeffreys goes on to argue that the non-observance of more deviant results (more deviant than the test statistic d0, where Pr[d >= d0; H0] = p)

> might more reasonably be taken as evidence for the law, not against it.

(The ‘law’ is the null hypothesis H0.) This is with reference to a standard one-sided test statistic.

Mayo claims to debunk this argument in the initial discussion on pp.168-170, judging p-values relative to what they claim to measure. Using an imprecise form of argument, Mayo claims that

> Considering the tail area makes it harder, not easier, to find an outcome statistically significant. . . . Why? Because it requires, not merely that Pr[d = d0; H0] be small, but that Pr[d >= d0; H0] be small.

Mayo has the argument the wrong way round. For a test statistic that has a continuous distribution, Pr[d = d0; H0] = 0. No doubt it is the density that she has in mind! This is, however, not directly comparable with a tail probability. A more nuanced examination is required.

Now, take d_α to be the test statistic value such that Pr[d >= d_α; H0] = α. Thus, for α = 0.05, assuming a normal distribution, d_α = 1.64. Under standard modeling assumptions, Pr[d >= d_α; H0] is indeed the probability that a calculated p-value will be less than a pre-assigned cutoff α. This comes at the cost of ignoring the more nuanced information provided by the calculated p-value.
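This last relationship is easy to check by simulation; a minimal sketch under the stated assumptions (a one-sided test with a standard normal test statistic under H0):

```python
import math
import random

def p_value(d):
    # One-sided p-value for a standard-normal test statistic: Pr[Z >= d]
    return 0.5 * math.erfc(d / math.sqrt(2))

random.seed(1)
alpha = 0.05
n = 200_000
# Draw the test statistic under H0 and count how often the calculated
# p-value falls below the pre-assigned cutoff alpha.
hits = sum(p_value(random.gauss(0.0, 1.0)) <= alpha for _ in range(n))
print(hits / n)  # settles close to alpha = 0.05
```

The observed proportion settles near α, in line with Pr[d >= d_α; H0] = α.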

Now, the key point! One can evaluate E[d | d >= d_α]. For α = 0.05, this equals (to 2 d.p.) 2.06, corresponding to p ≈ 0.020. Thus p-values that are close to 0.05 will occur relatively more frequently than the average over the tail suggests, while p-values that are smaller than 0.020 will occur less frequently. The numbers will change if one has, e.g., a t-distribution rather than a normal distribution, but the general point remains valid. While the difference between 0.05 and 0.020 may or may not be of great consequence, it surely is important to understand the different interpretation that should apply when moving from a pre-assigned cutoff α to the calculated p-value.
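For a standard normal test statistic these numbers follow directly from the truncated-normal mean formula E[d | d >= d_α] = φ(d_α)/α; a minimal sketch using only the standard library:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal
alpha = 0.05
d_alpha = Z.inv_cdf(1 - alpha)         # cutoff, about 1.64
# Mean of the test statistic conditional on landing in the rejection tail:
# E[d | d >= d_alpha] = pdf(d_alpha) / alpha (truncated-normal mean)
tail_mean = Z.pdf(d_alpha) / alpha     # about 2.06
p_at_tail_mean = 1 - Z.cdf(tail_mean)  # about 0.02
print(d_alpha, tail_mean, p_at_tail_mean)
```

Swapping in a t-distribution would change the numbers but not the shape of the argument.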

On pp.246-248, Mayo refers again to arguments that P-values exaggerate. The focus moves to arguments based on Bayesian posteriors, Bayes factors, or likelihood ratios. For Bayesian statistics, common choices of priors are a focus of her criticism. Her more general argument is that one philosophy should not be criticized from the perspective of another. This is surely to challenge the legitimacy, in principle, of the criticisms that she directs at those who do not accept her philosophy!

The claim to have debunked Jeffreys is repeated on pp.332-333. On pp.440-441, Mayo revisits the arguments of pp.246-248.

(Regrettably, these are all too often not done — often with the rationalization, “That’s the way we’ve always done it” (TTWWADI) in our field.)

1) ‘Diagnostic checks’ on the modeling, while a crucial part of the exercise, are inevitably limited in what they can reveal.

2) Strong assumptions are made about what the population from which the data are drawn can, given the sampling process, reveal about the target population. What can a sample from a past population tell us about the current population? (If enough time has elapsed, I’d suggest ‘Little that is useful.’ If we are lucky, the passage of a few weeks, or a few years, will not matter.)

3) We commonly take it for granted that data collection and associated modeling strategies that seem to have worked well in the past will be effective in the future, perhaps in somewhat different contexts.

I do believe you when you say that you *would* argue that, but it would be much more interesting if you actually *did*. :)

And what does “induction…with a large deductive component” even mean?

As I understand her, Mayo wishes to severely limit the use both of the deductive processes of mathematical modeling and of data sources. I find her arguments for this unconvincing. Maybe these are not part of her philosophy; but why not? The effect is to severely compromise the use, and the bringing together, of different sources of evidence.

For Mayo, the modeling that leads to p-values is, provided the proper conditions are satisfied, a central pillar of her ‘philosophy’. The p-value that results can be used only to lend credibility (or not) to the alternative, and that credibility is not to be expressed as a probability. She frowns on any attempt to move from a p-value to an estimate of the probability that the null is false, even where a data-based estimate of the relevant prior is available, though she allows that there are special contexts where this is a fruitful way forward. Mayo accepts that arguments of this type have a role if one’s luggage triggers an alarm when screened at an airport — most alarms are false alarms. She acknowledges the usefulness, also, of a related methodology in gene expression studies, where the aim may be to identify which of perhaps 10,000 or 20,000 genes have been expressed. Why, then, an apparent objection in principle to the use of evidence suggesting that published studies of a specified type were overwhelmingly, as in [Begley and Ellis (2012)](https://www.nature.com/articles/483531a), not reproducible? Out of 53 ‘landmark’ cancer studies, only six could be reproduced, even after checking back with the authors involved to ensure that the methodology had been correctly reproduced. There are several ways that such evidence might be used.

Evidence such as that reported in the Begley and Ellis paper is on its own enough to warn anyone who hopes to reproduce one of those landmark papers that, unless they can find other evidence distinguishing the paper of interest from the rest, they should expect to be disappointed. Indeed, [Begley (2012)](https://www.nature.com/articles/497433a?draft=journal) identifies red flags that, where the methodology is reported in sufficient detail for them to be obvious, would mark out at least some of the 47 papers whose results could not be reproduced. Clearly, the p-values in those papers cannot be trusted, and the more relevant evidence is that provided in the Begley and Ellis (2012) paper.

Consider, alternatively, a hypothetical 53 papers that raise no red flags, where the issue is simply that a high proportion of null hypotheses are for all practical purposes true. A high proportion of the cases where an effect is found are, then, from the 5% of those results where the test statistic falls in the relevant 5% tail. It is then entirely legitimate to use the 6 out of 53 success rate to assign a prior probability, and accordingly to use Bayes’ theorem to move from a prior probability to a false discovery rate. Such a calculation may most simply be based on a pre-assigned choice of significance level α, as in [Ioannidis (2005)](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124). [Colquhoun (2017)](https://royalsocietypublishing.org/doi/abs/10.1098/rsos.171085) gives a more finessed calculation that uses the calculated p-value as the starting point. If no good evidence is available on which to base a prior, it is insightful to plot the false discovery rate against the prior. If p-values are to be used at all, it is surely desirable to get some rough sense of how they may relate to probabilities that are of more direct interest.
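A sketch of such a calculation, in the Ioannidis style with a pre-assigned cutoff α; the power of 0.8 is an illustrative assumption, and treating the 6-out-of-53 replication rate as the prior probability of a real effect is the hypothetical move described above, not a figure taken from any of the cited papers:

```python
def false_discovery_rate(prior_real, alpha=0.05, power=0.8):
    # Among results declared 'significant' at level alpha, the expected
    # fraction arising from true nulls (false discoveries), via Bayes' theorem.
    false_pos = alpha * (1 - prior_real)  # true nulls crossing the cutoff
    true_pos = power * prior_real         # real effects detected
    return false_pos / (false_pos + true_pos)

prior = 6 / 53  # rough prior from the replication success rate
print(round(false_discovery_rate(prior), 3))  # about 0.33
```

Plotting this function over a range of priors gives the kind of picture suggested for the case where no good prior evidence is available.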

Mayo argues (p.369) that ‘[taking] the diagnostic screening model literally,’ thus moving from a small p-value to a much larger false discovery rate, is likely to compromise important work that, notwithstanding a string of failures, will finally yield its secret to a skilled and dogged experimentalist. I think this is bizarre — the fact that evidence may be misused is not a reason for ignoring it!

On p.366, Mayo gives a numerical example, on the basis of which she claims that “the identification of ‘effect’ and ‘no effect’ with the hypotheses used to compute the Type I error probability and power are inconsistent with one another.” But this seems to rely on a mistake in her calculations: for the alternative against which the test has 0.9 power, one finds, assuming normality, α ≈ 0.115, rather than “almost 0.9”.

It all starts with Mayo’s ill-considered faith in induction. Popper emphatically denied that there was *any kind* of induction—not just that as a logical process it was invalid but also that any kind of inductive reasoning was used either for theory formation or in the production of knowledge. Mayo variously claims that Popper only rejected “enumerative induction”, that corroboration via falsifying hypotheses necessitates “an evidence-transcending (inductive) statistical inference” (SIST, 83), and even (without, I should add, being able to provide any evidence) that Popper actually “doesn’t object” to calling such an inference ‘inductive’—claims that range from the wildly mistaken to the outright preposterous. Compare, for example, Popper’s treatment of Baconian induction, which goes far beyond the enumerative kind (*Logic*, *passim* and 438); his footnote in § 22 of *Logic* explaining the concept of a ‘falsifying hypothesis’; and this almost derisive put-down: “It is clear that, if one uses the word ‘induction’ widely and vaguely enough, any tentative acceptance of the result of any investigation can be called ‘induction’.” This last reply could just as well have been directed at Mayo, who certainly uses ‘induction’ vaguely enough to warrant it.

Just as weirdly, Mayo seems to be unaware that Popper had a subtly but crucially different aim in mind with respect to science. For her, it is about how “humans learn about the world” and how we “get new knowledge”. For him, the “central problem” is “the problem of the growth of knowledge” (*Logic*, Preface 1959). Popper’s aim is not to find *new* knowledge but *ever better* knowledge; the difference should be obvious after a moment’s thought: “new knowledge” doesn’t even so much as imply any coherence, let alone improvement. Popper understood very well that it’s impossible to judge whether a theory is *per se* near or far from some absolute truth; that’s why everything in his methodology is about making it possible to judge whether some theory is at least *better* than some other(s). Popper’s view—entirely correct, in my estimation—is that induction is not just useless, it is not even needed.

When she dismisses deductive logic, Mayo not-so-subtly shifts the goalposts from a critic’s observation that induction is not even valid to some variation of, ‘Oh but then deductive arguments don’t ensure soundness’ (i.e. truth). Well, that’s actually not what *any* argument does. What logic can do (iff we accept the principle of non-contradiction) is to let us force ourselves to make a choice—in the logician Mark Notturno’s phrase: “No argument can force us to accept the truth of any belief. But a valid deductive argument can force us to choose between the truth of its conclusion on the one hand and the falsity of its premises on the other.” In a methodology that is about deciding which of two ideas is *better*, that is in fact all you need; again, Notturno:

> If the purpose of an argument is to prove its conclusion, then it is difficult to see the point of falsifiability. For deductive arguments cannot prove their conclusions any more than inductive ones can.
>
> But if the purpose of the argument is to force us to choose, then the point of falsifiability becomes clear.
>
> Deductive arguments force us to question, and to reexamine, and, ultimately, to deny their premises if we want to deny their conclusions. Inductive arguments simply do not.
>
> This is the real meaning of Popper’s *Logic of Scientific Discovery*—and it is the reason, perhaps, why so many readers have misunderstood its title and its intent. The logic of discovery is not the logic of discovering theories, and it is not the logic of discovering that they are true. Neither deduction nor induction can serve as a logic for that.
>
> The logic of discovery is the logic of discovering our errors. We simply cannot deny the conclusion of a deductive argument without discovering that we were in error about its premises.
>
> *Modus tollens* can help us to do this if we use it to set problems for our theories. But while inductive arguments may persuade or induce us to believe things, they cannot help us discover that we are in error about their premises.

Consequently, Mayo is similarly off the mark when she thinks science is about marking out “approximately correct” ideas (SIST, 80). By what standard? We don’t know, because Mayo didn’t bother to say what she takes ‘truth’ to mean. She also doesn’t mention that Popper had a different idea. In Notturno’s words: “The primary task of science is not to differentiate the true from the false—it is to solve scientific problems.” For Popper, scientific theories (and hypotheses, which are substantially the same thing) are about *explanation*; if there is no explanatory theory, there are no hypotheses and there is no knowledge. Mayo effectively turns all that completely on its head (EGEK, 11-2):

> I want to claim for my own account that through severely testing hypotheses we can learn about the (actual or hypothetical) future performance of experimental processes—that is, about outcomes that would occur with specified probability if certain experiments were carried out. This is *experimental knowledge*. In using this phrase, I mean to identify knowledge of experimental effects (that which would be reliably produced by carrying out an appropriate experiment)—whether or not they are part of any scientific theory.

In this way, she empties all relevant terms of any possibly helpful meaning. “Inferences” are said to be “detached” by “induction”—but that is in no way meant to even imply any application of actual logic. As Notturno remarked: “Popper used to call a guess ‘a guess’. But inductivists prefer to call a guess ‘the conclusion of an inductive argument’. This, no doubt, adds an air of authority to it.” The same is, unfortunately, true for ‘hypothesis’, “or just ‘claim’”, which Mayo “will use…for any conjecture we wish to entertain” (SIST, 9)—explicitly, as she said earlier, “whether or not they are part of any scientific theory”. If you think that usage of ‘hypothesis’ carries rather strongly “the connotation of the wantonly fanciful”, Mayo specifically rules that in; Medawar, whose phrase that is, rather optimistically thought it was the bad old days when there was no “thought that a hypothesis need do more than explain the phenomena it was expressly formulated to explain. The element of *responsibility* that goes with the formulation of a hypothesis today was altogether lacking.” (Schilpp: The Philosophy of Karl Popper, 279) With respect to science’s being grounded in theories, Mayo is working mightily to resurrect an irresponsibility that was presumed happily dead long ago. Medawar quotes Claude Bernard with a prescient passage that seems all too fitting a description for what’s wrong with today’s social sciences: “A hypothesis is…the obligatory starting point of all experimental reasoning. Without it no investigation would be possible, and one would learn nothing: one could only pile up barren observations.” (Schilpp, 288)

To Mayo, though, that isn’t worth a single word. At least she is in good (or rather: numerous) company. Anything and everything to do with ‘theory’ is a huge blind spot in current social science. Everybody seems to be focused on bad, misunderstood, and allegedly broken statistics—and boy, is there a lot of bad and misunderstood statistics around. There are precious few voices in the wilderness, among them Denny Borsboom, who bluntly states: “It is a sad but, in my view, inescapable conclusion: we don’t have much in the way of scientific theory in psychology.” This “Theoretical Amnesia”, as he calls it, is the elephant in the room of the replication crisis. It even explains why some of the methodological suggestions to stem the tide of bad science, like preregistration, spectacularly miss the point:

> And that’s why psychology is so hyper-ultra-mega empirical. We never know how our interventions will pan out, because we have no theory that says how they will pan out (incidentally, that’s also why we need preregistration: in psychology, predictions are made by individual researchers rather than by standing theory, and you can’t trust people the way you can trust theory).

Mind you, Borsboom himself thinks that some disciplines just “are low on theory”. Why that should be an inescapable fact, however, he does not say. And it goes without saying that *any* discipline we might care to think of today was at some point in its history “low on theory”. Biology and chemistry had no theory to speak of as recently as roughly 150 years ago. But it takes not just lots and lots of work (which people are obviously willing to put in) but also the good fortune to be working on problems that are ripe for yielding answers—and an idea of what a unifying, explanatory theory actually looks like, which is no different in the social sciences than in any others. As it is, even Mayo’s book-length attempt to “get beyond the statistics wars” is hardly more than a baby step. Even the “severe tester” of Mayo’s imagination remains condemned, in Bernard’s sadly apt phrase, “to wander aimlessly”.

Mayo’s launching-off point for SIST is this question: “How do humans learn about the world despite threats of error due to incomplete and variable data?” (SIST, xi) Her answer, in short, is: by severely testing our claims. This is, at least in part, based on the philosophy of Karl Popper, arguably the 20th century’s most important philosopher of science: “The term ‘severity’ is Popper’s, though he never adequately defined it.” (SIST, 9) Mayo proposes that she has actually found a previously missing adequate definition, and it is this: “If [a claim] *C* passes a test that was highly capable of finding flaws or discrepancies from *C*, and yet none or few are found, then the passing result, *x*, is evidence for *C*.”

About those statistical methods, it is to Mayo’s credit that she stresses, right out of the gate, that one of the most pernicious uses of statistical methods had been anticipated and explicitly denounced by Fisher himself as early as 1935: “Fisher…denied that an isolated statistically significant result counts” (SIST, 4), going on to quote him saying that “[i]n relation to the test of significance” we need to “know how to conduct an experiment which will rarely fail to give us a statistically significant result”. That admonishment alone should have been enough, one would have thought, to preclude basing any far-reaching conclusions on a single study’s outcome (e.g. its p-value), such as “claiming a discovery”.

Similarly, it is very welcome for Mayo to point out (SIST, 82-3) that Popperian falsification is not achieved by noting one observation that contradicts a theory but only with the help of something that Popper called a “falsifying hypothesis”. It would have been helpful, however, to actually quote the relevant passage from Popper’s *Logic of Scientific Discovery*, as Mayo (partially) did in her earlier book (EGEK, 14):

> We must clearly distinguish between falsifiability and falsification. …
>
> We say that a theory is falsified only if we have accepted basic statements which contradict it. This condition is necessary, but not sufficient; for we have seen that non-reproducible single occurrences are of no significance to science. Thus a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a *reproducible effect* which refutes the theory. In other words, we only accept the falsification if a low-level empirical hypothesis which describes such an effect is proposed and corroborated. This kind of hypothesis may be called a *falsifying hypothesis*. The requirement that the falsifying hypothesis must be empirical, and so falsifiable, only means that it must stand in a certain logical relationship to possible basic statements. … If accepted basic statements contradict a theory, then we take them as providing sufficient grounds for its falsification only if they corroborate a falsifying hypothesis at the same time.

All well and good. But then, unfortunately, Mayo’s train of argument veers dangerously off course, eventually missing its intended target completely. One would certainly have to concur with her when she says, “The disagreements often grow out of hidden assumptions about the nature of scientific inference”. In Mayo’s case, at least, the assumptions aren’t all hidden. She is brazenly upfront about championing induction, for a start. “In getting new knowledge, in ampliative or inductive reasoning, the conclusion should go beyond the premises” (SIST, 64), she claims, and: “Statistical inference goes beyond the data – by definition that makes it an *inductive* inference.” (SIST, 7-8) Now, she is perfectly aware that induction has a bit of a problem: “It’s *invalid*, as is so for any inductive argument.” (SIST, 61) But that isn’t a problem, according to Mayo, on the contrary: “We must have strictly deductively invalid args to learn” (Twitter). Indeed, Mayo has confirmed in a separate conversation that she is “not talking of a logic of induction”. What, then, is she talking about when she talks about “inductive inference”? The answer: “error probabilities”, in line with her definition of severe tests. In fact, Mayo thinks that Popper foundered because “he never made the error probability turn” (SIST, 73).

One other assumption, however, is a little removed from plain view. This is Mayo’s assumption about the *aim* of science. This remains surprisingly vague, with “craving truth” (SIST, 7), “learn[ing] about the world” (xi), “getting new knowledge” (64), and “distinguish[ing] approximately correct and incorrect interpretations of data” (80) being our only hints as to what, in Mayo’s view, it is all about. Even more surprisingly for a professional philosopher, she never defines, or otherwise talks about, the terms ‘truth’, ‘learning’, and ‘knowledge’, as if they were self-explanatory and everybody was in agreement about their meaning—which they emphatically aren’t and obviously everybody isn’t. Other crucial terms, such as ‘inference’ and ‘hypothesis’, fare only a little better, getting a casually hand-wavy one-sentence definition each.

Not quite incidentally, these terms, and the concepts behind them, are crucial to any understanding of the philosophy of science—and especially Popper’s version of it. For all her professed sympathy for Popper’s philosophy, Mayo unfortunately either misunderstands or flat-out ignores some of the most central concepts in Popper’s philosophy. These, in fact, turn out to be the keys to a solution to the current crisis in the social sciences.

(continued below)

My original post was a link to the NY Times web app where you are instructed to determine the pattern behind a series of numbers. You are allowed to repeatedly probe the algorithm by entering your own number series and the app will return whether your series follows the hidden pattern or not. When you’re finally satisfied that you have figured the rule out, you enter your final answer. It turns out that a huge percentage (~80% I think?) of those taking the test submitted their answer without once inputting a series that received a “No” answer. That is, the vast majority of participants were willing to submit an answer which they never actually challenged! They confirmed their initial guess until they got bored and then went with it.

I think this is the key to statistics, data science, machine learning, economics, physics, et al., and to the extent that Mayo’s book explores the statistical techniques that support this “vigorously beat up my model and see if it still stands” approach — i.e. the scientific method — it’s a gem. The book is a well-written survey that shows her knowledge and thinking are broad and deep. At the same time, I feel like the thesis that she might have printed out and hung on the wall as she was writing is something like: a) scientists should be rigorous, and b) frequentism, properly understood and rigorous, was right all along and Bayesians are wrong; any so-called wars were simply various misunderstandings of frequentism.

Andrew, as he’s noted in this thread, is taken seriously, but it feels like his ideas are more of a foil for completing her criticism of Bayesian inference Cat-in-the-Hat-style, with Andrew being the final Bayesian cat that refutes the prior Bayesian cat that refutes the prior Bayesian cat… but then she seems to dismiss Andrew’s ideas as either too tentative and incomplete to be useful or as a Bayesian veneer over frequentism.

So if it works – it will work for anyone who knows that it works, how and what to make of it ;-)

Everyone should be interested in which of the assumptions they are using are too wrong in this particular study. So being aware of how likely you are to become aware of what’s wrong should be helpful, as long as the assessment of that is not too uncertain or, even worse, taken as too certain. Even if it is severity all the way down, it can’t just be ignored.

To me, the assessment of that is based on identifying the least wrong reference set and what would happen repeatedly in an inexhaustible collection from that reference set.

But anyway, here’s why you’re wrong. Criterion (S-2) is pretty specific about what kind of event we ought to be looking at: tail areas. But unlike in the fixed sample size design, the notion of a tail area is ambiguous in designs with interim looks — if it weren’t, you would be able to answer my question. The adaptive design literature describes a few different ways of defining tail areas for the purpose of calculating p-values. It turns out that all of those candidate definitions of the notion of a tail area produce SEV functions that are certifiable garbage. In what sense are they garbage? To understand that, one actually has to be able to do the math, or at least follow along.

A severe tester who affirms that none of the candidate SEV functions that I call garbage are “true” measures of severity then has to either (i) admit that the *Error Statistics* paper’s arguments that depend on the existence of the SEV function (items #2, #3, #6, and possibly #4) do not apply to adaptive designs, or (ii) construct a SEV function that is worthy of the name, perhaps by recognizing and fixing criterion (S-2)’s deficiencies.

The point is: you have work to do.

To Gelman’s comment:

My book is a contribution, not to “philosophy,” but to “statistical philosophy” and, as you yourself say above, “good philosophy can facilitate both the development and acceptance of better methods”. Among the wide-ranging audience for whom my book is intended (see “Who Is the Reader of This Book”, p. xiii), my book is for statistical practitioners who want to develop the logical skills to independently scrutinize: the meaning and assumptions of their statistical inferences, how we arrived at today’s statistics wars, the arguments put forward for why one method is to be preferred to another, and the various statistical “reforms” put forward by leaders in statistics. As I say on p. 12, these tasks require “a dose of chutzpah that is out of the ordinary in professional discussions”.

But let me now see if you will answer the question I put forward to readers in my comment #32, be it regarded as a statistical or philosophical question. Here’s the main part:

#32

To all, especially those who follow Gelman’s approach.

In SIST (p. 305) I suggest that

“for Gelman, the priors/posteriors arise as an interim predictive device to draw out and test implications of a model. What is the status of the inference to the adequacy of the model? If neither probabilified nor Bayes ratioed, it can at least be well or poorly tested. In fact, he says, ‘This view corresponds closely to the error-statistics idea of Mayo (1996)’ (ibid., p. 70)”.

The error statistical account I favor (based on quantifying and controlling well-testedness) supplies a way to qualify uncertain inferences that is neither a “probabilism” (posterior, Bayes Factor or Likelihood Ratio) nor an appeal merely to a method’s (long-run) performance. The severe testing account may supply the form of warrant in sync with Gelman’s approach (a measure of how well probed).

Yes?

Comment by Gill.

Many of us have noticed the fact that the ASA doc keeps to the point nil null, despite the fact that one-sided tests are generally more apt. It’s easier to knock down the stilted significance test with a nil null–the same artificial variation on tests that we see in the 2019 article (warning us not to use the words “significant/significance”). Moreover, there is no indication of the requirements for a Fisherian test statistic (as I delineate in SIST). The document does mention that for simplicity it is leaving out a consideration of power. Nearly all the examples you see showing disagreement between the p-value and other measures (posteriors, Bayes factors) use a spike prior on the nil null. One of the more popular alternatives to p-values, J. Berger’s BFs, doesn’t work for one-sided tests–as we checked the other day when he was here. It’s too bad you weren’t in on writing the document.

Bimal Jain MD comment:

“I am a practicing physician who has found the concept of severe testing discussed by Deborah Mayo in her marvelous book SIST to be immensely helpful in understanding how statistical inference of a disease from data in a patient is done in practice.” Thank you for expressing appreciation.

To Corey

I don’t assume it is non-SEV, whatever that means. If there’s a SEV computation (or even a recognition through other means) of C’s well-testedness, then the severe tester can use it too.

I just want to say that the major position in SIST is to DENY that the frequency of errors is what mainly matters in science. I call that the performance view. The severe tester is interested in probing specific erroneous interpretations of data and in detecting flaws in purported solutions to problems.

]]>When statistics is decoupled from decision making, it just means there is a middleman in between. There will not be much statistical inquiry going on that doesn’t lead to decision making; nobody would want to fund it if it’s not actually used.

]]>Justin

http://www.statisticool.com

IMO a better alternative is to require direct compatibility. E.g., define S1: ‘compatible with H’ as the p-value under H (= Poss(H)). In this case both mu = 0 and mu = 10 fail to satisfy S1, so the ‘counterexample’ is blocked.

But like I said, Mayo appears to take rejection of H as compatibility with some member of not-H, i.e. satisfaction of S1 for some not-H, which we then consider severity for. This indirect satisfaction of S1 leaves open undesirable examples. Direct satisfaction blocks it, though.

All of which amounts to: if you interpret pvalue (H) as Poss(H) and consider it for any H of interest you’re probably fine, at least in terms of simple examples.

]]>Otherwise examples like Cox and Hinkley Example 4.5 with a discrete parameter space mu in {0,10}, sigma =1 and observation of y=3 seem to leave the approach vulnerable.

E.g., y=3 is very significant relative to mu=0. So we tentatively infer mu != 0, which by definition of the parameter space implies S1 (‘agrees with mu = 10’), and we want to assess severity.

SEV(mu=10) = 1 – pvalue (mu=0) = very high!

Which is pretty ridiculous…especially since SEV(mu=0) is even higher!

She claims that this example is ‘rigged’ or ‘artificial’ but it seems convincing to me that something more is needed. She talks about needing to ‘exhaust’ the parameter space, but I don’t see why we couldn’t be in a situation where we have two discrete possibilities.

If you instead just take Poss(H) = pvalue(H) then you get Poss(mu=0) > Poss(mu=10), but both are low. So you would report: each hypothesis has low compatibility with (possibility in light of) the data, though mu=0 is more compatible.

This is obviously not far off a likelihood analysis but arguably (by using tail probs) also allows separate rather than just relative measures of compatibility of each hypothesis with the data.
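The tail areas in this example are easy to check numerically; a quick sketch (one-sided tails, following the usage of Poss/SEV above):

```python
# Tail-area check for the discrete parameter space mu in {0, 10}, sigma = 1,
# observation y = 3 (one-sided tails, as in the comment).
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

y = 3.0
p_mu0 = 1.0 - Phi(y - 0.0)   # P(Y >= 3 | mu = 0), roughly 0.00135
p_mu10 = Phi(y - 10.0)       # P(Y <= 3 | mu = 10), astronomically small
sev_mu10 = 1.0 - p_mu0       # the "SEV(mu = 10)" objected to above

print(p_mu0, p_mu10, sev_mu10)
```

Both tail areas are low (with mu=0 the more compatible of the two), while the indirect route yields a SEV for mu=10 near 1, which is the complaint.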

]]>“I think there is a serious omission. What if the null hypothesis is composite????

This was also a mistake in the wikipedia definition.

I do consider it a serious mistake since almost all routine p-value computations do in fact involve a composite null hypothesis.”

As I wrote above, “These are far from perfect, but at least try to avoid the extremes of overly technical or overly simplified.” Getting into composite hypotheses would (sadly) be overly technical for all but a very few of the people (“people who encounter [p-values, etc.] in their work but do not have a strong statistical or mathematical background”) in the continuing education course for which the notes were used. For this audience, in the time-frame involved, the aim was to try to get them to develop a healthy skepticism of claims based on p-values, and perhaps to give them a better background to begin to learn more.

]]>Re: “It wasn’t until Frequentist statistical methods came along before entire fields believed there was different way to do things.”

By different you must mean better, since the subjective Bayes approach allows literally any prior to be used (possibly brittle ones that vary from person to person, even improper ones) to model beliefs. Not to mention, experimental design, sampling theory, and quality control revolutionized science and the world. If I search PubMed for “p <” I get many millions of hits, but “Bayesian” or “Bayes factor” yields fewer than 50,000. People are pretty practical, and I’d expect that if Bayes were “clearly right” then practice would reflect this, but it doesn’t seem to, unless I’ve missed something.

Here are a few examples of people not "agreeing with Mayo" but coming to their perspective from logic, evidence, and issues with other approaches:

-"Why I am Not a Likelihoodist", by Gandenberger (https://quod.lib.umich.edu/cgi/p/pod/dod-idx/why-i-am-not-a-likelihoodist.pdf?c=phimp;idno=3521354.0016.007;format=pdf)

-"In Praise of the Null Hypothesis Statistical Test" (https://pdfs.semanticscholar.org/37dd/76dbae63b56ad9ccc50ecc2c6f64ff244738.pdf), by Hagen

-"Will the ASA’s Efforts to Improve Statistical Practice be Successful? Some Evidence to the Contrary" (https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1497540), by Hubbard

-"The practical alternative to the p-value is the correctly used p-value" (https://psyarxiv.com/shm8v), by Lakens

-"So you banned p-values, how’s that working out for you?" (http://daniellakens.blogspot.com/2016/02/so-you-banned-p-values-hows-that.html), by Lakens

-"Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban" (https://www.tandfonline.com/doi/abs/10.1080/00031305.2018.1537892), by Ricker et al

-"On the Brittleness of Bayesian Inference" (https://epubs.siam.org/doi/pdf/10.1137/130938633), by Owhadi et al

-"Bayesianism and Causality, or, Why I am Only a Half-Bayesian" (http://ftp.cs.ucla.edu/pub/stat_ser/r284-reprint.pdf), by Pearl

-"Why Isn't Everyone a Bayesian?" (http://www2.stat.duke.edu/courses/Spring07/sta122/Readings/EfronWhyEveryone.pdf), by Efron

-"The case for frequentism in clinical trials", Whitehead

-"Legal Sufficiency of Statistical Evidence" (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3238793), by Gelbach et al

-"Bayesian Just-So Stories in Psychology and Neuroscience" (http://psy2.ucsd.edu/~mckenzie/Bowers&Davis2012PsychBull.pdf), by Bowers et al

-"Why optional stopping is a problem for Bayesians" (https://arxiv.org/pdf/1708.08278), by Heide et al

-"In defense of P values" (https://pdfs.semanticscholar.org/1751/17f8f60e422c9e78f9766f39a2812c564e46.pdf), by Murtaugh

-"A Systematic Review of Bayesian Articles in Psychology: The Last 25 Years" (http://psycnet.apa.org/fulltext/2017-24635-002.html), by van de Schoot et al

-"On Using Bayesian Methods to Address Small Sample Problems" (https://www.tandfonline.com/doi/abs/10.1080/10705511.2016.1186549?journalCode=hsem20), by McNeish

-empirical laws, for lack of a better word: the Strong Law of Large Numbers, the quincunx making a normal distribution, and the likelihood swamping priors.
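The quincunx item, for instance, can be illustrated in a few lines (the row and ball counts below are arbitrary choices):

```python
# Toy quincunx (rows and ball count are arbitrary): each ball's final bin
# is a sum of 12 fair left/right bounces, i.e. Binomial(12, 1/2), whose
# shape is already close to a normal curve.
import random

random.seed(1)
rows, balls = 12, 100_000
bins = [0] * (rows + 1)
for _ in range(balls):
    bins[sum(random.random() < 0.5 for _ in range(rows))] += 1

mean = sum(i * c for i, c in enumerate(bins)) / balls
var = sum((i - mean) ** 2 * c for i, c in enumerate(bins)) / balls
print(mean, var)  # expect roughly 6.0 and 3.0 (= rows * p * (1 - p))
```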

Cheers,

Justin

http://www.statisticool.com

It was how you wanted criterion (S-1) to map onto POS(H) that most diverged from what Mayo had written. (Also how you wanted to map to pre-data error probs à la Birnbaum.)

]]>Pretty wide,…..pretty wide,………….pretty wide !!!!!

Yes, statistics is (should be) pretty wide. Please try to move away from the comfort zone and embrace a wide perspective.

For example:

Regarding the combination of frequentist-Bayesian views, see “Sampling and Bayes’ Inference in Scientific Modeling and Robustness”, by G. Box, Journal of the Royal Statistical Society, Series A, Vol. 143, No. 4, pp. 421-422, 1980 – this is pretty wide.

Regarding how to view the role of statistics from a wide perspective, see my Hunter conference paper “Statistics: A Life Cycle View” Quality Engineering (with discussion), Vol. 27, No.1, pp. 111-129, 2015.

Regarding what should be the main thrust of statistics, generating information quality: “On Information Quality” (with G. Shmueli), Journal of the Royal Statistical Society, Series A (with discussion), Vol. 177, No. 1, pp. 3-38, 2014.

To me it seems essential that we take a pretty wide perspective….

]]>Nec(H) = 1 – Poss(Not H)

while (for S2):

SEV(H) = 1 – Pvalue(Not H)

If you take Poss(Not H) = Pvalue(Not H) then you’re good. Luckily both Pvalues and possibility values are computed by maximising over composite hypotheses (see also Richard Gill’s comment). It always bothered me a bit that Mayo didn’t explicitly include/mention a max or min operator in the SEV definition, though it is always invoked implicitly (you always consider the maximal/minimal probs over composites).

None of this is endorsed by Mayo of course, but it helps me remember how to calculate SEV since it often involves a lot of negation/double negatives…If people are interested in SEV I would recommend doing some of her problems using the logic explicitly.
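For what it’s worth, here is how I keep the double negatives straight in the simplest case (a one-sided Normal test with known sigma; the grid search is just a brute-force stand-in for the implicit sup, and the numbers are made up):

```python
# Bookkeeping sketch for a one-sided Normal test, sigma known = 1.
# Claim H: mu > m. Not-H is the composite mu <= m, and
#   pvalue(Not H) = sup over mu <= m of P(Xbar >= xbar; mu),
# attained at the boundary mu = m, so
#   SEV(H) = 1 - pvalue(Not H) = P(Xbar <= xbar; mu = m).
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sev(m, xbar, n, sigma=1.0):
    return Phi((xbar - m) * math.sqrt(n) / sigma)

def pvalue_not_H(m, xbar, n, sigma=1.0, grid=200):
    # sup over the composite mu <= m, by brute-force grid search
    return max(1.0 - Phi((xbar - mu) * math.sqrt(n) / sigma)
               for mu in [m - 5 + 5 * i / grid for i in range(grid + 1)])

xbar, n, m = 0.5, 25, 0.2
print(sev(m, xbar, n), 1.0 - pvalue_not_H(m, xbar, n))  # should agree
```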

Honestly though I mostly prefer to just do point estimation + propagated variability…

]]>I agree it’s important but not that it has to be done in the very first step.

]]>Now, as all routine p-value computations do in fact involve a composite null hypothesis (e.g., a two-group randomized trial with binary outcomes), how and when to proceed is a critical topic. You will see a lot of discussion here regarding whether strong severity itself has been sufficiently worked up into realistic routine problems and who should do that.

As for composite null hypotheses (as you likely know better than me), many prefer taking the supremum, whereas I prefer to treat the p-value as a function over the composite null hypothesis, using graphs to clearly display this (easy now given fast simulation).

(Interestingly, even some statisticians are surprised by such graphs.)
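A minimal sketch of what I mean (all counts and sizes below are made up for illustration):

```python
# Sketch: the p-value of a two-group binary-outcome trial viewed as a
# *function* over the composite null p1 = p2 = p, rather than collapsed
# to a single supremum. Sizes and observed counts are hypothetical.
from math import comb

n1 = n2 = 20
x1_obs, x2_obs = 14, 7                      # hypothetical observed successes
t_obs = abs(x1_obs / n1 - x2_obs / n2)

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pvalue(p):
    """P(|p1_hat - p2_hat| >= t_obs) under H0: p1 = p2 = p, by enumeration."""
    return sum(binom_pmf(a, n1, p) * binom_pmf(b, n2, p)
               for a in range(n1 + 1) for b in range(n2 + 1)
               if abs(a / n1 - b / n2) >= t_obs - 1e-12)

grid = [i / 50 for i in range(1, 50)]       # p in (0, 1)
curve = [pvalue(p) for p in grid]
print(max(curve))  # the single supremum many would report instead
```

Plotting `curve` against `grid` gives the graph; the supremum is just one point of it.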

This has repercussions all over the place, for instance to Wikipedia, and from there, to the whole world…

]]>It is of interest that the Bayesian method has been prescribed for statistical inference in diagnosis, on grounds of its rationality in terms of winning a bet with it, since the early 1960s. The amazing thing to me is that the inferential accuracy of this method in practice has never been studied, perhaps because, as Andrew Gelman mentions in one of his posts, this method is subjective and its accuracy cannot be studied.

It would be extremely useful, I think, if statisticians studied how statistical inference is actually performed in practice in a field like diagnosis. Then, as far as I am concerned, the method that is actually employed is the correct method, regardless of any theoretical considerations such as rationality, etc.

I think statistical inference in diagnosis in practice has not been studied so far because it is a borderline problem, falling on the border between medicine and statistics, whose respective practitioners have different backgrounds and training. But the value of such a study would be immense, as it may improve inferential accuracy in diagnosis. ]]>

I enjoyed hunting through your slides for the definition of p-value, and I enjoyed what I found.

I think there is a serious omission. What if the null hypothesis is composite????

This was also a mistake in the wikipedia definition.

I do consider it a serious mistake since almost all routine p-value computations do in fact involve a composite null hypothesis.

Richard

]]>+1 on that. I was being sloppy in my earlier writing on the topic which Mayo quoted.

]]>I should say why this argument, seemingly so ironclad, goes wrong. It assumes a false premise, to wit, that I am relying on some non-SEV computation of C’s well-testedness. In my argument C’s obvious correctness arises because we have a consistent estimator with the usual 1/sqrt(n) rate and an unbounded amount of data.

]]>Presumably Corey’s computation of C’s well-testedness can be computed by others as well. Therefore it can be computed by the severe tester as well. Thus, she can agree that C has passed severely, if indeed it has.

This is the sort of happy talk that sounds plausible but fails to actually get down to brass tacks. I will take it as an invitation to give you the first step — only the first step, not the complete argument — and see if you can come this far with me.

The setting is a toy model of an early stopping trial. The model is a normal distribution, mean μ, variance 1. We aim to test H0: μ ≤ 0 vs. H1: μ > 0. The test statistic is the observed mean. The maximum sample size is n = 2 with an interim look at n = 1. We reject the null on the 1st look and stop if the observed mean is greater than 2.0. We reject the null on the 2nd look if the observed mean is greater than 1.125. This gives the design a Type I error rate of 0.05.
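For concreteness, the overall Type I error rate of this design can be checked by direct simulation (a sketch; the seed and replication count are arbitrary choices):

```python
# Monte Carlo sketch of the design's overall Type I error at mu = 0:
# reject at look 1 if x1 > 2.0, otherwise reject at look 2 if the mean
# of the two observations exceeds 1.125.
import random

random.seed(0)
reps, rejections = 200_000, 0
for _ in range(reps):
    x1 = random.gauss(0.0, 1.0)
    if x1 > 2.0:                                        # reject at 1st look
        rejections += 1
    elif (x1 + random.gauss(0.0, 1.0)) / 2.0 > 1.125:   # reject at 2nd look
        rejections += 1

rate = rejections / reps
print(rate)  # estimated overall Type I error rate
```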

Consider four possible outcome scenarios:

1, weak. Barely reject on the 1st look, observed mean = 2.001

1, strong. Deep in the reject region on the 1st look, observed mean = 3.0

2, weak. Barely reject on the 2nd look, observed mean = 1.126

2, strong. Deep in the reject region on the 2nd look, observed mean = 1.8

We want to know what discrepancies from the null are well or poorly supported in these scenarios. It seems clear they should give rise to different SEV functions: the strong scenarios should yield evidence for larger discrepancies from the null than the corresponding weak scenarios; and the 2nd look scenarios, being based on more data, should induce steeper SEV functions than the corresponding 1st look scenarios.

From your severity criteria we know that the inference ‘μ > m’ passes a severe test if with very high probability the test would have produced a result that accords less well with ‘μ > m’ than the actual data does, if ‘μ > m’ were false or incorrect. So we know that

SEV(μ > m) = Pr( some event ; μ ≤ m)

How can we tell, in each of the above four scenarios, what is the set of results that accord less well with ‘μ > m’ than the observed mean and sample size? In the 1st look scenarios, what part (if any) of the 2nd look sample space belongs in that set, and vice versa?

In the past you have directed me to the relevant literature on the general topic of such designs, but all of that is focused on how to achieve desired Type I error rates, how to monitor them as they proceed, and how to calculate p-values and confidence intervals at the end. Nowhere is the proper construction of SEV addressed (nor could one reasonably expect it to be).

]]>