The philosopher wrote:

The big move in the statistics wars these days is to fight irreplication by making it harder to reject, and find evidence against, a null hypothesis.

Mayo is referring to, among other things, the proposal to “redefine statistical significance” as p less than 0.005. My colleagues and I do not actually like that idea, so I responded to Mayo as follows:

I don’t know what the big moves are, but my own perspective, and I think that of the three authors of the recent article being discussed, is that we should not be “rejecting” at all, that we should move beyond the idea that the purpose of statistics is to reject the null hypothesis of zero effect and zero systematic error.

I don’t want to ban speech, and I don’t think the authors of that article do, either. I’m on record that I’d like to see everything published, including Bem’s ESP ~~paper~~data and various other silly research. My problem is with the idea that rejecting the null hypothesis tells us anything useful.

Mayo replied:

I just don’t see that you can really mean to say that nothing is learned from finding low-p values, especially if it’s not an isolated case but time and again. We may know a hypothesis/model is strictly false, but we do not yet know in which way we will find violations. Otherwise we could never learn from data. As a falsificationist, you must think we find things out from discovering our theory clashes with the facts–enough even to direct a change in your model. Even though inferences are strictly fallible, we may argue from coincidence to a genuine anomaly & even to pinpointing the source of the misfit. So I’m puzzled.

I hope that “only” will be added to the statement in the editorial to the ASA collection. Doesn’t the ASA worry that the whole effort might otherwise be discredited as anti-science?

My response:

The problem with null hypothesis significance testing is that rejection of straw-man hypothesis B is used as evidence in favor of preferred alternative A. This is a disaster. See here.

Then Mayo:

I know all this. I’ve been writing about it for donkey’s years. But that’s a testing fallacy. N-P and Fisher couldn’t have been clearer. That does not mean we learn nothing from a correct use of tests. N-P tests have a statistical alternative and at most one learns, say, about a discrepancy from a hypothesized value. If a double-blind RCT clinical trial repeatedly shows a statistically significant (small p-value) increase in cancer risks among the exposed, will you deny that’s evidence?

Me:

I don’t care about the people, Neyman, Fisher, and Pearson. I care about what researchers do. They do something called NHST, and it’s a disaster, and I’m glad that Greenland and others are writing papers pointing this out.

Mayo:

We’ve been saying this for years and years. Are you saying you would no longer falsify models because some people will move from falsifying a model to their favorite alternative theory that fits the data? That’s crazy. You don’t give up on correct logic because some people use illogic. The clinical trials I’m speaking about do not commit those crimes. Would you really be willing to say that they’re all bunk because some psychology researchers do erroneous experiments and make inferences to claims where we don’t even know we’re measuring the intended phenomenon?

Ironically, by the way, the Greenland argument only weakens the possibility of finding failed replications.

Me:

I pretty much said it all here.

I don’t think clinical trials are all bunk. I think that existing methods, NHST included, can be adapted to useful purposes at times. But I think the principles underlying these methods don’t correspond to the scientific questions of interest, and I think there are lots of ways to do better.

Mayo:

And I’ve said it all many times in great detail. I say drop NHST. It was never part of any official methodology. That is no justification for endorsing official policy that denies we can learn from statistically significant effects in controlled clinical trials among other legitimate probes. Why not punish the wrong-doers rather than all of science that uses statistical falsification?

Would critics of statistical significance tests use a drug that resulted in statistically significant increased risks in patients time and again? Would they recommend it to members of their family? If the answer to these questions is “no”, then they cannot at the same time deny that anything can be learned from finding statistical significance.

Me:

In those cases where NHST works, I think other methods work better. To me, the main value of significance testing is: (a) when the test doesn’t reject, that tells you your data are too noisy to reject the null model, and so it’s good to know that, (b) in some cases as a convenient shorthand for a more thorough analysis, and (c) for finding flaws in models that we are interested in (as in chapter 6 of BDA). I would not use significance testing to evaluate a drug, or to prove that some psychological manipulation has a nonzero effect, or whatever, and those are the sorts of examples that keep coming up. In answer to your previous email, I don’t want to punish anyone, I just think statistical significance is a bad idea and I think we’d all be better off without it. In your example of a drug, the key phrase is “time and again.” No statistical significance is needed here.

Mayo:

One or two times would be enough if they were well controlled. And the ONLY reason they have meaning even if it were time and time again is because they are well controlled. I’m totally puzzled as to how you can falsify models using p-values & deny p-value reasoning.

As I discuss through my book, Statistical Inference as Severe Testing, the most important role of the severity requirement is to block claims—precisely the kinds of claims that get support under other methods be they likelihood or Bayesian.

Stop using NHST—there’s a speech ban I can agree with. In many cases the best way to evaluate a drug is via controlled trials. I think you forget that for me, since any claim must be well probed to be warranted, estimations can still be viewed as tests.

I will stop trading in biotechs if the rule to just report observed effects gets passed and the responsibility that went with claiming a genuinely statistically significant effect goes by the board. That said, it’s fun to be talking with you again.

Me:

I’m interested in falsifying real models, not straw-man nulls of zero effect. Regarding your example of the new drug: yes, it can be solved using confidence intervals, or z-scores, or estimates and standard errors, or p-values, or Bayesian methods, or just about anything, if the evidence is strong enough. I agree there are simple problems for which many methods work, including p-values when properly interpreted. But I don’t see the point of using hypothesis testing in those situations either—it seems to make much more sense to treat them as estimation problems: how effective is the drug, ideally for each person or else just estimate the average effect if you’re ok fitting that simpler model.

I can blog our exchange if you’d like.

And so I did.

Please be polite in any comments. Thank you.

**P.S. Tomorrow’s post: My math is rusty.**

“I just don’t see that you can really mean to say that nothing is learned from finding low-p values, especially if it’s not an isolated case but time and again.”

Interesting that Mayo believes that evidence should be combined across analyses and would want to combine prior information (the results of prior analyses) into her posterior beliefs about parameters after new analyses. If only there were a statistical framework built for that sort of reasoning…

But insofar as the combination of evidence is to take place in a formal framework, Mayo will insist on sampling distributions playing the key role. Probably the closest statistical method out there that fits that bill is confidence-distribution-based meta-analysis. Imagining a prior is scarcely the only or even the best way to combine background information to arrive at inferences.
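As an aside on how the two framings relate: under a normal approximation with known standard errors, fixed-effect (inverse-variance) meta-analysis and sequential conjugate Bayesian updating from a flat initial prior give identical pooled estimates. A minimal sketch with invented study results (none of these numbers come from the discussion above):

```python
# Sketch: combining estimates across studies two ways.
# Under a normal model with known standard errors, sequential Bayesian
# updating with a flat initial prior coincides with classic
# inverse-variance (fixed-effect) meta-analysis.

def combine(estimates, std_errors):
    """Inverse-variance-weighted pooled estimate and its standard error."""
    weights = [1.0 / se**2 for se in std_errors]
    total_w = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_w
    return pooled, total_w ** -0.5

def update(prior_mean, prior_se, est, se):
    """One conjugate-normal Bayesian update: prior x likelihood -> posterior."""
    w0, w1 = 1.0 / prior_se**2, 1.0 / se**2
    post_mean = (w0 * prior_mean + w1 * est) / (w0 + w1)
    return post_mean, (w0 + w1) ** -0.5

# Three hypothetical studies of the same effect:
ests, ses = [0.42, 0.35, 0.50], [0.10, 0.15, 0.12]

pooled, pooled_se = combine(ests, ses)

# Sequential updating, treating the first study as the "prior":
m, s = ests[0], ses[0]
for e, se in zip(ests[1:], ses[1:]):
    m, s = update(m, s, e, se)

assert abs(m - pooled) < 1e-9 and abs(s - pooled_se) < 1e-9
print(round(pooled, 3), round(pooled_se, 3))
```

So at least in this simple setting the disagreement is not about the arithmetic of combining evidence, only about its interpretation.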

To quote The Dude, “…that’s just, like, your opinion, man” ;)

We don’t just “imagine” priors any more than we just “imagine” likelihoods. We can also check/test the prior just like any other part of the model. Anyhow, “best” is going to be in the eye of the beholder here – ideally we would agree on a set of criteria for “best”, and then show how different methods perform in some real-world examples. I am not holding my breath for anything consistently ‘better’ than Bayes for many applications of interest to me and other scientists…

I am intrigued by what John Ioannidis refers to as the ‘Janus’ phenomenon and ‘vibration of effects’. He suggests that any research result can be finagled. I wonder to what extent we can make substantive progress in biomedical fields, given these two concepts.

The other thing is that I don’t recall Deborah being anti-NHST & P values; at least several years back when she 1st tweeted on the subject to Metrics Twitter. Plus she signed on to Lakens et al Justify Your Alpha.

That said, I am a little confused by John’s position/reasoning in recent articles about statistical significance, b/c John had routinely pointed out the problematic resort to statistical significance and p values in top journal articles.

In short, I don’t find temporal consistency in the claims made by several statisticians. Feel free to tell me I’m wrong. Whatever.

My impression is that even the leading voices/names are always learning and evolving (at least somewhat) in their positions. There’s been a lot of flux in this area – but that’s to be expected! We see the same thing with evolving attitudes towards uninformative priors in this Stan corner of the Bayesian universe. Where it gets frustrating to me is when statisticians blame researchers for being confused or implementing obviously silly methods and not seeking their advice, when it is painfully obvious that there is no univocal recommendation to be had from statistics as a discipline. For every Andrew Gelman out there, there are numerous card-carrying statisticians who are still prescribing some variant of the NHST/p-value driven framework during their consultations.

I have always taken the approach of trying to master and understand the machinery that I am using – but I won’t claim any perfection in having always attained this end, and I increasingly can’t blame other researchers for not making the investment. There simply isn’t time/energy if you didn’t do the heavy lifting in grad school…

and I should add, in turn, that the heavy lifting in grad school needs to be built on at least half-way decent quantitative foundation in applied math, otherwise you are not going to get very far without an extraordinary amount of time/energy.

I agree with this. Unfortunately I don’t think many mathematicians are able to teach applied math concepts well at an u-grad or beginning grad level.

From Carver Mead’s intro to Sanjoy Mahajan’s “Street Fighting Mathematics”:

“Most of us took mathematics courses from mathematicians — Bad Idea!

Mathematicians see mathematics as an area of study in its own right. The rest of us use mathematics as a precise language for expressing relationships among quantities in the real world, and as a tool for deriving quantitative conclusions from these relationships. For that purpose, mathematics courses, as they are taught today, are seldom helpful and are often downright destructive.”

It might be a bit overstated, but I definitely feel as someone who did a Math major that I had to learn a whole second kind of mathematics post graduation to do applied work.

PS: download the book here: https://mitpress.mit.edu/books/street-fighting-mathematics see the “Download PDF” link in the lower left.

Daniel:

Regarding Street Fighting Mathematics, see here.

Seconded! Sanjoy Mahajan’s book is awesome. I have always felt that higher math education focuses on selecting for the wrong kind of intuition – intuition for proofs – when what is actually important in applied work is exactly the kind of ‘street fighting’ intuition that Mahajan illustrates. Learning how to use the basic tools really really well and seeing connections between applications in different domains are the two keys, IMHO :)

I would speculate that while applied math may be useful, fundamentally it’s far more a matter of fluid and crystallized intelligence, b/c I have heard extremely sophisticated mathematicians engage in cognitive biases and move goalposts. Above all, it is also a matter of having no or few conflicts of interest.

Let’s be honest, applications of math have thus far not mapped all that well onto complex biological dynamics. Here I think Michael Bastland is quite insightful.

“leading voices/names are always learning and evolving… there is no univocal recommendation”

This is a really important point. In research, you need a proven *method*, not an evolving collection of ideas. The method can have tuning points (i.e., you make adjustments to the mass spec when doing isotopic analyses) – it doesn’t have to be 100% rote – but it can’t be wide open to opinion.

This may be in part why NHST is so persistent: people don’t recognize an alternative *method*.

If by “method” you mean using basic logic as extended to multi-value plausibility through Bayesian probability, then sure… if by method you mean “a set of computer programs you can run and get the answers that people won’t attack” like SAS PROC BIOEXPERIMENT then no.

The only method of doing science is to figure out how to do careful measurements, and to build models that unify features of the world into at least moderately successful predictive frameworks. There is no mechanistic “method” of model building and it *is* always going to be wide open to opinion. Only comparison to experiment filters out the good from the bad opinions.

(from Feynman)

https://www.youtube.com/watch?v=b240PGCMwV0

“The only method of doing science is to figure out how to do careful measurements, and to build models that unify features of the world into at least moderately successful predictive frameworks.”

??? Where do you get this idea that everyone builds everything independently from first principles every time they do an experiment? That never happens in science! :) There are thousands of more or less “cookbook” methods that are used all across science (and in the economy) with resounding success.

Surely today’s exoplanet seekers are using a few more or less cookbook methods. Genetics researchers are using myriad methods, some of general application, some specific to their field of research. There’s a method for determining the hypocenter of an earthquake and a method for determining its magnitude. There’s a method for U-Pb geochronology. There is a method for assessing depression. There’s a method for assessing ability.

Methods are advanced, tested, modified, retested and continually modified all the time in science.

There are lots of things that scientists do that aren’t science. Driving to work in the morning, applying for grants, brushing their teeth, and yes, hooking up oscilloscopes to lab instruments and running canned code on seismograph data and so forth.

As soon as a thing is known to have certain reliable properties, doing it repeatedly is basically just engineering. One of the major areas of engineering used in science is basically “metrology” writ-large, that is, the creation of instruments of measurement in general. Unless you’re studying the properties of measurement, simply using a method to generate a measurement isn’t by itself science, it’s an important task, but we don’t advance scientific understanding by pure measurement alone (though sometimes we need to just measure stuff for a long time until we can come up with some useful theories I admit).

I was using science to mean the application of the scientific method to understanding questions for which the answer is not already known. The scientific method is more or less what Feynman discussed.

So, sure, there are lots of engineering challenges that scientists need to address, and they need reliable methods for addressing them, so they can move beyond measurement and engineering and start building theories of how things work… that’s the special task that scientists have.

Mayo is now on record here saying NHST is a bad thing… that’s good in my opinion. But typical NHST is not the only way to use p values, and Mayo is strongly in favor of p values.

For example, here’s a way to use p values that isn’t NHST:

after observing the effect of a drug on 100 patients we develop an empirical distribution over the effect of the drug (basically a histogram or other fancier way of estimating a distribution, like a kernel density). Now, we make a modification to the drug and our biochemical knowledge suggests that this should “fix” problems with the drug that lead to low outcomes and move them into the higher outcome range. Next we give the drug to 100 patients, and estimate the distribution of outcomes, and use a p value to determine if we have sufficient evidence of a particular predicted type of shift in the distribution. We fail to reject the idea that the new data set comes from the same distribution as the old data set, and so we do a cost benefit analysis, and based on the fact that the new modified drug is more expensive to manufacture, and that we’d need to treat much more than 100 patients to gain any benefit it might have which we couldn’t measure accurately, we decide to scrap the drug and focus our efforts on developing a different drug…

Not an NHST usage, but a valid way to potentially use p values, while making decisions on the basis of utility. I’d still argue for a Bayesian analysis, but if I saw this kind of thing being done in biology, I’d be ecstatic.
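The procedure described above can be made concrete. Here is a minimal sketch, with all outcomes, cutoffs, and costs invented for illustration, using a permutation test on a tail-fraction statistic as one way to test for the predicted type of shift in the outcome distribution:

```python
# Sketch of the non-NHST p-value workflow described above: test for a
# *specific predicted* change in the outcome distribution, then make a
# cost-benefit decision rather than a bare accept/reject.
import random
random.seed(1)

# Hypothetical outcomes for 100 patients on each version of the drug:
old = [random.gauss(50, 10) for _ in range(100)]
new = [random.gauss(51, 10) for _ in range(100)]  # any true shift is small

def upper_tail_fraction(xs, cutoff=60):
    """Statistic targeting the predicted change: low outcomes "fixed",
    i.e. more probability mass above the cutoff."""
    return sum(x > cutoff for x in xs) / len(xs)

observed = upper_tail_fraction(new) - upper_tail_fraction(old)

# Permutation p-value: how often does random relabelling of patients
# produce a shift at least as large as the one observed?
combined = old + new
count = 0
for _ in range(5000):
    random.shuffle(combined)
    diff = (upper_tail_fraction(combined[:100])
            - upper_tail_fraction(combined[100:]))
    if diff >= observed:
        count += 1
p_value = count / 5000

# Decision step: a large p-value means any shift is too small to detect
# at n = 100; weigh that against the (hypothetical) extra cost.
extra_cost_per_patient = 40
if p_value > 0.05 and extra_cost_per_patient > 0:
    decision = "scrap modified drug"
else:
    decision = "run a larger trial"
print(p_value, decision)
```

The key design choice, per the description above, is that the test statistic encodes the predicted type of shift and the p-value feeds a utility calculation, rather than a rejection licensing a favored theory.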

This. The quest to automate the discovery of knowledge like the production of so many widgets coming off an assembly line, but without any cost/benefit input for defective widgets, is why (IMHO) NHST became the irredeemable disaster that it is.

OK, Daniel,

I was suggesting that Deborah has not claimed this for many years, as noted above in the main post. Rather, she had worded some proposals differently in different contexts. That was my initial impression back three years ago. Unfortunately, I didn’t save the Twitter quotes of different statisticians. In some cases, they seemed to overgeneralize the inferences contained in some articles.

I don’t see how ascertaining the utility of the p values translates into communicating risks to patients and consumers. You must deal largely with other stakeholders. Right?

The reality is that unless we can evaluate the raw data, all this is a bit abstract. From what I understand, roughly 2% of the biomedical industry shares data. It’s mostly proprietary.

Daniel, this looks like usual NHST-variety practice to me. The only differences I see are (A) that you do not mention a specific p-value threshold, and (B) that you are perhaps calculating the p-value for a null hypothesis of a non-zero difference.

But if you are not using a threshold, how do you “use a p value to determine if we have sufficient evidence of a particular predicted type of shift in the distribution”?

Also, if you are using a non-zero null hypothesis, by what procedure do you select the non-zero value?

PS, In the first sentence, I think you mean “outcomes” instead of “effects”. As in, your sentence should read “after observing the outcomes of a drug on 100 patients we develop an empirical distribution over the outcomes of the drug “. An easy typo to make, but an essential difference. No?

As I said, “I’d still argue for a Bayesian analysis”.

The right way to do this analysis would be to write down a utility over outcomes, do a Bayesian estimate of the outcome distribution under each drug, and choose the drug that overall produces better expected utility integrated over the Bayesian posterior for the outcome distribution (a distribution over distributions).

The proposed example differs from typical NHST in that

1) It uses an observed distributional shape over outcomes with a largish sample rather than a typical assumed distribution (via a CLT) over a single statistic of the outcomes (such as a mean)

2) It takes an action based on “fails to reject” but we can choose whatever threshold we want here, and then in so doing assumes that the existing known distribution of outcomes is a reasonable model for the future outcomes under the same drug… a short-cut to the full Bayesian analysis. You’d also probably choose a goodness of fit measure, so you’re looking for a predicted type of change in the histogram so to speak rather than a “default” null hypothesis on a single statistic relying on CLT or something to give you a distribution to compare to.

3) It makes the decision based on utilities, such as cost of the drug, and a valid Bayes-like assumption that if you can’t detect a difference between two samples then the size of the difference is bounded above by some amount which is evidently insufficient to justify the increased cost.

4) Presumably you might be testing say hundreds of variations on the drug, therefore being able to actually take advantage of frequency statistics on the properties of the method, if you can get one out of hundreds of drugs to have a dramatic noticeable effect, then your inability to say what was going on in many other trials doesn’t really matter. It really is about the frequencies.

Again, I don’t advocate this; I advocate making decisions with a full Bayesian decision theory solution. But nowhere in my proposal is any serious logical fallacy committed in the way standard NHST applications in, say, psych are often performed:

“we theorize that xyz should happen, we reject the null of no difference in mean values between outcomes, therefore our theory is true” etc
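The full Bayesian decision-theoretic solution mentioned above could look something like the following sketch. All response counts, costs, and the Beta(1,1) prior are illustrative assumptions, not anything from this thread:

```python
# Sketch of a Bayesian decision-theory drug comparison: posterior over
# each drug's response rate, then pick the drug with the higher expected
# utility, integrating over posterior uncertainty (Monte Carlo).
import random
random.seed(7)

# Hypothetical trial data: (responders, patients) for each drug.
data = {"old drug": (62, 100), "new drug": (68, 100)}
extra_cost = {"old drug": 0.0, "new drug": 0.05}  # cost in utility units
BENEFIT_PER_RESPONSE = 1.0

def expected_utility(responders, n, cost, draws=20000):
    """Expected utility under a Beta(1,1) prior on the response rate."""
    total = 0.0
    for _ in range(draws):
        # One posterior draw of the response rate: Beta(1 + r, 1 + n - r).
        rate = random.betavariate(1 + responders, 1 + n - responders)
        total += BENEFIT_PER_RESPONSE * rate - cost
    return total / draws

utilities = {name: expected_utility(r, n, extra_cost[name])
             for name, (r, n) in data.items()}
best = max(utilities, key=utilities.get)
print(best, {k: round(v, 3) for k, v in utilities.items()})
```

No threshold appears anywhere: the posterior uncertainty and the costs carry all the weight, which is the contrast with NHST that the comment is drawing.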

Fair enough that you prefer a Bayesian analysis. However, I still don’t understand the difference between your proposal and NHST, in that your proposal and the following example look essentially the same to me:

Drug X is used by patients. A company thinks they have made an improved version of drug X, called drug Y. So they randomise 100 patients to drug X and 100 patients to drug Y. They then compare the distribution of outcomes experienced in the two trial arms and run some kind of statistical test for the difference. They might test for differences in means, log means, distributions (KS test, for example), or whatever. The test fails to reject. Then the company thinks …ok, should we perform a larger trial? They conduct a formal or informal cost-benefit analysis, according to which running a larger trial doesn’t seem worth it. So they stop.

The only big difference between your proposal and the above example seems to be the use of randomisation, which I can’t imagine you object to. So it’s difficult for me to pinpoint where your strong objection to NHST originates.

In my own view, trial p-values should not be dichotomised by researchers; they should be reported for interpretable differences between trial arms, and they should always be accompanied by point estimates and confidence intervals. Then it should be the responsibility of regulatory agencies to decide what gets approved. Dichotomisation of p-values may be an ugly but practically important contributor to the regulatory decision, which is a sausage-making process, not a statistical ideal. One thing that generally should not contribute to the regulatory decision is cost-benefit analyses from industry, which make lying easy.

Apart from my dislike for industry cost-benefit analysis in medicine, I don’t hold my views strongly — I could be convinced to change them.

Finally, regarding your psych example, I agree that is bad practice, but it is seems very different from medical trials.

For the most part, drug industry trials are not what you’d call NHST, the key components of NHST are first assuming that a randomization process is a good model for whatever science you are doing (an unproblematic assumption if you actually use a tested random number generator) and then the action of taking a default threshold p value and making a decision to “act as if” either the null hypothesis is true or the favored hypothesis is true on the basis of the threshold.

If you use randomization, and you select a hypothesis based on a specific predicted theoretical deviation, and you base the test on the actual observed distribution of outcomes rather than a theoretically convenient distribution, and you use the calculated p value in a utility-based decision, it isn’t as good as a Bayesian analysis, but it does mitigate most of the worst aspects of NHST.

The most useful use of a p value is as a screening tool to reduce the number of things you have to spend money and effort investigating. So for example you give 100 drugs to mice and screen out all but those that give a p value smaller than 0.01, and then you do careful studies of toxicity and effectiveness on just those 6 or whatever. Implicitly this is a utility-based decision: you know the cost of doing the full analysis is high, and the average cost of screening out a random drug is low, since most drugs are unhelpful.
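The screening idea can be simulated. A sketch under invented assumptions (95 null drugs, 5 with a one-standard-deviation effect, n = 30 mice per drug, a one-sided normal-approximation test):

```python
# Sketch: p-values as a cheap screen. Test 100 candidate drugs, pass only
# those with p < 0.01 on to expensive follow-up studies. Most candidates
# are nulls, so a strict threshold keeps the follow-up budget focused.
import math
import random
random.seed(3)

def z_test_p(sample, null_mean=0.0):
    """One-sided p-value for mean > null_mean, normal approximation."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    z = (mean - null_mean) / math.sqrt(var / n)
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal prob.

# 95 null drugs (zero effect) and 5 genuinely effective ones (indices 95-99).
true_effects = [0.0] * 95 + [1.0] * 5
survivors = []
for i, effect in enumerate(true_effects):
    outcomes = [random.gauss(effect, 1.0) for _ in range(30)]
    if z_test_p(outcomes) < 0.01:
        survivors.append(i)

# With a 1-sd effect and n = 30, essentially all real drugs pass the
# screen, while on average only ~1 of the 95 nulls slips through.
print(len(survivors), survivors)
```

This is the one setting where the frequency properties really are the point: the method’s error rates over the whole batch of candidates are exactly what justify the screen.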


Your last lines describe the deranged NHST again. It’s like saying Modus Ponens is actually invalid because lots of people fallaciously regard affirming the consequent as valid (they think they’re doing Modus Ponens, but they’re doing a deranged version that has some similarities in the first premise).

I’m not anti-statistical significance or P-values, even though I reformulate them in ways that avoid classic fallacies of significance/non-significance, violated stat assumptions, and problems in linking statistical and substantive claims. I do say to drop the term “NHST”; see p. 438, Farewell Keepsake, in my book SIS (2018, CUP).

I think what Ioannidis is rejecting, and I agree with him, is throwing overboard prespecified thresholds in testing.

What is SIS?

This came out as from “Anonymous” although it was to be from Mayo. I must have pressed “submit” too readily. Hope I didn’t confuse.

It is certainly wrong that “any research result can be finagled”. The problem is that the basic research paradigm originated in the physical sciences, where genuine experiment (isolated systems closely approximating the studied one, examined under highly controlled conditions) is possible and regularly undertaken, and is accompanied by robust theoretical models/frameworks with well-delineated regimes of correctness. That paradigm has been exported to contexts (economics, social sciences, etc.) where the conditions that make it viable – genuine experiment and robust models – are much less clearly in place.

In such contexts, where “experiment” means something very unlike what a physicist means by experiment, and the models are not at all robust, nor their regimes of applicability well delineated, it is indeed possible to “finagle” a result by tweaking this or that aspect of the experiment or the model.

In the physical sciences it also helps that all interesting results are immediately examined for their applicability to finding energy resources, building weapons, and controlling society, and this pragmatic aspect generates an external check that is often lacking elsewhere (even in pharmaceuticals – drug pushers don’t care whether the drugs work, they care whether they can sell them – the government cares that its missiles will reach the Korean peninsula).

Whenever faced with a personal medical decision I have found reading about clinical trial results to be almost completely worthless. A couple case studies would be far more informative, which is why more and more people are turning to internet forums, etc for medical advice.

I thought that Gerd Gigerenzer’s new book ‘Risk Savvy’ was quite useful. With some revisions of a few chapters, it would be even better for consumers of statistics. So much of the information provided to patients by their own doctors is not particularly useful either b/c, in part, doctors may not understand, or may misunderstand, the utility of the trial results and tests. They are unable to handle basic and sophisticated questions put to them.

But the underlying concern is the quality of the data from which results are drawn.

Even assuming it is accurate, the type of information the doctors can find in the literature isn’t suitable to answer many questions patients have to begin with. You can’t blame doctors for this.

Hi Anoneuoid,

I would disagree with you in this sense: Doctors are daily confronted by the treatment effects. So that is one source of information. Also I understand that the Cochrane Initiative is a reputable source of information. Moreover, several doctors have written accounts of patient experiences with cancer treatments.

I am not blaming doctors per se. However, doctors go through extensive education, pulled from the ranks of the most ‘analytically’ talented undergraduates. If that is the case, then their education contains some analytical gaps, the parameters of which are hard to ascertain.

Doctors are often the 1st line of information. As you say, the literature may not be suitable. Why would it not occur to them that it might not be suitable?

Now here Gerd Gigerenzer gets a little fuzzy. He points out that doctors don’t have access to journal articles. So that leaves their CME courses, which are funded by pharmaceutical companies.

Eh, I was pre-med and later took a lot of medical school classes with all the future doctors. I was probably in the top 1% in asking questions and being “analytical” in every single class like that. Medical training is mostly just memorization. Later I discovered much of what I had memorized was based on some very questionable evidence to say the least.

I suspect this is closer to the truth. Med school students of course have some high performing “analytic” thinkers, but if you were going to go by say median ability to synthesize valid ideas from raw information, I’d suspect that Math, Physics, all types of Engineering, Chemistry, Biology, Geology, Ecology, Econ, and Philosophy all outperform the pool of “pre-med students”. The principal skill that pre-med/med students seem to have is to remember large quantities of standardized stylized facts and regurgitate them on standardized tests.

In many ways medical school is like the very first application of machine learning: feed a lot of case studies and training information into a biological neural network, and then have it learn to recognize the patterns we told it about.

It’s actually not a terrible strategy, but it *does* become terrible when the training material is polluted with a lot of noise and we over-fit.

My impression (from undergraduate advising) is that math majors applying to med school have higher acceptance rates than students from most other majors.

Art and foreign-language majors had the highest acceptance rates at my medical school — they had diverse backgrounds (which medical schools want) and showed dedication to take courses very different from their major in order to satisfy the required topics for the MCAT test.

Mathematical and statistical skills/talent/understanding are NOT required and NOT taught by any medical school that I know of (I’m a practicing doctor who teaches residents).

From one of the founders, a couple of years ago, regarding the Cochrane Initiative NOT being a reputable source of information: “they have to make do with what they have”.

Currently that’s mostly selectively reported published reports that got past the journal editors/reviewers.

There is simply no practical way to get better information than what is actually in the published reports.

Sad, but probably true.

Greetings Keith,

Which published reports do you recommend?

Sadly, mainly just those done by people I have worked with and trust – though maybe others too, if one can get access to the raw data.

Now if clinical reports submitted to regulators are made public (with no hint of their being selectively made available), then I would trust those the most. There is some hope that this will be the future.

Now if published papers were randomly audited for accurately reflecting what was done and observed, we would have an error rate that might allow us to trust them to a certain degree.

Sameera said, “Doctors are daily confronted by the treatment effects.”

I’m skeptical — my impression is that not many doctors observe treatment effects (both positive and negative) very carefully and consistently — let alone share their observations so that patients and other physicians can use that information in making decisions regarding treatment.

Martha,

Having lived in Boston, I did come across some who were very astute. I speculate that many do not understand, or misunderstand, the utility of tests and technologies. But that is also the case with specialists in many disciplines.

This is all why patients and consumers of statistics have to become more attuned to these controversies in statistics and medical treatments.

Sameera:

Yes, the inferential & data analysis tools are needed to reach claims, assess evidence, find things out, and highlight claims that are poorly tested (at least, thus far). While consumers need to make individual decisions, doing so is enabled only insofar as the previous tasks are performed reliably.

Greetings Deborah,

I hear you. I disagree that consumers have to depend entirely on what expertise holds. Who is to determine which tasks are being performed reliably? As it is, consumers/patients are already sharing their experiences with each other on Facebook and Twitter. So stakeholders may be paying attention.

Sameera: I didn’t say entirely, it’s a necessary but not sufficient component.

Hi Deborah,

I think this presentation by John Ioannidis offers an interesting case study and insights into why consumers/patients are concentrating on the qualitative side/evidence. When experts explain their research results in plain English [without jargon], it is feasible that consumers/patients can also add value to research, beyond the trials.

https://www.youtube.com/watch?v=OBzcvsBtS34

In clinical research, replace “time and again” by “study site and again”. I remember a study we published in the Lancet which led to a recommendation for treatment based on p-values. When others tried to replicate it with a larger patient group, the recommendation could no longer be maintained. Digging into the data (bad, post-hoc….) we found that the patient subgroups in the two hospitals were selected slightly differently, even though the inclusion criteria etc. were the same.

A clearer case: two groups participated in a study where bone material was implanted. Both groups were highly skilled, and it was planned to pool the two studies. I found out that the bone growth in one group was much better than in the other. Luckily, both group leaders were professionals and looked carefully at the details. Before implantation, the bone material had to be ground in a mortar, and the person in the “bad” group did this much more thoroughly; the other left some bits. So the recommendation ended up being about how to handle the pestle rather than the patient.

The critical case in this example is: the leader in the “bad” group was willing to investigate. In most other cases I remember, the statistician was “ordered” to clean up the mess.

Exploring the data can never be bad. This is why NHST is even worse than pseudoscience… it is bizarro science. Everything it encourages is the exact opposite of what people should be doing.

Here is a great example of people using statistics to test *their* hypothesis instead of a strawman hypothesis. You can see that when deviations from the predictions of *their* model are observed, the scientists are motivated to include every single source of uncertainty they can think of until it “goes away”: https://arxiv.org/abs/1204.2507

This is the exact opposite of what NHST encourages people to do.

Collecting data is [in most cases] expensive.

Assuming we can agree on that, it makes sense to squeeze as much out of the data as we can. We ALL do this. We SHOULD do this (and Dieter’s example is a good one.)

The problem isn’t “digging into the data (bad, post-hoc…)”. The problem is not being honest about this, and presenting the garden path as the only one investigated. Lack of journal space is no longer a valid reason to omit the other paths investigated, since extra material can be linked to outside the article itself.

The problem is that NHST discourages you from being honest about it. Good scientific practice is sinful according to NHST. That is Meehl’s famous “paradox”:

Paul E. Meehl, “Theory-Testing in Psychology and Physics: A Methodological Paradox,” Philosophy of Science 34, no. 2 (Jun., 1967): 103-115. https://doi.org/10.1086/288135

Thanks for posting this. I had not heard of Meehl’s paradox, but it precisely agrees with what I often think as a physicist when reading about the statistics wars.

The most common beginning phrase in the title of a high energy physics paper is “Search for”, and the paper almost always reports a null result. We don’t claim a discovery unless the data are very inconsistent with the null hypothesis. One reason we want high “statistical significance” (e.g. the famous “5 sigma” discovery threshold) is so we can report on all the ways we have sliced and diced the data to check for inconsistencies that might reveal systematic errors. The problem with “digging into the data” is when the default assumption is that any inconsistencies with the null hypothesis are “discoveries”, not evidence for systematic errors.

Meehl was talking about when the theory makes a precise prediction, and you set that as the “null hypothesis”. I don’t think particle physics is always like this either.

Take the Higgs boson data. What theory was the background model derived from?

My understanding is that in a universe described by the standard model without a Higgs, the probability of certain events occurring would be greater than 1. So a new particle was theorized to “fix” this apparent contradiction in the math.

But no particular mass was predicted for this particle, so they instead tested for what we would expect to see in a universe where overunity probabilities are possible, and when a deviation from that was detected, they assumed it must be the Higgs. This is more like Meehl’s idea of “psychology” than “physics”.

Now, as I understand it there were some theoretically predicted properties of the Higgs, just not the mass. And whatever it is they detected had those precise properties. If that’s correct, then that is why we should trust the Higgs version of the standard model. The 5 sigma deviation from a universe with overunity probabilities is irrelevant.

I see from this table that various alternative properties have been rejected at p = 0.05, p = 0.001, etc. Other predicted results were within 15% of what was observed:

https://en.m.wikipedia.org/wiki/Higgs_boson#Current_status

So the evidence for the Higgs boson is far weaker than the evidence against the Higgsless standard model that no one believed in.

Historically, this is how NHST “infects” a field. First testing a strawman model is done in addition to science (testing your model), but slowly it takes up more and more space and importance due to the possibility of “sexy” (but wrong) claims like “Higgs boson exists with 5-sigma confidence!”

Saw the same thing with LIGO.

Unitarity violation (probabilities greater than one) would not happen until the TeV scale, so at 125 GeV (the observed Higgs mass) using a Higgsless Standard Model was not a strawman. That is, there were lots and lots of alternative models that solved the unitarity problem but at 125 GeV were indistinguishable from the Standard Model without a Higgs.

No physicist I know of ever said the “Higgs boson exists with 5-sigma confidence!”. The ATLAS and CMS discovery papers reported “An excess of events is observed above the expected background, with a local significance of 5.0 standard deviations, at a mass near 125 GeV, signalling the production of a new particle” with properties that are “compatible with the production and decay of the Standard Model Higgs boson”. There exist lots of alternative models that are compatible with the observed Higgs properties but predict additional new physics (e.g. more new particles or deviations in the observed Higgs properties) that high energy physicists continue to test for.

Over the years, more and more measurements have been made of the properties of this new boson and they are all depressingly consistent with the standard model Higgs. It is true that except for the spin-parity, which is established with high confidence, most individual properties have only been tested at the 10-20% level, but there are a lot of properties measured and they are all consistent with the Higgs. Since it looks like a Higgs, walks like a Higgs, and quacks like a Higgs, we have decided to call it a Higgs, even if it eventually turns out to be not exactly the Standard Model Higgs.

It is also important to be aware of the bias of physicists when they make these tests. The existence of a Standard Model Higgs means that it is theoretically possible that there are no new fundamental particles/forces above the Higgs mass (125 GeV) all the way up to the Planck Scale (1e19 GeV!). If there is no new physics, the future of accelerator based high energy physics is very uncertain. We don’t want the properties of the observed 125 GeV boson to be consistent with the Standard Model Higgs. We want disagreement! Given this strong bias, it should be impressive that we haven’t yet found any.

How is it not a strawman? If you take a group of assumptions and derive the consequences to find two things:

1) The background expected from the LHC experiments

2) That probabilities can be greater than one

Then you must accept #2 if you want to believe #1 is accurate. If that is not what was done I’d appreciate learning about it. What exact assumptions were used to derive the background model, and do those assumptions also lead to problems like unitarity violations?

Also, do you have a link to any papers from the 1960s or earlier discussing this “unitarity violation”?

From ATLAS:

https://atlas.cern/updates/press-statement/latest-results-atlas-higgs-search

The opposite of “a universe without a Higgs” is “a universe with the Higgs”. Of course, this statement neglects to mention non-standard model possibilities.

zbicyclist said, “The problem isn’t “digging into the data (bad, post-hoc…)”. The problem is not being honest about this, and presenting the garden path as the only one investigated. Lack of journal space is no longer a valid reason to omit the other paths investigated, since extra material can be linked to outside the article itself.”

+1

“Here is a great example of people using statistics to test their hypothesis instead of a strawman hypothesis. You can see that when deviations from the predictions of their model are observed the scientists are motivated to include every single source of uncertainty they can think of until it “goes away””

…this is a great point, but it focuses on the MODEL. If we could get all students etc. to focus on, and be clear about, the MODELS they are considering, and whether or not they make sense for the questions they are asking, I think a lot of the controversies might die down a bit.

E.g., *if* you are in a literal randomized controlled experiment with a single experimental variable changed and one thing being measured, *then* the classic “null model” of “no difference” might make sense, because the only alternative is “some difference”.

But in many other contexts where data are basically convenience samples and the study is basically exploratory, or the researcher is trying to model a complex natural process and infer things about the process, then we have to think hard about “all models are wrong” and “we are just looking for a good tradeoff between model complexity and model fit” etc.

My “if I were emperor for a day” recommendations are something like:

* Delete the term “null hypothesis”, always use “null model”

* Always highlight that there are usually a great many alternative models, except for very simple, careful controlled experiments

* Teach models, likelihood, likelihood maximization, likelihood ratios, and the LRT, and introduce the basics of Bayesianism, Likelihoodism, & Frequentism, before doing least-squares and p-values

All IMHO…

If you’re naming the fallacious animal, maybe, but if you’re talking significance tests, you’re wrong. They even use significance tests here. Only a fallacious off-the-wall variation of significance tests approves of testing a point null and moving from a small p-value to one of a great many possible “explanations”. Non-fallaciously used, they are ideal tools to pinpoint blame for anomalies, because they are piecemeal. When you “test” priors combined with models and hypotheses all at once, Duhemian problems (of which components to lay the blame on) loom large.

Deborah:

Every time you call null hypothesis testing an “animal,” I’m gonna say: no, NHST is not an animal, it’s a statistical method that thousands of researchers use every day. It’s a method with real problems, and these problems have malign consequences; see for example section 2 of this article.

The main problem with null hypothesis testing is that the hypothesis you are rejecting (the effect is exactly 0.0000000…) is infinitesimally small. You’ve rejected basically nothing of your hypothesis space. Practically, this test will pick up the tiniest bias in your experimental procedure and will always strongly reject the null if you have enough data.

Your null should, at the very least, have a width. You should always be rejecting “effect is greater than some margin”, where you have to argue that the margin is greater than any bias you might expect in your experiment. There are always at least tiny biases.
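As a rough numerical illustration of this point (all numbers hypothetical): with a point null, even a negligible systematic bias drives the p-value to zero as the sample grows, whereas testing a minimum-effect null of the kind suggested here instead lets the data rule out effects larger than the margin. A minimal z-test sketch, evaluated at the expected sample mean so only the systematic behavior shows:

```python
import math

def p_value_point_null(bias, sigma, n):
    """Two-sided z-test p-value for H0: effect = 0, evaluated at the
    expected sample mean when the true effect is `bias` (no sampling
    noise, so this isolates the systematic behavior)."""
    z = bias / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * P(Z >= |z|)

def p_value_min_effect(bias, margin, sigma, n):
    """One-sided p-value for H0: effect >= margin, again evaluated
    at the expected sample mean `bias`."""
    z = (margin - bias) / (sigma / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)

# hypothetical numbers: a negligible bias of 0.01 sd, margin of 0.1 sd
for n in (100, 10_000, 1_000_000):
    print(n,
          p_value_point_null(bias=0.01, sigma=1.0, n=n),
          p_value_min_effect(bias=0.01, margin=0.1, sigma=1.0, n=n))
```

The point-null p-value shrinks toward zero as n grows even though the bias is practically meaningless; the minimum-effect test instead accumulates evidence that the effect is smaller than the margin.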

The thing is that to choose this width requires a choice of a utility, and as soon as you involve utility you may as well do a Bayesian decision theory calculation, especially since it’s been proven that the class of Bayesian solutions dominates any other solution class (Wald’s theorem).

So it comes back to this: p-value testing is not usually the right way to think about these things.

Where p-values have the most utility is where Gelman has advocated them: can you detect a meaningful misfit in your model? If you fail to reject a misfit, then for the moment you might as well stop tweaking your model; and if you do detect a misfit, then bring in the next substantive component and see if the model fits better.

So should we infer that the p-value is the best tool for detecting a meaningful misfit? Most utility as compared with what?

I ask because the p-value is one of the most misinterpreted terms to begin with. I have read so many variations of the definition.

> should we infer that the p-value is the best tool for detecting a meaningful misfit?

There is no “the p-value”, since we have many, many tests we could perform and there is a p-value associated with each test. Is it the “best” tool for detecting misfit? Maybe not, I’m not sure, but it is a good tool in this context. The bigger question than whether to use a p-value to detect misfit is “what test has utility in a given context?”

For example should we test whether the mean of a sample is different from the assumed mean? or should we test whether the total probability in the upper tail past the value 27.2 is increased compared to the predictive distribution for the model?

When I say “most utility” I mean compared to the other uses the p-value is often put to. So, for example, testing the goodness of fit of a Bayesian model using a p-value is a better use than testing a null hypothesis of zero difference in means via a t-test and then, when we reject it, immediately assuming that the true difference in means is close to the observed difference in the sample.
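A minimal sketch of the kind of tail-area model check described above. The data and the plug-in normal model are entirely hypothetical (a plug-in fit stands in for a full posterior predictive check); the 27.2 cutoff is taken from the comment:

```python
import random
import statistics

random.seed(1)

# hypothetical data: a normal bulk plus a few large values in the tail
data = [random.gauss(20, 3) for _ in range(200)] + [28.5, 29.1, 30.2]

# fit a simple normal model by plugging in the sample mean and sd
mu, sd = statistics.mean(data), statistics.stdev(data)

# test statistic: how many observations fall past the cutoff 27.2
cutoff = 27.2
observed_tail = sum(x > cutoff for x in data)

# simulate replicated datasets from the fitted model and count how
# often they produce at least as many tail observations as we saw
n_rep = 2000
exceed = sum(
    sum(random.gauss(mu, sd) > cutoff for _ in data) >= observed_tail
    for _ in range(n_rep)
)
p_tail = exceed / n_rep
print("observed tail count:", observed_tail, "tail-check p-value:", p_tail)
```

A small p_tail here would flag a misfit in the upper tail of the model, which is exactly the kind of diagnostic use of a p-value being advocated, as opposed to rejecting a zero-difference null and reading the rejection as support for a favored alternative.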

I meant ‘a’ p-value.

Then you are in agreement with lowering the threshold as suggested in Benjamin et al?

I suppose if we lowered it to 1e-301 then everyone would give up using it, and that’d be a generally good thing 😀

but no, lowering the p-value threshold while letting people continue to make logical errors like straw man NHST would be unhelpful; it would simply increase the cost of doing bad science, but since the cost is borne by taxpayers and not scientists, there would be no reduction in supply…

I feel Mayo gets the better of Gelman in the published exchange, but that is mainly because Gelman puts himself in a really weak position by trying to make a very general (but very dubious) point, namely that P values are in general not of great use.

His observation about testing a point null hypothesis that you know must be wrong is a valid one, but you can easily get around that in, for example, testing that a relative risk is equal to one in a clinical trial, such that P values retain their utility (which is distinct from their possible connection with CIs).

Gelman though seems to be thinking about using P values in model checking, which requires a different debate altogether.

Mayo’s rejection of NHST is interesting given that she appears to be a big fan of Neyman and his uniformly most powerful tests etc. Isn’t this contradictory?

I think without explicit examples many discussions fray or are aborted. There is a good deal of incomplete theorization in many fora, which can be frustrating.

Look at what she wrote: “I say drop NHST. It was never part of any official methodology.” So what’s being dropped? I guess it depends on what falls under “official methodology”. For her, Neyman-Pearson hypothesis testing (and its logic as expounded by Neyman and Pearson — accept no substitutes!) qualifies, as does Fisherian significance testing (again, as expounded by Fisher).

I’m pretty sure when she says “drop NHST” she takes NHST to mean just the case where “rejection of straw-man hypothesis B is used as evidence in favor of preferred alternative A” — she remarks that this is “a testing fallacy. N-P and Fisher couldn’t have been clearer.”

Neyman definitely partakes in NHST when time comes to apply his methods:

https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1117823

Fisher seems not to ever do it though.

Fisher being a working scientist was wise enough in practice to fudge his principles with a scientist’s judgement. Neyman being a failed mathematician had no such BS meter.

Mayo never made the “Neyman mistake” of trying statistical inference in public, either with her own or anyone elses methods.

But philosophers didn’t always operate this way. When Descartes wrote his “Discourse on Method” he illustrated it with three appendixes: one inventing analytic geometry, one discovering the law of refraction, and one giving the correct geometric explanation of rainbows. All three were huge advances.

Perhaps Mayo with her deep new insights into statistical inference could regale us with examples, of her own doing, applying her own ideas, to real data and real problems?

Just one maybe? In four decades of research on this topic, just one?

Anon:

You can disagree with Mayo; that’s fine. But to criticize her for something she doesn’t do, that’s not fair. I don’t know if Bill James ever stole a base in his life; that didn’t stop him from making contributions to baseball.

But what were his contributions to the “correct” methodology of stealing bases?

Andrew,

I wouldn’t dream of criticizing a guru on statistical methodology who hasn’t made a real statistical inference in four decades of lecturing on how it’s done.

I just think it would be extraordinarily illuminating, both for Mayo and everyone else, if she did.

Doesn’t look too good for old Neyman there.

I’d bet good money Mayo would make the same “NHST animal” mistake if faced with real data and a real question to answer from it. Entire academic communities don’t make “well known howlers” for 70 years because they were “taught wrong” or whatever excuse Frequentists trot out. Obviously they keep doing it because something deep in Frequentist foundations makes them think it’s right.

Mayo can ignore this by the convenient habit of never making a real statistical inference.

As blunders go, assuming your method works because it agrees with your philosophy is the Statistics version of “start a land war in Asia”. Yet Mayo, who was never once tested by reality, has no other criteria for judging her ideas.

At this point Andrew will jump in with “lots of professional sports coaches never played the sport they coach!” blah blah blah or some such excuse like that.

Corey:

I was referring to “NHST”, the abusive animal characterized and criticized by people for allowing moves from statistical to substantive, and allegedly giving bright line true false results. Or worse, claims that the null or alternative are “proved”. If anyone says it refers to actual significance tests as put forward either in the N-P or F form, then the criticisms don’t hold. If the criticisms hold, then people are referring to the abusive NHST animal. The latter reading is the one I’m giving because it is now so tightly ingrained in official statements.

Deborah:

Every time you call null hypothesis testing an “animal,” I’m gonna say: no, NHST is not an animal, it’s a statistical method that thousands of researchers use every day. It’s a method with real problems, and these problems have malign consequences; see for example section 2 of this article.

Lakens: No, you misunderstand, or rather I wasn’t sufficiently clear, as I think this was an informal exchange on blog comments or maybe email (I only just noticed it). I was rejecting the fallacious animal that has typically been associated with NHST wherein one moves from rejecting a 0-effect null to a substantive theory T. Insofar as T has been inadequately probed by dint of the statistical tests–even if we grant it is sound and not invalidated by biasing selection effects–such moves are blocked by any kosher use of significance tests, and certainly by a severe tester. Fisher, N-P and everyone else always emphasized this, correlation is not cause, etc. etc. As such, it’s a mistake to reject error statistical tests by insisting it equals the straw man NHST view of tests.

Deborah:

You write, “the fallacious animal that has typically been associated with NHST wherein one moves from rejecting a 0-effect null to a substantive theory . . . the straw man . . .”

But it’s not an animal and it’s not a straw man. NHST is a real statistical method that is used all the time. It’s the basis of huge amounts of science. You can argue that in practice NHST is not so bad, or you can argue that alternatives such as hierarchical Bayesian modeling would be worse, or you could argue that NHST is a reasonable option given resource constraints, or you could argue all sorts of things. But NHST is not a straw man. It’s what many, maybe most, researchers do, and conclusions supported by NHST influence a lot of policy recommendations. Yes, I think Cass Sunstein, James Heckman, etc., are influential.

Naked Stat: I must qualify that misimpression (this was an informal exchange someplace). I was rejecting the fallacious animal to which “NHST” generally refers. See the Farewell Keepsake from SIST. https://errorstatistics.files.wordpress.com/2019/04/souvenir-z-farewell-keepsake-2.pdf

Deborah:

Every time you say “fallacious animal,” I’m gonna say: no, NHST is not an animal, it’s a statistical method that thousands of researchers use every day. It’s a method with real problems, and these problems have malign consequences; see for example section 2 of this article.

Andrew: It’s very simple you can’t have it both ways. If you say it refers to actual significance tests as put forward either in the N-P or F form, then your criticisms don’t hold. If your criticisms hold, then you’re referring to the abusive NHST animal. The latter reading is the one I’m giving.

If you give a criticism that holds for actual significance testing, or error statistical hypothesis testing (N-P or F) we can consider it.

I myself reformulate and extend the tests, but that doesn’t mean the tests license the absurd criticisms raised against them.

The same man who formulated N-P tests also formulated confidence intervals, in the same years, around 1930.

Deborah:

To clarify: I’m not talking about what you call the “animal.” I’m talking about a statistical method by which a researcher gathers data, computes a p-value or Bayes factor or some other data-based summary statistic, and uses that to declare that an effect is real or not, or that a hypothesis is true or not, or some other binary decision. I’m also talking about variants of this method, such as a tripartite rule under which a researcher declares that an effect is definitely true if p is less than 0.01, that an effect is small or intermittent if p is between 0.01 and 0.1, and that an effect is zero if p is greater than 0.1.

Again, it’s not an animal, it’s a statistical method, it’s a commonly used statistical method (or, to be precise, a class of statistical methods), and it has major problems.

Sorry for jumping in so late, but there are so many comments here from so many people, I’m trying to distill them into the basic positions:

1) Hypothesis testing via something like p-values is a reasonable way to challenge a model and drive theory development, even though it is not typically how scientists use this procedure.

2) The frequent misuse of the procedure outweighs its potential utility so it would be best to drop it entirely.

First, does this seem a reasonable description of the positions under discussion?

Second, on the basis of various other comment threads, it seems like there are two kinds of statistical applications being discussed (and people are often vague about which they mean):

A) How to challenge scientific theories with the aim of developing them further.

B) How to guide policy decisions.

While these are not incompatible, they might well be served by different procedures in different ways. For example, with regard to A, model comparison is very natural in a Bayesian framework, but model checking is a type of hypothesis test. Application B seems to me to be essentially a kind of Bayesian decision problem (as Lakeland points out), but maybe this is just lack of imagination on my part; for example, hypothesis testing could indicate whether enough evidence was available to make a decision *at all*.

Anyway, just putting this out here to try and better understand the points of contention and how we can find a path toward reconciliation (in other words, why can’t everyone just get along?).

From the perspective of Bayesian Decision Theory “do nothing and wait for more data” is just another decision. With this “closure” of the decision space, there is no way to “not make a decision” ;-)

Indeed, this is how I would try to formulate a decision problem as well–have a model for the “default” or “status quo” as part of the choice set. But I could imagine a two-stage process where one first has to pass a “low bar” before even getting to that point, and this could be formulated in terms of error-detection in the manner of a hypothesis test (e.g., is my current state still doing “good enough”?).

In fact, I suspect most scientists, Bayesian or not (including myself), implicitly follow such a two-stage process in analyzing data. An initial determination of whether or not the data contain “interesting” information at all followed by a more involved model comparison/development process.

That first stage is rarely formalized at all, but it seems like that would be a good role for hypothesis testing. E.g., what is the (marginal) likelihood of the data given my best current understanding of the system (which could itself be expressed as a Bayesian prior distribution over models/parameters)? This is, of course, the “surprise index” (well, if you take the negative log anyway) from info theory, so we could say that data are “interesting” if they are “surprising” given our current best understanding. And we only update our beliefs if the data pass our surprise threshold.
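A minimal sketch of this surprise-index idea, assuming a normal predictive distribution as the stand-in for “current best understanding” and an arbitrary, hypothetical threshold:

```python
import math

def surprise(x, mu, sigma):
    """Surprise index in nats: -log of the predictive density of x
    under a normal 'current understanding' N(mu, sigma^2)."""
    log_density = (-0.5 * math.log(2 * math.pi * sigma ** 2)
                   - (x - mu) ** 2 / (2 * sigma ** 2))
    return -log_density

# an observation near the predictive mean is unsurprising;
# one far out in the tail is surprising
low = surprise(0.1, 0.0, 1.0)
high = surprise(5.0, 0.0, 1.0)

THRESHOLD = 5.0  # hypothetical cutoff: only proceed to model updating
                 # if the data clear this surprise level
print(low, high, high > THRESHOLD)
```

Under these assumptions, the near-mean observation scores about 0.9 nats while the tail observation scores over 13, so only the latter would trigger the second, more involved model-comparison stage.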

I’m not saying this is the best way to go, just that it is not wholly unreasonable and, I suspect, what most scientists actually do in practice. Problems arise in how to determine what is surprising and how to update beliefs, whether or not these are integrated into a single process or done separately. E.g., violating a straw-man null is *not* interesting, because that kind of null doesn’t actually reflect our best current understanding.

As long as you set your hypothesis as the null hypothesis, sure. That is not what people do though.

The misuse is testing a hypothesis not predicted by your theory (exactly zero difference between groups, etc). Mayo never provides real life examples of what she would recommend afaict, but I am fairly certain that if she did it would qualify as a misuse.

Naked Stat: By the way, Neyman would never do a test with a single hypothesis. NHSTs have single nulls. Neyman developed the “test hypothesis” as composite, and it was on par with the alternative. Together they exhausted the parameter space.

Neyman is first author on a paper that does that (and multiple other much worse “statistical sins”) here:

https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1117823

I think Andrew puts his finger on why he (a political scientist) and Mayo (a philosopher) will probably always talk past each other. In philosophy it’s customary to orient your discussion by invoking The Greats. If you’re a Stoic, you begin, “Epictetus wrote …,” and that is presumed to be a good starting point. By contrast, the British economist Joan Robinson wrote a famous essay on why she was not a Marxist or a Keynesian economist, she just “did economics” without starting from what the greats wrote.

Biology is highly complex and hidden. Capturing some or all of its dynamics is a matter of chance, luck, and opportunity. Some will simply be better diagnosticians, for whatever reason.

Appeals to the greats amount to a kind of authoritarian argumentation: it is true because so-and-so said it. When examining rhetoric, most scientists, engineers, etc. perceive invocation of authority as indicative of an absence of substantive arguments. Of course it is not always so, as some speakers simply suffer a psychological need to provide external certification of their internally well-reasoned positions, but the invocation of authority generates skepticism in those inclined to skepticism. Why do they need to invoke authorities if they have good arguments? Others invoke authorities simply as a way of setting the table: to start the discussion, why look for one’s own way of saying something when it was already well said? These are purely psychological claims, but I don’t think they have anything to do with one person being a philosopher and the other a political scientist. They have to do with initial intellectual orientation, and the approach to identifying and solving problems.

Science understood broadly enough to include statistics requires a sort of anti-authoritarian thinking. One needs to constantly identify and question the premises of whatever one is arguing, because often the errors and confusions occur at that level.

I think the exchanges immediately above here in comments, in which Andrew responded at 5:22 pm and 6:01 pm on September 11, are more supportive of my diagnosis than yours, but lots of readers here are qualified to judge for themselves.

The times shown are displayed in local time, so only people in your time zone know what you’re talking about. Can you link directly to the comments? Right-click the date under the person’s name and copy the link.

Oh, sorry, I didn’t know that. I mean when Andrew responds to “fallacious animal.” IMO Mayo says NHST is “fallacious” because it is not endorsed by the great texts, while Andrew means something different.

Actually it looks like I’m wrong. Apparently it’s in Eastern US time. In any case, the best way to refer to a comment is to link directly to the comment, the blog gets a bit jumbled as stuff gets inserted all over the place… what seems to be “directly above” at one point in time becomes separated by an arbitrary number of comments.

I think you’re talking about this exchange:

https://statmodeling.stat.columbia.edu/2019/09/11/exchange-with-deborah-mayo-on-abandoning-statistical-significance/#comment-1119496

I’m not sure. I do think Mayo gives a lot of weight to the originators of certain ideas, like Fisher and Neyman and Pearson, but I think that’s because they were the ones who came up with the ideas, so she views any alternative ideas using similar but different logic as “not what we’re talking about”. The problem is, no one is really doing Fisherian stuff, and N-P testing seems to be used almost exclusively in the bad NHST style…

Her response is basically “let’s step back to the original ideas, and then add on this Severity tweak” and all will make more sense… except that there have been multiple analyses of the actual formal SEV concept which show it fails to have good properties. One example is in, say, sequential analysis of data (Corey wrote a whole blog post on it: https://itschancy.wordpress.com/2019/02/05/the-sev-function-just-plain-doesnt-work/ and at least one follow-up…)

I guess what I’m saying is I don’t think Mayo’s arguments are pure appeal to authority, but I also don’t think she has great arguments for any particular formal procedure at all.

I didn’t say appeal to authority — I agree, she begins from a canonical textual position.

“Science understood broadly enough to include statistics requires a sort of anti-authoritarian thinking. One needs to constantly identify and question the premises of whatever one is arguing, because often the errors and confusions occur at that level.”

This is arguably even more true of philosophy.

My own humble attempt at a reconciliation….

The problem with NHST and p-values is in the decision, not the mechanics (though I know there are plenty of interesting objections to the latter). For myself, the p value provides useful information – expressed as a confidence interval, even more so. However, choosing any particular cutoff is the problem – as is thinking we need to decide whether the treatment is or is not effective. The evidence will always be ambiguous, but the degree of ambiguity (and the associated costs and benefits of different decisions, as Daniel urges) matters.

I find it hard to believe that Mayo wants to use p values to determine that a particular treatment is or is not effective. Rather, severe testing would seem to ask for assessing the evidence and its strength. Of course, we know a decision needs to be made – the clinician and patient need to decide whether or not to use a particular treatment. But does anybody really want that decision to be dictated by any particular p-value? And, on the other hand, does anybody really want to argue that the p value contains absolutely no useful information for making that decision?

Go ahead, support those extreme positions.

Who is arguing this? To get a p-value to do NHST you must first calculate an “effect size”, which is clearly information that could help in predicting future outcomes.

Can you give us an example? My understanding is that early effect-size estimates are inflated. No?

http://datacolada.org/wp-content/uploads/2014/04/Ioannidis-2008.pdf

The value could be inaccurate, but as long as it correlates with something you want to know you can throw it into your model and improve the predictive skill. I’m not sure an example is needed for that.

Anoneuoid,

You must be putting me on. WTF?

Not at all. To make useful predictions all you need is a correlation to be present in the same direction for both the training and future data. All of machine learning is based on this fact.
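A quick sketch of that claim with made-up data (all numbers hypothetical): a biased, inflated measurement `f` of the true driver `x` still beats a no-information baseline at predicting the outcome `y`, because the correlation survives the bias.

```python
import random

random.seed(0)

# Hypothetical setup: the measured "effect" f is a biased, inflated version
# of the true driver x, but it still correlates with the outcome y.
x = [random.gauss(0, 1) for _ in range(500)]
y = [xi + random.gauss(0, 0.5) for xi in x]
f = [2 * xi + 3 + random.gauss(0, 0.5) for xi in x]  # inflated and offset

# One-variable least squares: slope and intercept of y regressed on f.
n = len(f)
fm, ym = sum(f) / n, sum(y) / n
slope = sum((fi - fm) * (yi - ym) for fi, yi in zip(f, y)) / sum((fi - fm) ** 2 for fi in f)
intercept = ym - slope * fm

mse_model = sum((yi - (slope * fi + intercept)) ** 2 for fi, yi in zip(f, y)) / n
mse_baseline = sum((yi - ym) ** 2 for yi in y) / n
print(mse_model < mse_baseline)  # the biased predictor still beats predicting the mean
```

The fitted coefficient simply compensates for the bias; the point is only that a correlated signal, however distorted, carries predictive information.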

You need this signal to be present, in the same direction, and of somewhat consistent magnitude in *both the training and future data*, as Anoneuoid says…

For this to consistently be the case it usually requires a causal connection of some sort between the two things. It doesn’t necessarily lead to a causal model though. For example, high SAT scores are associated with high levels of academic achievement. The causal connection is that some sort of brain structure makes you good at taking SAT tests, and also makes you interested in and capable of graduating from undergraduate or graduate school. But going into the computer at Educational Testing Service and altering the SAT score to erroneously read much higher than the correct score doesn’t necessarily increase attainment.

Or it is that your muscles don’t contract as often so your liver/kidneys have to process less of one waste product and can produce something that enhances cognitive skill. Who knows?

Actually, the other day I started wondering if at least some seizures were due to disordered muscle contractions, and the brain activity associated with it (and thought to cause it) was actually a compensatory response. I found that people with seizure issues who were not helped by drugs were often helping themselves by doing muscle contractions, taking muscle relaxers, etc. I don’t personally want to go down that rabbit hole, but it seems like an intellectually profitable thing to explore.

But the thing is that there is no lack of culling evidence and evaluating its strength. That is what the Evidence-Based Medicine movement was about. I kinda remember reading Doug Altman’s seminal article on the subject back in the 90s. That movement is now being used as a source for marketing treatment options and trials. So addressing this situation calls for Open Science. We all get that and care about the prospects for better science.

Dale said,

“Of course, we know a decision needs to be made – the clinician and patient need to decide whether or not to use a particular treatment. But does anybody really want that decision to be dictated by any particular p-value? And, on the other hand, does anybody really want to argue that the p value contains absolutely no useful information for making that decision?”

+1

The irony is that Gelman has used p-values and significance tests in his published oeuvre, but there’s not a single non-toy example of such a thing anywhere in Mayo’s voluminous writings.

I deal with examples that have been the subject of enormous philosophical debates and confusion, and if you ever trace them out, you will discover, as I did in years of research, that they come back to the very simple examples I discuss in Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. If you can find an example that has been at the center of philosophical controversy–and these aren’t “academic” but have been taken seriously enough to lead thought leaders at the ASA and elsewhere to recommend alternatives to error statistical tests–that I overlook in SIST, please send it to me.

Reading others’ examples and doing some real-life statistical inference yourself are entirely different things.

Besides, you’ve discovered major insights and advancements in statistical inference. That’s like owning a Ferrari while everyone else is driving Model Ts. Don’t you have the urge to take your insights out for a spin in the real world? Really show people how much better your philosophical approach is when the rubber meets the road?

If so, it would be great to see you apply your insights to real data and real (answer not known ahead of time) problems.

I’ll take this as an opportunity to test the explanation I’ve given to stats and research methods classes. I say: your data, if they’re measured properly (a big if) are descriptive of the cases from which they’re drawn. If all you care about is those cases, you have a census, and you probably don’t need to do much if any statistical work. Just report the data.

Most of the time, however, your real interest is in the universe from which those cases were drawn. This can be a wider population in the present, or future populations, or both. Then you’re in the world of inferential statistics. The problem is the extent to which you can generalize what you found in the cases you had data for. This is complicated for many reasons: variability of measurement accuracy or appropriateness (for proxies) across cases, unknown as well as known potential for sample bias, etc.

One aspect of this is the “pure” variability that would arise from resampling a known population. That’s what p values aim at measuring, by comparing average effect size to effect dispersion and adjusting for sample size. But this works only in the pure case. To get that case, we even have to make a heroic assumption that our sample dispersion is fully informative about the population dispersion. In reality, the pure case is extremely specialized and not generally representative of the real situation.

Much better is to study the generalization problem directly, by considering all the potential obstacles to generalization you can come up with and testing for them. (I give this the name “robustness analysis”.) Of course, effect-size-to-dispersion, along with sample size, gives you important information, but combine that with other aspects of the generalization problem. (Then tell the story of Student and his advocacy of distributing test plots according to known soil and aspect gradients rather than randomly. This was especially helpful for me because many of my students were experimental ecologists.)

The fundamental problem with p-values is that they falsely simplify the problem of generalization. (This aside from their widespread misuse as ostensible support for the hoped for true—“alternative”—hypothesis.)
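The “pure case” can be shown in a few lines of simulation (hypothetical known population; a z approximation stands in for the t-test): when the null is exactly true and we literally resample the population, p-values do behave as advertised.

```python
import math
import random

random.seed(1)

def two_sided_p(sample):
    """z-test p-value for H0: mean = 0, using the sample's own dispersion
    (the 'heroic assumption' that it stands in for the population's)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    z = abs(mean) / (sd / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Resample a known population (standard normal, null exactly true).
p_values = [two_sided_p([random.gauss(0, 1) for _ in range(50)]) for _ in range(2000)]
frac_below = sum(p < 0.05 for p in p_values) / len(p_values)
print(frac_below)  # close to 0.05 -- but only because every assumption holds by construction
```

Every real obstacle to generalization (measurement error, sample bias, and the rest) breaks some assumption this simulation builds in.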

Peter said,

“I’ll take this as an opportunity to test the explanation I’ve given to stats and research methods classes.”

OK, Here’s my opinion:

This part is a good beginning: “I say: your data, if they’re measured properly (a big if) are descriptive of the cases from which they’re drawn. If all you care about is those cases, you have a census, and you probably don’t need to do much if any statistical work. Just report the data.”

The rest, although it states very important things, is too much for students to assimilate at once. So it needs to be stretched out over the course of the class, with little bits pointed out as they come up. Your list provides a good outline of points that need to be made, but I think the points are more likely to be made (and stick) if they are made in a “class participation” manner rather than just stated. I’ve got a lot of examples of how I (at least try to) do this in the “slides” that can be downloaded from https://web.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html

Some of these are in the form of “quizzes” for class discussion (see, e.g., pp. 43 and 61 of Day 3 slides). For these, I first ask for a show of hands on which classification (“Doesn’t get it”, “Gets it partly, but misses some details”, “Gets it!”), then ask for volunteers to support their answer, with discussion as needed to help make the point.

Well yes, of course! I didn’t mean that I say all this at one time. This is an agenda for weeks and weeks. And I also agree that lecturing at people is often not the best way to get the point across. But what I hope is that readers of this thread will think about this as a way to bring students into the issues surrounding p-values, and with a positive, constructive framework.

With regard to clinical trials – it is only the regulators who have to decide on drug approval that may need clear dividing lines (thresholds). The “may” is used in case they are able to use them as defaults rather than strict requirements. (To avoid the risk of lawsuits, reasons for not following defaults need to be very clear and well documented.)

But academics don’t need this and should only make suggestions regarding approval rather than approval decisions.

I do owe a blog on giving some sense of what happens in drug regulation – but what interests or should interest most here are assessments of evidence and perhaps informed decisions based on full consideration of all evidence and potential losses.

So I don’t think Mayo’s arguments apply outside regulatory agencies.

This is an artificial problem. Regulators shouldn’t be approving drugs on this basis to begin with.

Individuals should be given all the information available to judge whether the expected benefits of a treatment outweigh the expected costs for their particular situation and priorities. Many times the answer will be “we do not know”. The role of regulators would be to ensure that information is not fraudulent.

Anoneuoid said: “Individuals should be given all the information available to judge whether the expected benefits of a treatment outweigh the expected costs for their particular situation and priorities. Many times the answer will be “we do not know”. The role of regulators would be to ensure that information is not fraudulent.”

Agreed. But we have a long way to go before anything near this can be considered standard practice.

Keith: Are you saying none of my arguments hold outside regulatory contexts? If you agree that I shouldn’t be allowed* to spin any and all results as in sync with a claim, with none challenging or counting as evidence against it, then you think there should be minimal predesignated thresholds. The very minimal weak severity requirement is premised on this intuition.

*At least if I’m purporting to have evidence or grounds or warrant (or whatever term you like) for something.

This holds as much in day-to-day reasoning, in philosophy, law, or any other field, as in regulatory contexts – even though, obviously, there is the opportunity for greater uniformity & formalization there.

I’d be glad to read your blog post on drug regulation. My knowledge is largely from trading the biotech stocks, & following the daily FDA narratives.

> if I’m purporting to have evidence or grounds or warrant (or whatever term you like) for something.

In science, it should be a community that “approves” such claims, so I probably should have included review groups. But why would they need a set threshold rather than varying it as assessed most appropriate? And in science we shouldn’t seek final answers but rather just make pauses until we know how we are wrong.

> none of my arguments hold outside regulatory contexts?

Instead of “hold” I would say not the most appropriate.

Keith said, “I do owe a blog on giving some sense of what happens in drug regulation “

Please do give us such a blog when you have a chance!

I think that the basic problem with NHST (and Andrew has said this, but in a more diffuse way) can be put this way –

We want to see if an experiment shows an effect. We find p = .1. With that value of p, the results could well have happened by chance if the effect E had really been 0. So the NULL HYPOTHESIS of E = 0.0 has not been rejected.

Great, nice to know, but for many experiments, we couldn’t reject E = .1, E = .2, E = -.15, etc., either. Why should the value E = 0.0 have a special privilege over any of those others? If there is any reason, it’s not in the numbers, but in some theory, or prior knowledge, or something else.
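A toy computation makes this concrete (the estimate 0.15 and standard error 0.09 are hypothetical, chosen so that testing E = 0 gives p ≈ 0.1): the same data fail to reject a whole range of other effect sizes as well.

```python
import math

def p_value(y_bar, se, e0):
    """Two-sided z-test p-value for H0: E = e0, given estimate y_bar with standard error se."""
    z = abs(y_bar - e0) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

y_bar, se = 0.15, 0.09  # hypothetical estimate and standard error
for e0 in [0.0, 0.1, 0.2, 0.3]:
    print(f"H0: E = {e0:.2f}  ->  p = {p_value(y_bar, se, e0):.3f}")
# None of these nulls is rejected at 0.05; nothing in the numbers privileges E = 0.
```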

Tom: If you formulate tests as I do, you would directly distinguish those discrepancies which are, and those which are not, warranted by the data. We look at the capacity of the test to have detected various parametric effect sizes and use those error probability assessments to ascertain how severely claims have passed. The claims go beyond the commonly used nulls and alternatives which, for N-P, exhaust the parameter space. But they also developed CIs and power analysis, which the severity function extends slightly. https://errorstatistics.com/2018/12/04/first-look-at-n-p-methods-as-severe-tests-water-plant-accident-exhibit-i-from-excursion-3/

Deborah, since I started to read your blog, I have appreciated your language terms “severely tested” and “[not] warranted by the data”. They seem to me to be exactly what we really want to know. For myself, who has rarely had to get into any really complicated statistics, I generally think in terms of number of standard deviations or standard errors. It’s simple enough for most of my needs. Of course, they can be translated (with the right assumptions) into p-values, confidence bands, likelihoods, or reworked into Bayes factors if one really wants to, but the standard errors are underneath them all, at least for my usual needs.

OTOH, I haven’t had to deal with some of the really complicated situations that come up on this blog. So it’s easy for me to say what I just did…

It seems to me that you’re not really arguing about much. You both want to get rid of NHST. Mayo seems to be disturbed that you want to ban p-values entirely, because she thinks they’re useful when used correctly. You seem to believe that, given scientists’ track records when they’ve had unfettered access to p-values, their use should be eliminated, and that better methods exist for any case where they’d be used anyway.

Myself, I agree they should be banned entirely, primarily because they lead people into silly logical traps even outside NHST. The errors are too easy to make. However, I’m skeptical that abuse will diminish should any other method come to dominate. We’re already seeing all kinds of problematic articles on Bayes factors.

» Mayo: “You don’t give up on correct logic because some people use illogic.”

That’s funny coming from someone who explicitly touts induction—the prime example of illogic—as what “scientific inference” is all about in the end. I try to explain why that is fundamentally wrong, and why induction is not needed in science, in this comment.

I would enjoy an autobiographical account of a scientist that contains a detailed description of how she came to think about her scientific endeavor, because the technical language used in much research is just too limited theory-wise. Not to say that an autobiographical account would necessarily contain a theory. The story would be more explanatory, to me at least.

Uses of logic are very helpful at times. Yet I think some dimensions of reasoning toward a ‘scientific’ query/endeavor just can’t be explained easily. Epiphanies can result randomly.

Some people can solve quite complex problems without any identifiable and seemingly necessary training. I’ve observed this in some situations.