I think this presentation by John Ioannidis offers an interesting case study and insight into why consumers/patients concentrate on the qualitative side of the evidence. When experts explain their research results in plain English [without jargon], it is feasible that consumers/patients can also add value to research, beyond the trials.

This is arguably even more true of philosophy.

Every time you call null hypothesis testing an “animal,” I’m gonna say: no, NHST is not an animal, it’s a statistical method that thousands of researchers use every day. It’s a method with real problems, and these problems have malign consequences; see for example section 2 of this article.

Neyman would never do a test with a single hypothesis.

Neyman is first author on a paper that does that (and multiple other much worse “statistical sins”) here:

https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1117823

Mayo can ignore this by the convenient habit of never making a real statistical inference.

As blunders go, assuming your method works because it agrees with your philosophy is the Statistics version of “start a land war in Asia”. Yet Mayo, who was never once tested by reality, has no other criteria for judging her ideas.

At this point Andrew will jump in with “lots of professional sports coaches never played the sport they coach!” blah blah blah or some such excuse like that.

In fact, I suspect most scientists, Bayesian or not (including myself), implicitly follow such a two-stage process in analyzing data: an initial determination of whether or not the data contain “interesting” information at all, followed by a more involved model comparison/development process.

That first stage is rarely formalized at all, but it seems like that would be a good role for hypothesis testing. E.g., what is the (marginal) likelihood of the data given my best current understanding of the system (which could itself be expressed as a Bayesian prior distribution over models/parameters)? This is, of course, the “surprise index” (well, if you take the negative log anyway) from info theory, so we could say that data are “interesting” if they are “surprising” given our current best understanding. And we only update our beliefs if the data pass our surprise threshold.

I’m not saying this is the best way to go, just that it is not wholly unreasonable and, I suspect, what most scientists actually do in practice. Problems arise in how to determine what is surprising and how to update beliefs, whether or not these are integrated into a single process or done separately. E.g., violating a straw-man null is *not* interesting, because that kind of null doesn’t actually reflect our best current understanding.
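As a toy illustration of that first stage: the surprise index is just the negative log-likelihood of the data under the current-understanding model. Everything below (the Gaussian model, the numbers, the threshold) is my own made-up sketch, not anything specified in the comment:

```python
import math

def surprise(data, mean, sd):
    """Surprise index: negative log-likelihood of the data under our
    "best current understanding" model (here a toy Gaussian).
    Higher values mean the data are more surprising given that model."""
    return -sum(
        -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)
        for x in data
    )

# Stage 1: are the data "interesting" at all?
expected = [0.1, -0.2, 0.05, 0.3, -0.1]   # roughly what the model predicts
anomalous = [3.1, 2.8, 3.5, 2.9, 3.2]     # far from the model's predictions

s_expected = surprise(expected, mean=0.0, sd=1.0)
s_anomalous = surprise(anomalous, mean=0.0, sd=1.0)

# Only proceed to stage 2 (model comparison/development) past a threshold;
# in practice the threshold would be calibrated, e.g. by simulating data
# from the current-understanding model itself.
threshold = 10.0
print(s_expected, s_anomalous, s_anomalous > threshold)
```

Note that the threshold is the weak point, exactly as the comment says: set it against a straw-man model and "surprising" stops meaning anything.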

Besides, you’ve discovered major insights and advancements into statistical inference. That’s like owning a Ferrari while everyone else is driving Model Ts. Don’t you have the urge to take your insights out for a spin in the real world? Really show people how much better your philosophical approach is when the rubber meets the road?

If so, it would be great to see you apply your insights to real data and real (answer not known ahead of time) problems.

I wouldn’t dream of criticizing a guru on statistical methodology who hasn’t made a real statistical inference in four decades of lecturing on how it’s done.

I just think it would be extraordinarily illuminating, both for Mayo and everyone else, if she did.

I think you’re talking about this exchange:

I’m not sure. I do think Mayo gives a lot of weight to the originators of certain ideas, like Fisher and Neyman and Pearson, but I think that’s because they were the ones who came up with the ideas, so she views any alternative ideas using similar but different logic as “not what we’re talking about”. The problem is, no one is really doing Fisherian stuff, and N-P testing seems to be used almost exclusively in the bad NHST style…

Her response is basically “let’s step back to the original ideas, and then add on this Severity tweak, and all will make more sense”… except that there have been multiple analyses of the actual formal SEV concept which show it fails to have good properties. One example is sequential analysis of data (Corey wrote a whole blog post on it: https://itschancy.wordpress.com/2019/02/05/the-sev-function-just-plain-doesnt-work/ and at least one follow-up…)

I guess what I’m saying is I don’t think Mayo’s arguments are pure appeal to authority, but I also don’t think she has great arguments for any particular formal procedure at all.

1) Hypothesis testing via something like p-values is a reasonable way to challenge a model and drive theory development, even though it is not typically how scientists use this procedure.

As long as you set your hypothesis as the null hypothesis, sure. That is not what people do though.

2) The frequent misuse of the procedure outweighs its potential utility so it would be best to drop it entirely.

The misuse is testing a hypothesis not predicted by your theory (exactly zero difference between groups, etc). Mayo never provides real life examples of what she would recommend afaict, but I am fairly certain that if she did it would qualify as a misuse.

Unitarity violation (probabilities greater than one) would not happen until the TeV scale, so at 125 GeV (the observed Higgs mass) using a Higgsless Standard Model was not a strawman.

How is it not a strawman? If you take a group of assumptions and derive the consequences to find two things:

1) The background expected from the LHC experiments

2) That probabilities can be greater than one

Then you must accept #2 if you want to believe #1 is accurate. If that is not what was done I’d appreciate learning about it. What exact assumptions were used to derive the background model, and do those assumptions also lead to problems like unitarity violations?

Also, do you have a link to any papers from the 1960s or earlier discussing this “unitarity violation”?

No physicist I know of ever said the “Higgs boson exists with 5-sigma confidence!”.

From ATLAS:

A statistical combination of these channels and others puts the significance of the signal at 5 sigma, meaning that only one experiment in three million would see an apparent signal this strong in a universe without a Higgs.

https://atlas.cern/updates/press-statement/latest-results-atlas-higgs-search

The opposite of “a universe without a Higgs” is “a universe with the Higgs”. Of course, this statement neglects to mention non-standard model possibilities.
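For concreteness, the tail probability behind “5 sigma” can be computed directly from the standard normal distribution. This is my own back-of-the-envelope, one-sided calculation, which comes out close to (but not exactly) the “one in three million” in the press statement:

```python
import math

# One-sided upper-tail probability of a standard normal at z = 5,
# via the complementary error function: p = P(Z > 5).
z = 5.0
p = 0.5 * math.erfc(z / math.sqrt(2))
print(f"p = {p:.3e}, i.e. about 1 in {1 / p:,.0f}")
```

The one-sided number is near 1 in 3.5 million; the press statement’s “three million” presumably reflects rounding or a slightly different convention.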

In science, it should be the community that “approves” such claims, so I probably should have included review groups. But why would they need a set threshold rather than varying it as judged most appropriate? And in science we shouldn’t seek final answers but rather just pause until we learn how we are wrong.

> none of my arguments hold outside regulatory contexts?

Instead of “hold” I would say they are not the most appropriate.

Now if clinical reports submitted to regulators are made public (with no hint of selective availability) then I would trust those the most. There is some hope that this will be the future.

Now if published papers were randomly audited for accurately reflecting what was done and observed, we would have an error rate, which might allow a certain degree of trust.

1) Hypothesis testing via something like p-values is a reasonable way to challenge a model and drive theory development, even though it is not typically how scientists use this procedure.

2) The frequent misuse of the procedure outweighs its potential utility so it would be best to drop it entirely.

First, does this seem a reasonable description of the positions under discussion?

Second, on the basis of various other comment threads, it seems like there are two kinds of statistical applications being discussed (and people are often vague about which they mean):

A) How to challenge scientific theories with the aim of developing them further.

B) How to guide policy decisions.

While these are not incompatible, they might well be served by different procedures in different ways. For example, with regard to A, model comparison is very natural in a Bayesian framework, but model checking is a type of hypothesis test. Application B seems to me to be essentially a kind of Bayesian decision problem (as Lakeland points out), but maybe this is just lack of imagination on my part; for example, hypothesis testing could indicate whether enough evidence was available to make a decision *at all*.

Anyway, just putting this out here to try and better understand the points of contention and how we can find a path toward reconciliation (in other words, why can’t everyone just get along?).

To clarify: I’m not talking about what you call the “animal.” I’m talking about a statistical method by which a researcher gathers data, computes a p-value or Bayes factor or some other data-based summary statistic, and uses that to declare that an effect is real or not, or that a hypothesis is true or not, or some other binary decision. I’m also talking about variants of this method, such as a tripartite rule under which a researcher declares that an effect is definitely true if p is less than 0.01, that an effect is small or intermittent if p is between 0.01 and 0.1, and that an effect is zero if p is greater than 0.1.

Again, it’s not an animal, it’s a statistical method, it’s a commonly used statistical method (or, to be precise, a class of statistical methods), and it has major problems.
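The tripartite rule described above can be written out explicitly, which makes the criticism concrete (the thresholds are taken verbatim from the description; the code is my own illustration):

```python
def tripartite_rule(p):
    """The criticized three-way decision rule: map a p-value
    straight to a substantive claim about the effect."""
    if p < 0.01:
        return "effect is definitely true"
    elif p <= 0.1:
        return "effect is small or intermittent"
    else:
        return "effect is zero"

# Two nearly identical p-values land in qualitatively different categories:
print(tripartite_rule(0.009))  # "effect is definitely true"
print(tripartite_rule(0.011))  # "effect is small or intermittent"
```

Nearly identical evidence (p = 0.009 vs p = 0.011) gets mapped to qualitatively different claims, which is part of what makes this class of methods problematic.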

I hear you. I disagree that consumers have to depend entirely on what the experts hold. Who is to determine which tasks are being performed reliably? As it is, consumers/patients are already sharing their experiences with each other on Facebook and Twitter. So stakeholders may be paying attention.

Having lived in Boston, I did come across some who were very astute. I speculate that many misunderstand the utility of tests and technologies. But that is also the case with specialists in many disciplines.

This is all why patients and consumers of statistics have to become more attuned to these controversies in statistics and medical treatments.

Which published reports do you recommend?

Uses of logic are very helpful at times. Yet I think some dimensions of reasoning toward a ‘scientific’ query/endeavor just can’t be explained easily. Epiphanies can occur at random.

Some people can solve quite complex problems without any identifiable and seemingly necessary training. I’ve observed this in some situations.

If you give a criticism that holds for actual significance testing, or error statistical hypothesis testing (N-P or F), we can consider it.

I myself reformulate and extend the tests, but that doesn’t mean the tests license the absurd criticisms raised against them.

The same man who formulated N-P tests formulated confidence intervals, in the same years, around 1930.

I was referring to “NHST”, the abusive animal characterized and criticized by people for allowing moves from statistical to substantive claims, and allegedly giving bright-line true/false results. Or worse, claims that the null or alternative are “proved”. If anyone says it refers to actual significance tests as put forward in either the N-P or F form, then the criticisms don’t hold. If the criticisms hold, then people are referring to the abusive NHST animal. The latter reading is the one I’m giving because it is now so tightly ingrained in official statements.

If you use randomization, select a hypothesis based on a specific predicted theoretical deviation, base the test on the actual observed distribution of outcomes rather than a theoretically convenient distribution, calculate a p-value, and use the p-value in a utility-based decision, it isn’t as good as a Bayesian analysis, but it does mitigate most of the worst aspects of NHST.

The most useful application of a p-value is as a screening tool to reduce the number of things you have to spend money and effort investigating. So for example you give 100 drugs to mice and screen out all but those that give a p-value smaller than 0.01, and then you do careful studies of toxicity and effectiveness on just those 6 or whatever. Implicitly this is a utility-based decision: you know the cost of doing the full analysis is high, and the average cost of screening out a random drug is low, since most drugs are unhelpful.
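That screening logic is easy to simulate. The effect sizes, group sizes, and the normal-approximation z-test below are all my own illustrative choices, not anything specified in the comment:

```python
import math
import random

random.seed(42)

def two_sample_p(x, y):
    """Two-sided p-value from a z-test with the Welch standard error
    (normal approximation; reasonable for n >= 30 per group)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (my - mx) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))

# 100 hypothetical drugs: most do nothing, a few have a real effect.
effects = [0.0] * 94 + [1.0] * 6
candidates = []
for i, eff in enumerate(effects):
    control = [random.gauss(0.0, 1.0) for _ in range(30)]
    treated = [random.gauss(eff, 1.0) for _ in range(30)]
    if two_sample_p(control, treated) < 0.01:
        candidates.append(i)

# Only drugs passing the screen go on to careful (expensive) study.
print(f"{len(candidates)} of {len(effects)} drugs pass the screen")
```

With most effects at zero and a 0.01 threshold, the handful of real effects dominates the pass list, which is the economy the comment describes: a cheap filter before the expensive follow-up studies.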

But no, lowering the p-value threshold while letting people continue to make logical errors like straw-man NHST would be unhelpful. It would simply increase the cost of doing bad science, but since the cost is borne by taxpayers and not scientists, there would be no reduction in supply…

No physicist I know of ever said the “Higgs boson exists with 5-sigma confidence!”. The ATLAS and CMS discovery papers reported “An excess of events is observed above the expected background, with a local significance of 5.0 standard deviations, at a mass near 125 GeV, signalling the production of a new particle” with properties that are “compatible with the production and decay of the Standard Model Higgs boson”. There exist lots of alternative models that are compatible with the observed Higgs properties but predict additional new physics (e.g. more new particles or deviations in the observed Higgs properties) that high energy physicists continue to test for.

Over the years, more and more measurements have been made of the properties of this new boson and they are all depressingly consistent with the standard model Higgs. It is true that except for the spin-parity, which is established with high confidence, most individual properties have only been tested at the 10-20% level, but there are a lot of properties measured and they are all consistent with the Higgs. Since it looks like a Higgs, walks like a Higgs, and quacks like a Higgs, we have decided to call it a Higgs, even if it eventually turns out to be not exactly the Standard Model Higgs.

It is also important to be aware of the bias of physicists when they make these tests. The existence of a Standard Model Higgs means that it is theoretically possible that there are no new fundamental particles/forces above the Higgs mass (125 GeV) all the way up to the Planck Scale (1e19 GeV!). If there is no new physics, the future of accelerator based high energy physics is very uncertain. We don’t want the properties of the observed 125 GeV boson to be consistent with the Standard Model Higgs. We want disagreement! Given this strong bias, it should be impressive that we haven’t yet found any.

That’s funny coming from someone who explicitly touts induction—the prime example of illogic—as what “scientific inference” is all about in the end. I try to explain why that is fundamentally wrong, and why induction *is not needed in science*, in this comment.

https://en.m.wikipedia.org/wiki/Higgs_boson#Current_status

So the evidence for the Higgs boson is far weaker than the evidence against the Higgsless standard model no one believed in.

Historically, this is how NHST “infects” a field. First, testing a strawman model is done in addition to science (testing your model), but slowly it takes up more and more space and importance due to the possibility of “sexy” (but wrong) claims like “Higgs boson exists with 5-sigma confidence!”

Saw the same thing with LIGO.

Take the Higgs boson data. What theory was the background model derived from?

My understanding is that in a universe described by the standard model without a Higgs, the probability of certain events occurring would be greater than 1. So a new particle was theorized to “fix” this apparent contradiction in the math.

But no particular mass was predicted for this particle, so they instead tested for what they expected to see in a universe where overunity probabilities are possible, and when a deviation from that was detected, they assumed it must be the Higgs. This is more like Meehl’s idea of “psychology” than “physics”.

Now, as I understand it there were some theoretically predicted properties of the Higgs, just not the mass. And whatever it is they detected had those precise properties. If that’s correct, then that is why we should trust the Higgs version of the standard model. The 5-sigma deviation from a universe with overunity probabilities is irrelevant.

Science understood broadly enough to include statistics requires a sort of anti-authoritarian thinking. One needs to constantly identify and question the premises of whatever one is arguing, because often the errors and confusions occur at that level.

Yes, the inferential & data analysis tools are needed to reach claims, assess evidence, find things out, and highlight claims that are poorly tested (at least, thus far). While consumers need to make individual decisions, doing so is enabled only insofar as the previous tasks are performed reliably.

Mathematical and statistical skills/talent/understanding are NOT required and NOT taught by any medical school that I know of (I’m a practicing doctor who teaches residents).

The most common beginning phrase in the title of a high energy physics paper is “Search for”, and the paper almost always reports a null result. We don’t claim a discovery unless the data are very inconsistent with the null hypothesis. One reason we want high “statistical significance” (e.g. the famous “5 sigma” discovery threshold) is so we can report on all the ways we have sliced and diced the data to check for inconsistencies that might reveal systematic errors. The problem with “digging into the data” is when the default assumption is that any inconsistencies with the null hypothesis are “discoveries”, not evidence for systematic errors.

Myself, I agree they should be banned entirely, primarily because they lead people into silly logical traps, even if not NHST. The errors are too easy to make. However, I’m skeptical that abuse will diminish should any other method dominate. We’re already seeing all kinds of problematic articles on Bayes factors.

OTOH, I haven’t had to deal with some of the really complicated situations that come up on this blog. So it’s easy for me to say what I just did…

Drug X is used by patients. A company thinks they have made an improved version of drug X, called drug Y. So they randomise 100 patients to drug X and 100 patients to drug Y. They then compare the distribution of outcomes experienced in the two trial arms and run some kind of statistical test for the difference. They might test for differences in means, log means, distributions (KS test, for example), or whatever. The test fails to reject. Then the company thinks: OK, should we perform a larger trial? They conduct a formal or informal cost-benefit analysis, according to which running a larger trial doesn’t seem worth it. So they stop.

The only big difference between your proposal and the above example seems to be the use of randomisation, which I can’t imagine you object to. So it’s difficult for me to pinpoint where your strong objection to NHST originates.
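The hypothetical trial above is easy to simulate end to end. Here is a sketch with Gaussian outcomes, a small assumed benefit for drug Y, and a hand-rolled two-sample KS test; all of these are my illustrative choices, and the p-value uses the standard asymptotic approximation:

```python
import math
import random

def ks_2samp(x, y):
    """Two-sample Kolmogorov-Smirnov test: returns (D, approximate p).
    The p-value uses the asymptotic Kolmogorov distribution with the
    usual small-sample correction to the effective sample size."""
    xs, ys = sorted(x), sorted(y)
    nx, ny = len(xs), len(ys)
    d = 0.0
    ix = iy = 0
    # The sup of |ECDF_x - ECDF_y| is attained at a data point.
    for v in sorted(set(xs + ys)):
        while ix < nx and xs[ix] <= v:
            ix += 1
        while iy < ny and ys[iy] <= v:
            iy += 1
        d = max(d, abs(ix / nx - iy / ny))
    if d == 0.0:
        return 0.0, 1.0
    en = math.sqrt(nx * ny / (nx + ny))
    lam = (en + 0.12 + 0.11 / en) * d
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, max(0.0, min(1.0, p))

random.seed(0)
arm_x = [random.gauss(0.0, 1.0) for _ in range(100)]  # outcomes on drug X
arm_y = [random.gauss(0.1, 1.0) for _ in range(100)]  # drug Y: small benefit
d, p = ks_2samp(arm_x, arm_y)
print(f"KS D = {d:.3f}, p = {p:.3f}")
```

With a true benefit this small and 100 patients per arm, the test will usually fail to reject, which is exactly the "fails to reject, so should we run a larger trial?" situation in the example.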

In my own view, trial p-values should not be dichotomized by researchers; they should be reported for interpretable differences between trial arms, and they should always be accompanied by point estimates and confidence intervals. Then it should be the responsibility of regulatory agencies to decide what gets approved. Dichotomization of p-values may be an ugly but practically important contributor to the regulatory decision, which is a sausage-making process, not a statistical ideal. One thing I think generally should not contribute to the regulatory decision is cost-benefit analyses from industry, which make lying easy.

Apart from my deep dislike for industry cost-benefit analysis in medicine, I don’t hold my views strongly — I could be convinced to change them.

Finally, regarding your psych example, I agree that is bad practice, but it seems very different from medical trials.
