Nick Longford on Type S and Type M errors

Nick Longford has some comments on my paper with Francis Tuerlinckx on Type S errors. He links to this paper of his and writes:

I have now read the Gelman & Tuerlinckx manuscript about Type S error rates.
I am sorry I am unrepentant; I regard hypothesis testing as a ritual,
problematic in all contexts: rates of errors are not as relevant as the
consequences of making errors. The consequences are often difficult to
assess or evaluate, but that is hardly an excuse to reduce our analysis to
something less relevant. The flip side: everybody who is anybody does use
hypothesis testing.

Hypothesis tests and related means of making decisions in the presence
of uncertainty (AIC, BIC, …, ZIC) are well-oiled steam engines. It is a
heroic enterprise to make them fly, even with the latest microchip technology.
All this said and meant, testing for $\theta_1 = \theta_2$ is the ritual of
all rituals; I am still looking for an example where it makes sense and
is conducted genuinely for a reason other than 'because that's what we had
in our Stats course'. The appropriate adjustment is to test the hypothesis that
the absolute difference is small; the analyst should define 'small' in
discussion with the client. Now the parameter space is split into three
subsets (<, ~, >), and the analysis has nine potential results:
three types of agreement (<<, ~~, >>), four types of error (<~, ~<, >~, ~>),
and two types of awful error (<>, ><). Given all my objections, I would find this extension of Type S errors a bit more palatable. It would highlight the fact that some errors are more serious than others. One could also say, e.g., that two of the subsets are possible but the third is unlikely. It takes a lot of the client's integrity to admit 'I do not know' (failure to reject $H_0$ in the old-fashioned setting), and this conclusion is usually corrupted into acting as if there were no difference. A scientifically more palatable conclusion is: … we are confident that it is small (relevant, e.g., in bioequivalence studies, as currently required by the FDA).

Comparisons of Bayes and frequentist: I am not sure that the comparisons are made on an equal footing. A prior has to be specified with Bayes, and that gives it an informational advantage over the frequentist. However, the frequentist can also use a prior. For example, it could be converted into a 'synthetic' additional set of observations (with a fractional sample size if necessary). Or, simpler: the likelihood is maximised with the prior added as a factor.

Multiple comparisons: an added complexity is that when multiple comparisons are applied less formally, the manner in which the results are reported defies any straightforward description; it certainly is not 'fail if there is at least one fail' among the tests. The answer is a protocol for reporting (including 'some tests are more important than others'), but that makes much of the nice probability calculations not very useful.

Notwithstanding all my moans, this is a good paper and I would accept it, with ideological objections, subject to more discussion, if I had a journal.
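Longford's three-region classification is easy to make concrete. Here is a minimal sketch (mine, not his; the 'small' threshold and the labels are illustrative assumptions) that maps a difference into one of the three subsets and grades every (true, concluded) pair as agreement, error, or awful error:

```python
# Minimal sketch of the three-region classification described above.
# The threshold for 'small' is an illustrative assumption.

def region(delta, small=0.1):
    """Map a difference to one of the three subsets: '<', '~', '>'."""
    if delta < -small:
        return "<"
    if delta > small:
        return ">"
    return "~"

def severity(true_region, concluded_region):
    """Grade a (true, concluded) pair: 3 agreements, 4 errors, 2 awful errors."""
    if true_region == concluded_region:
        return "agreement"
    if {true_region, concluded_region} == {"<", ">"}:
        return "awful error"
    return "error"

if __name__ == "__main__":
    for t in "<~>":
        for c in "<~>":
            print(f"true {t}, concluded {c}: {severity(t, c)}")
```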
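His remark that 'the likelihood is maximised with the prior added as a factor' is ordinary penalized maximum likelihood, i.e. MAP estimation. A hedged sketch for a normal mean with known sampling variance; the data and the prior here are invented for illustration:

```python
# 'Maximise the likelihood with the prior added as a factor' (MAP estimation)
# for a normal mean. Data, sampling sd, and prior are invented for illustration.
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([0.3, -0.1, 0.5, 0.2])  # hypothetical observations
sigma = 1.0                          # known sampling sd (assumed)
mu0, tau = 0.0, 0.5                  # prior mean and sd (assumed)

def neg_log_posterior(theta):
    # negative (log likelihood + log prior), up to additive constants
    loglik = -0.5 * np.sum((y - theta) ** 2) / sigma**2
    logprior = -0.5 * (theta - mu0) ** 2 / tau**2
    return -(loglik + logprior)

map_est = minimize_scalar(neg_log_posterior).x

# With conjugate normals the maximiser has a closed form; check against it.
closed_form = (np.sum(y) / sigma**2 + mu0 / tau**2) / (len(y) / sigma**2 + 1 / tau**2)
print(map_est, closed_form)
```

The 'synthetic observations' variant he mentions is the same idea: here the normal prior acts like $\sigma^2/\tau^2$ extra observations located at $\mu_0$.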

He adds the following:

Add a reference to what I regard as my manifesto ("Certainty is the opium of the masses." Marcus Carlsson; for this occasion it should be "Hypothesis testing is the opium of the statisticians".) A generic alternative to hippo testing: estimate the indicator of the decision (such as the sign: + → 1, − → −1; a ripe area for the EM algorithm).
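The 'estimate the indicator of the decision' idea can be read as estimating $E[\mathrm{sign}(\theta)]$ instead of testing $\theta = 0$. A minimal sketch using posterior simulation draws; the draws below are fabricated stand-ins, and I use plain Monte Carlo rather than the EM algorithm he alludes to:

```python
# Estimate the sign indicator (+1 / -1) of a parameter instead of testing
# theta = 0. The posterior draws are fabricated stand-ins; in practice they
# would come from a fitted model.
import numpy as np

rng = np.random.default_rng(1)
theta_draws = rng.normal(loc=0.15, scale=0.2, size=10_000)  # stand-in posterior

sign_estimate = np.mean(np.sign(theta_draws))  # estimate of E[sign(theta)]
prob_positive = np.mean(theta_draws > 0)       # Pr(theta > 0)
print(sign_estimate, prob_positive)
```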

A story to go with this: the year is 2044, and Haji Shehu, an Albanian long-jumper, returns from the Olympic Games in Tegucigalpa
with a gold medal. He jumped 8 metres 85 in a rainstorm, and would
have got the gold medal even with his fifth-best jump of 8.40. His
next goal is to make at least as much money as his football-playing
friends in Italy, whom he regards as far from perfect: the striker
fails to convert a penalty, the goalkeeper misses the ball from a
corner, etc. And here is his chance. The bank that sponsors him
(he is still an amateur) offers him a role in a commercial where he
has to jump over a deep precipice, from one ledge to another 7 metres
apart. Well advised, he does not agree, correctly anticipating that the bank will raise the offer. There is nobody else who could do this; all the other world-class jumpers are either black or Asian, and in Albania … After some time, the bank calls a meeting and its statistician describes the analysis she has conducted, concluding with a
so-called p-value for the hypothesis of failure to clear 7 metres, based
on a well-fitted (and thoroughly checked) time series model. The final verdict: p = 0.00000001. The long jumper hires his own statistician, who confirms that all the calculations and conclusions are textbook correct.

In 2044, statistics is still irrelevant because it is not interested in the consequences of making incorrect decisions. The principal factor, the comparison of the long-jumper's own valuations of his life (failure) and of the lost income, is still ignored because it is, in essence, micro-economics. This is a good reason for not being a statistician.
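The story's point is decision-analytic: what settles the jumper's choice is not the p-value but the expected loss. A toy calculation, with every number invented purely for illustration, shows how a minuscule failure probability can still lose to a catastrophic loss:

```python
# Toy expected-loss comparison for the long-jumper's decision.
# All numbers are invented to illustrate the argument; only the structure
# (tiny failure probability vs. asymmetric stakes) matters.
p_fall = 1e-8            # the statistician's probability of failing to clear 7 m
fee = 1e6                # hypothetical fee for the commercial
value_of_life = 1e15     # the jumper's own valuation of his life (assumed huge)

expected_gain = (1 - p_fall) * fee - p_fall * value_of_life
print(expected_gain)  # negative: the tiny p is outweighed by the catastrophic loss
```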