John Christie writes:

I was reading this paper by Habibnezhad, Lawrence, & Klein (2018) and came across the following footnote:

In a research program seeking to apply null-hypothesis testing to achieve one-off decisions with regard to the presence/absence of an effect, a flexible stopping-rule would induce inflation of the Type I error rate. Although our decision to double the N from 20 to 40 to reduce the 95% CI is not such a flexible stopping rule, it might increase the Type I error rate. That noted, we are not proposing any such one-off decisions, but instead seek to contribute to the cumulative evidence of the scientific process. Those seeking such decisions may consider the current report exploratory rather than confirmatory. (fn 2)

Given the recent strong recommendations by many against adding participants after looking at the result, I wonder whether you feel the footnote is sufficient or whether you want to comment on it on your blog.

My quick reply is that I hate this type 1 error thing.

Let me explain in the context of a simple example. Consider two classical designs:

1. N=20 experiment

2. N=2,000,000 experiment.

Both these have “type 1 error rates” of 0.05, but experiment #2 will be much more likely to give statistical significance. Who cares about the type 1 error rate? I don’t. The null hypothesis of zero effect and zero systematic error is always false.
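To make the comparison concrete, here is a minimal sketch, assuming a one-sample two-sided z-test with known standard deviation 1 and a hypothetical true effect of 0.01 standard deviations (both are illustrative assumptions, not anything from the paper). The helper names `normal_cdf` and `power_two_sided` are made up for this example.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_sided(effect, n, z_crit=1.959964):
    """P(|z| > z_crit) for a one-sample z-test, sd = 1, true mean = effect."""
    shift = effect * math.sqrt(n)  # mean of the z statistic
    return normal_cdf(-z_crit - shift) + 1.0 - normal_cdf(z_crit - shift)

# Assumed effect sizes: exactly zero, and a tiny nonzero effect of 0.01 sd.
for n in (20, 2_000_000):
    print(n, power_two_sided(0.0, n), power_two_sided(0.01, n))
```

With the effect set to exactly zero, both designs reject about 5% of the time. With the tiny nonzero effect, the N=20 design still rejects at close to 5%, while the N=2,000,000 design rejects essentially always.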

To put it another way: it’s *completely fine* to add participants after looking at the result. The goal should not be to get “statistical significance” or to get 95% intervals that exclude zero or whatever. Once you forget that, you can move forward.

But now let’s step back and consider the motivation for type 1 error control in the first place. The concern is that if you don’t control type 1 error, you’ll routinely jump to conclusions. I’d prefer to frame this in terms of type M (magnitude) and type S (sign) errors. I think the way to avoid jumping to unwarranted conclusions is by making each statement stand up on its own. To put it another way, I have no problem presenting a thousand 95% intervals, under the expectation that 50 will not contain the true value.
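As a rough check on the "thousand intervals" statement, here is a small simulation sketch. It assumes normal data with known standard deviation 1, 25 observations per interval, and true means drawn arbitrarily from (-1, 1); all of these are illustrative choices, and `count_misses` is a hypothetical helper name.

```python
import math
import random

def count_misses(n_intervals=1000, n=25, seed=1):
    """Count how many 95% intervals fail to contain their own true mean."""
    rng = random.Random(seed)
    half_width = 1.959964 / math.sqrt(n)  # known sd = 1 (assumption)
    misses = 0
    for _ in range(n_intervals):
        true_mean = rng.uniform(-1.0, 1.0)  # a different truth for each interval
        xs = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        xbar = sum(xs) / n
        if abs(xbar - true_mean) > half_width:
            misses += 1
    return misses

print(count_misses())
```

The count should land somewhere near 50, give or take sampling noise, with each interval making a statement that stands on its own.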

Andrew, small point of clarification. Are you advocating that the idea of type I and II error is only unhelpful in the setting of identifying non-null effects described here, or do you feel similarly about them in classification settings (e.g., my algorithm said this image contained an apple but it did not)?

Thanks,

Yep, type I error rate is zero because the null hypothesis is false. The incorrect 5% value is calculated by assuming that falsehood is true, and no amount of mathematical manipulation can ever overcome that fact. So, in real life “applications” of NHST you have only true positives and false negatives (type II errors).

All statistical significance has been measuring is the collective prior belief that there is an interesting relationship between some variables. A high prior means more resources will be expended to “detect” the relationship (increase sample size, etc. to get significance). A low prior leads to few resources and often failure to get significance.

Quote from above: “All statistical significance has been measuring is the collective prior belief that there is an interesting relationship between some variables. A high prior means more resources will be expended to “detect” the relationship (increase sample size, etc. to get significance). A low prior leads to few resources and often failure to get significance.”

Interesting comment in relation to proposed large-scale “collaborative” efforts that could (or will?) “revisit” all the social science classics of the last decades, like the IAT stuff, stereotype threat, etc. If you keep spending resources on it, you’ll eventually find something “significant” there! (Who cares if you could possibly spend your resources on something more valid, important, useful, predictive, explanatory, etc.)

Perhaps in line with (the gist of) your comment, i have recently also started to wonder if repeatedly focusing on the same few specific constructs in your experiments, conclusions, and causal mapping of things, might be some form of this problematic issue as well.

If “everything is somewhat correlated with everything” (see Meehl, 1990), then which specific construct you (aim to) measure in your experiments will necessarily determine large parts of the conclusions, i reason. But if that specific construct is (probably?) related to a whole lot of other constructs which are not measured, how is it possible to draw any real valid conclusions this way?

I think this problem could be partly overcome by executing, and emphasizing, more direct comparison tests of hypotheses and/or constructs to see which ones “explain” or “predict” more or better. I think this could be a crucial part of the (optimal) scientific process in social science, but i am not sure. This is also because i haven’t heard or read much about it in all the recent discussions about how to improve matters, which i find very strange…

I consider this to be a necessary aspect of science. You will get all sorts of protestations that “it is too complicated”, etc. if you bring this up. That is just laziness and lack of training in the basic skills, like calculus or programming, required to model dynamic systems.

“If you keep spending resources on it, you’ll eventually find something “significant” there! (Who cares if you could possibly spend your resources on something more valid, important, useful, predictive, explanatory, etc.)”

Maybe it doesn’t even matter so much if something “significant” is there, or how big/important it is.

As long as the construct gets talked about, and researched in some way or form, the machine keeps running.

As long as the machine keeps running, all kinds of things can be achieved like financial gain, political influence, media attention, etc.

Perhaps it’s all about keeping the machine running as long as possible, so you have lots of time to do, and achieve, and receive, lots of things.

This issue — data collection contingent on results up to the current moment — affects more than just the Type I error rate. Contingent data collection changes the sampling distribution of the data, so all sampling-theory-based methods of statistical inference must account for it. Estimators become biased in predictable directions and we’re not talking about the nice kind of bias that pays its rent by reducing variance and hence mean squared error. (What are the standard errors of these estimators anyway?) Exact coverage of interval procedures is no longer easy to achieve.
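Here is a minimal simulation sketch of both effects, under assumptions that are purely illustrative: normal data with known variance 1, a first batch of 20, and a rule of "stop if the two-sided z-test is significant at 0.05, otherwise add 20 more and test again". This is a more flexible rule than the one the footnote describes, and `two_stage` is a hypothetical name.

```python
import math
import random

Z_CRIT = 1.959964  # two-sided 5% critical value

def two_stage(mu, n1=20, n2=20, sims=20000, seed=2):
    """Return (rejection rate, average final estimate) under the two-stage rule."""
    rng = random.Random(seed)
    rejections = 0
    estimate_sum = 0.0
    for _ in range(sims):
        # The sum of n iid N(mu, 1) draws is N(n*mu, n), so draw the sum directly.
        total = rng.gauss(n1 * mu, math.sqrt(n1))
        n = n1
        if abs(total) / math.sqrt(n1) <= Z_CRIT:
            # Not significant at the first look: collect the second batch.
            total += rng.gauss(n2 * mu, math.sqrt(n2))
            n += n2
        if abs(total) / math.sqrt(n) > Z_CRIT:
            rejections += 1
        estimate_sum += total / n
    return rejections / sims, estimate_sum / sims

print(two_stage(0.0))  # rejection rate when the true mean is exactly zero
print(two_stage(0.3))  # average estimate when the true mean is 0.3
```

Under these assumptions the rejection rate at a true mean of zero comes out above the nominal 5%, and with a true mean of 0.3 the average of the final sample means sits above 0.3, because the runs that stop early are the ones that happened to start high.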

And a large proportion of the data collection that occurs in science is contingent on past results! So what happens when one tries to account for contingent data collection in the accumulation of knowledge? Here’s a quote from the abstract of the recent paper Accumulation Bias in Meta-Analysis: The Need to Consider Time in Error Control:

The quest for error control leads back to likelihood ratios.

It is easy enough to write a simulation that incorporates assumptions about how the data was collected into the null model. You don’t really need “exact” intervals.
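For instance, here is a sketch of what such a simulation might look like, assuming a two-look design with equal batch sizes and known variance, and using the same threshold at both looks (in the spirit of group-sequential boundaries). The function name `calibrate_threshold` and all settings are hypothetical.

```python
import math
import random

def calibrate_threshold(alpha=0.05, sims=40000, seed=3):
    """Find a common |z| threshold for both looks that gives overall rate alpha."""
    rng = random.Random(seed)
    # Null model: standardized z for batch 1 and, independently, for batch 2.
    draws = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(sims)]

    def reject_rate(c):
        hits = 0
        for z1, z2 in draws:
            z_final = (z1 + z2) / math.sqrt(2.0)  # pooled z with equal batch sizes
            # Reject at look 1, or continue and reject at look 2.
            if abs(z1) > c or abs(z_final) > c:
                hits += 1
        return hits / sims

    lo, hi = 1.5, 3.5  # reject_rate is decreasing in c; bracket the target
    for _ in range(30):
        mid = (lo + hi) / 2.0
        if reject_rate(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

print(calibrate_threshold())
```

The calibrated threshold comes out noticeably larger than 1.96, which is the price of taking the extra look. Reusing the same simulated draws for every candidate threshold keeps the bisection stable.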

From the abstract:

The way science “accumulates” is by synthesizing the evidence into a theory that explains the phenomenon and can be used to derive predictions about future/other phenomena. No cumulative progress is made by checking if this p value is less than 0.05, then that one, etc. That method is actually guaranteed to impede cumulative progress by generating conflicting results.

If we’re getting down to this level of concreteness then the issue becomes how and why will you trade off coverage probabilities of possible intervals at the different possible sample sizes. This is a procedure design question; simulations alone won’t answer it.

The sample size should be a parameter in the simulation, not sure what you mean.

Okay, simplest case. The model is normal, mean unknown, variance 1. The stopping rule: if the first datum is less than 1.6, stop; otherwise collect a second datum. The problem: specify a 95% one-sided upper confidence procedure. You need to decide, for each possible value of the mean, where you will put the upper bound for each of the two possible realized sample sizes. The coverage constraint says that for each possible value of the mean, the probability of exceeding the upper bounds is 5%. But you have two upper bounds, one for each possible sample size, so unlike in the fixed sample size design, the coverage constraint alone does not determine where the upper bounds are.
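To see why the fixed-sample recipe does not carry over, here is a sketch that checks the coverage of one possible procedure under this stopping rule: the naive bound "sample mean plus 1.645 divided by the square root of the realized n". That naive bound is an assumption chosen for illustration, not a procedure anyone in this thread proposed, and `naive_upper_coverage` is a hypothetical name.

```python
import math
import random

Z_ONE_SIDED = 1.644854  # one-sided 95% normal quantile

def naive_upper_coverage(mu, sims=100000, seed=4):
    """Coverage of xbar + 1.645/sqrt(n) when n depends on the first datum."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(sims):
        x1 = rng.gauss(mu, 1.0)
        if x1 < 1.6:
            n, xbar = 1, x1  # stopping rule: stop after one datum
        else:
            x2 = rng.gauss(mu, 1.0)  # otherwise collect a second datum
            n, xbar = 2, (x1 + x2) / 2.0
        upper = xbar + Z_ONE_SIDED / math.sqrt(n)
        if mu <= upper:
            covered += 1
    return covered / sims

for mu in (0.0, 2.5, 5.0):
    print(mu, naive_upper_coverage(mu))
```

Under these assumptions the coverage is close to 95% when the mean is far from the stopping threshold in either direction, but dips below 95% for means in between, so the naive bound is not an exact 95% procedure. Fixing that requires choosing how to move the two upper bounds, which is the design question.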

I was thinking of just modelling the process I believed generated the data and getting a credible interval. I use confidence intervals only because they often are computationally cheap means to approximate the corresponding credible interval. The credible interval would also approximate the coverage of the confidence interval in those cases.

If that doesn’t happen in this case then I don’t see what purpose the confidence interval would serve.

Actually, is there a case where the confidence interval and credible intervals* differ substantially but the confidence interval makes more sense? All the examples I’m finding only seem to show the confidence interval includes impossible values, etc.

*Using a uniform or near uniform prior, which I should have mentioned above as well.

I mean I don’t disagree, but the topic I’m addressing is combining sampling-theory-based methods in general with the idea that “it’s *completely fine* to add participants after looking at the result”.

I see what you mean. It’s just not something I would care about so failed to read your post like that.

The problem with actual meta-analyses is that there is a lot of uncertainty about whether what is reported in studies represents what actually was done and happened. As well as completely unreported studies.

+1

So obviously you need to simulate the publication bias. It isn’t hard to do that, but some of the parameters would be unknown. In that case the correct answer is we are just very uncertain as to what a given collection of literature means. I suspect that will be the case in general; most of these studies were a waste of time.

It is very difficult to criticise the use of the concepts of a Type I and Type II error.

This is of course because you are spoilt for choice in how you attack such ideas that are so irrelevant to statistical inference.

They were irrelevant when they were first proposed by Neyman and Pearson in the 1930s, they are irrelevant today, and were irrelevant in the time in between.

If you want the probability of an error, you should want the probability of making an error of ANY kind (which of course is NOT equal to adding together the probabilities of the Type I and Type II errors).

Please note that the use of Type I and Type II errors was not advocated or endorsed by Fisher (because he was actually a statistician, unlike Neyman who was basically just a mathematician).

They were quite relevant then, and will continue to be, insofar as we care to distinguish results of expected chance variability, & want to block being misled with high probability.

Naked Stat: Fisher surely cared about error probabilities–N-P followed him. He didn’t want type II error probabilities to be necessarily, explicitly required in all cases. Of course, his antipathy for Neyman after 1935 obscures how much of what he said was just anti-power because Neyman saw that without considering the alternative, it’s easy to finagle rejections of the null. The central problems people find with tests today stem from failing to be explicit about the alternative. Whether or not the alt is made explicit in formulating the test (it typically isn’t in the case of simple significance tests), one still needs to be explicit about what inference one is entitled to draw upon finding significance–this holds in testing assumptions as well.

So it’s fine to keep searching until you get what appears to be an effect with large magnitude? Error probabilities, used correctly, don’t just control error probabilities in the long run, but indicate if a lousy job has been done in the case at hand (in avoiding misconstruing noise as genuine).

> So it’s fine to keep searching until you get what appears to be an effect with large magnitude?

If you keep searching, you might eventually get what appears to be a significant effect with a small magnitude (smaller the longer you keep searching). Getting what appears to be an effect with large magnitude is less and less likely as the sample size increases (assuming there is no large effect to be found!).
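A quick way to see this, assuming a z-test on a mean with known standard deviation 1 (an illustrative simplification): the smallest estimate that can reach two-sided significance at 0.05 shrinks like one over the square root of n. The function name is hypothetical.

```python
import math

def min_significant_estimate(n, sd=1.0, z_crit=1.959964):
    """Smallest |sample mean| that clears the two-sided 5% threshold at size n."""
    return z_crit * sd / math.sqrt(n)

for n in (20, 40, 2000, 2_000_000):
    print(n, min_significant_estimate(n))
```

At N=20 an estimate has to be around 0.44 standard deviations to be significant; at N=2,000,000 an estimate of about 0.0014 is enough.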