Also, I think it’s very good that they mentioned that they did it at all. Full, honest, and open reporting is the only way we can really assess results.

If you keep searching, you might eventually get what appears to be a significant effect, but with a small magnitude (the smaller, the longer you have kept searching). Getting what appears to be an effect with large magnitude is less and less likely as the sample size increases (assuming there is no large effect to be found!).
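A minimal simulation makes this concrete. Everything here is an illustrative assumption (a z-test with known sd, peeking from n = 10 onward, 20 runs), not anything from the original discussion — but it shows both halves of the point: on pure-noise data, many runs eventually reach "significance," and the apparent effect at that moment is pinned near 1.96/√n, i.e. it shrinks the longer the search took.

```python
import math
import random

def first_significant(max_n=10_000, alpha=0.05, seed=0):
    """Add one null (N(0,1)) observation at a time, running a two-sided
    z-test after each; return (n, |mean|) the first time p < alpha."""
    rng = random.Random(seed)
    total, n = 0.0, 0
    while n < max_n:
        total += rng.gauss(0.0, 1.0)
        n += 1
        if n >= 10:
            z = (total / n) * math.sqrt(n)
            if math.erfc(abs(z) / math.sqrt(2)) < alpha:  # two-sided p
                return n, abs(total / n)
    return None

# The null is true in every run, yet many seeds eventually hit
# "significance" -- with an apparent effect of roughly 1.96 / sqrt(n).
hits = [h for h in (first_significant(seed=s) for s in range(20)) if h]
for n, effect in hits:
    print(f"'significant' at n={n}, apparent effect {effect:.3f}")
```

Note that the apparent effect at the stopping time is mechanically bounded below by the significance threshold itself, which is why late "discoveries" are always small ones.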

Maybe it doesn’t even matter so much if something “significant” is there, or how big/important it is.

As long as the construct gets talked about, and researched in some way or form, the machine keeps running.

As long as the machine keeps running, all kinds of things can be achieved, like financial gain, political influence, media attention, etc.

Perhaps it’s all about keeping the machine running as long as possible, so you have lots of time to do, and achieve, and receive, lots of things.

*Using a uniform or near-uniform prior, which I should have mentioned above as well.

If that doesn’t happen in this case then I don’t see what purpose the confidence interval would serve.

It is easy enough to write a simulation that incorporates assumptions about how the data was collected into the null model. You don’t really need “exact” intervals.

If we’re getting down to this level of concreteness then the issue becomes how and why will you trade off coverage probabilities of possible intervals at the different possible sample sizes. This is a procedure design question; simulations alone won’t answer it.
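A sketch of the kind of simulation being discussed, under entirely made-up assumptions: a toy data-collection rule (stop early once the running mean looks large, a stand-in for whatever rule was really used) and the textbook 95% interval with known sd. The simulation reports the interval's actual coverage under that collection rule — which is exactly the quantity whose trade-off across sample sizes is the design question raised above.

```python
import math
import random

def adaptive_sample(rng, mu, n_min=10, n_max=100, stop_at=0.5):
    """Hypothetical collection rule: gather n_min points, then keep adding
    points until |running mean| > stop_at or n_max is reached."""
    data = [rng.gauss(mu, 1.0) for _ in range(n_min)]
    total = sum(data)
    while len(data) < n_max and abs(total / len(data)) <= stop_at:
        x = rng.gauss(mu, 1.0)
        data.append(x)
        total += x
    return data

def ci_covers(data, mu):
    """Does the nominal 95% interval (known sd = 1) contain mu?"""
    m = sum(data) / len(data)
    half = 1.96 / math.sqrt(len(data))
    return m - half <= mu <= m + half

rng = random.Random(1)
reps = 2000
coverage = sum(ci_covers(adaptive_sample(rng, mu=0.0), mu=0.0)
               for _ in range(reps)) / reps
print(f"simulated coverage of the nominal 95% interval: {coverage:.3f}")
```

Under this particular stopping rule the nominal interval undercovers, because stopping is triggered precisely by extreme running means; how much undercoverage to tolerate at which sample sizes is the procedure-design question, which the simulation measures but does not decide.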

I think this problem could be partly overcome by executing, and emphasizing, more direct comparison tests of hypotheses and/or constructs to see which ones “explain” or “predict” more or better.

I consider this to be a necessary aspect of science. You will get all sorts of protestations that “it is too complicated,” etc., if you bring this up. That is just laziness and a lack of training in the basic skills required to model dynamic systems, like calculus or programming.

Exact coverage of interval procedures is no longer easy to achieve.

It is easy enough to write a simulation that incorporates assumptions about how the data was collected into the null model. You don’t really need “exact” intervals.

From the abstract:

Dependencies introduce bias — Accumulation Bias — and invalidate the sampling distribution assumed for p-value tests, thus inflating type-I errors. But dependencies are also inevitable, since for science to accumulate efficiently, new research needs to be informed by past results.

The way science “accumulates” is by synthesizing the evidence into a theory that explains the phenomenon and can be used to derive predictions about future/other phenomena. No cumulative progress is made by checking whether this p-value is less than 0.05, then that one, etc. That method is actually guaranteed to impede cumulative progress by generating conflicting results.
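The "conflicting results" point can be made concrete with a toy simulation (the effect size 0.3 and the sample sizes 30 and 200 are arbitrary illustrative choices): two studies of the *same* true effect, run at different sample sizes, routinely land on opposite sides of the 0.05 threshold.

```python
import math
import random

def p_value(sample):
    """Two-sided z-test of 'mean = 0', known sd = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(0)
mu = 0.3                     # the SAME modest true effect for both studies
n_small, n_large = 30, 200   # two labs with different resources

def study_significant(n):
    return p_value([rng.gauss(mu, 1.0) for _ in range(n)]) < 0.05

reps = 2000
conflicts = sum(study_significant(n_small) != study_significant(n_large)
                for _ in range(reps))
print(f"fraction of study pairs that 'conflict': {conflicts / reps:.2f}")
```

With these numbers the two studies disagree on "significance" most of the time, even though they measure an identical effect — the disagreement is a property of the threshold, not of the phenomenon.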

This is of course because you are spoilt for choice in how you attack such ideas, which are so irrelevant to statistical inference.

They were irrelevant when they were first proposed by Neyman and Pearson in the 1930s, they are irrelevant today, and they were irrelevant in the time in between.

If you want the probability of an error, you should want the probability of making an error of ANY kind (which of course is NOT equal to adding together the probabilities of the Type I and Type II errors).
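The arithmetic behind that parenthetical is worth spelling out. The two error types occur under mutually exclusive states of nature (H0 is either true or it isn't), so the overall error probability is a weighted average, not a sum. The numbers below are purely illustrative assumptions:

```python
# Hypothetical numbers for illustration only.
alpha = 0.05   # P(reject H0 | H0 true)       -- Type I error rate
beta  = 0.20   # P(fail to reject | H1 true)  -- Type II error rate
p_h0  = 0.5    # assumed probability that H0 is actually true

# The two errors can never occur on the same dataset, so the overall
# probability of making an error of ANY kind is a weighted average:
p_any_error = p_h0 * alpha + (1 - p_h0) * beta
print(p_any_error)   # 0.125 -- not alpha + beta = 0.25
```

Note that computing this at all requires a probability on the hypotheses themselves, which the Neyman–Pearson framework does not supply — which is arguably part of the complaint here.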

Please note that the use of Type I and Type II errors was not advocated or endorsed by Fisher (because he was actually a statistician, unlike Neyman, who was basically just a mathematician).

And a large proportion of the data collection that occurs in science is contingent on past results! So what happens when one tries to account for contingent data collection in the accumulation of knowledge? Here’s a quote from the abstract of the recent paper Accumulation Bias in Meta-Analysis: The Need to Consider Time in Error Control:

We introduce an Accumulation Bias Framework that allows us to model a wide variety of practically occurring dependencies, including study series accumulation, meta-analysis timing, and approaches to multiple testing in living systematic reviews. The strength of this framework is that it shows how all dependencies affect p-value-based tests in a similar manner. This leads to two main conclusions. First, Accumulation Bias is inevitable, and even if it can be approximated and accounted for, no valid p-value tests can be constructed. Second, tests based on likelihood ratios withstand Accumulation Bias: they provide bounds on error probabilities that remain valid despite the bias. We leave the reader with a choice between two proposals to consider time in error control: either treat individual (primary) studies and meta-analyses as two separate worlds — each with their own timing — or integrate individual studies in the meta-analysis world. Taking up likelihood ratios in either approach allows for valid tests that relate well to the accumulating nature of scientific knowledge.

The quest for error control leads back to likelihood ratios.
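The error bound the abstract refers to can be demonstrated in a few lines. Under H0 the accumulating likelihood ratio is a nonnegative martingale with expectation 1, so by Markov's inequality the chance it *ever* reaches 1/α is at most α — no matter how often you peek or when you stop. A sketch, with an arbitrary alternative (mean 0.3) and threshold (20), simulated on purely null data:

```python
import math
import random

def lr_ever_crosses(rng, mu1=0.3, threshold=20.0, n_max=2000):
    """Accumulate the likelihood ratio of N(mu1, 1) vs. N(0, 1) on data that
    are actually null, peeking after every single observation; report
    whether the LR ever reaches the threshold."""
    log_lr = 0.0
    log_thresh = math.log(threshold)
    for _ in range(n_max):
        x = rng.gauss(0.0, 1.0)               # truth: H0
        log_lr += mu1 * x - mu1 * mu1 / 2.0   # log of per-observation LR
        if log_lr >= log_thresh:
            return True
    return False

rng = random.Random(2)
reps = 2000
rate = sum(lr_ever_crosses(rng) for _ in range(reps)) / reps
print(f"LR >= 20 reached on null data in {rate:.3f} of runs (bound: 1/20 = 0.05)")
```

Contrast this with the p-value simulation earlier in the thread: continual peeking eventually drives a p-value below any threshold on null data, while the likelihood-ratio crossing rate stays below 1/threshold regardless of the monitoring scheme.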

Interesting comment in relation to proposed large-scale “collaborative” efforts that could (or will?) “revisit” all the social-science classics of the last decades, like the IAT stuff, stereotype threat, etc. If you keep spending resources on it, you’ll eventually find something “significant” there! (Who cares if you could possibly spend your resources on something more valid, important, useful, predictive, explanatory, etc.)

Perhaps in line with (the gist of) your comment, I have recently also started to wonder whether repeatedly focusing on the same few specific constructs in your experiments, conclusions, and causal mapping of things might be some form of this problematic issue as well.

If “everything is somewhat correlated with everything” (see Meehl, 1990), the specific construct you (aim to) measure in your experiments will necessarily determine large parts of the conclusions you reach. But if that specific construct is (probably?) related to a whole lot of other constructs which are not measured, how is it possible to draw any truly valid conclusions this way?

I think this problem could be partly overcome by executing, and emphasizing, more direct comparison tests of hypotheses and/or constructs to see which ones “explain” or “predict” more or better. I think this could be a crucial part of the (optimal) scientific process in social science, but I am not sure. This is also because I haven’t heard or read much about it in all the recent discussions about how to improve matters, which I find very strange…
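One simple form such a direct comparison could take (everything here is an invented toy setup, not a claim about how it should be done): pit two candidate constructs against each other on out-of-sample prediction of the same outcome, rather than asking whether each one is "significant" in isolation.

```python
import random

rng = random.Random(3)

# Toy world: outcome y is driven mostly by construct a, weakly by b.
n = 500
a = [rng.gauss(0, 1) for _ in range(n)]
b = [rng.gauss(0, 1) for _ in range(n)]
y = [0.8 * ai + 0.1 * bi + rng.gauss(0, 1) for ai, bi in zip(a, b)]

def oos_error(x, y, split=250):
    """Fit a no-intercept slope on the first half of the data,
    score mean squared prediction error on the held-out second half."""
    slope = (sum(xi * yi for xi, yi in zip(x[:split], y[:split]))
             / sum(xi * xi for xi in x[:split]))
    test = list(zip(x[split:], y[split:]))
    return sum((yi - slope * xi) ** 2 for xi, yi in test) / len(test)

err_a, err_b = oos_error(a, y), oos_error(b, y)
print(f"out-of-sample MSE using a: {err_a:.2f}, using b: {err_b:.2f}")
```

Both constructs would correlate "significantly" with y at a large enough sample size; the head-to-head predictive comparison is what actually separates them.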

All that statistical significance has been measuring is the collective prior belief that there is an interesting relationship between some variables. A high prior means more resources will be expended to “detect” the relationship (increase sample size, etc., to get significance). A low prior leads to few resources and often failure to get significance.

Thanks,