This looks like a response to the quote. In that case, I think you’re ignoring that the authors argue they wanted more precise estimates of the effect and didn’t run extra participants to chase statistical significance. In research following that example, N will be correlated with the variance, so the larger-N studies will have inflated variances, but it’s not a very big bias (I ran a few simulations). The type I error rate won’t be affected at all. (And a narrow CI helps one actually say something when there’s no effect.)
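A minimal sketch of the kind of simulation I mean (all numbers are illustrative assumptions, not from any actual study): keep adding observations until the 95% CI is narrower than a target half-width, with the true effect fixed at zero, then test. Because the stopping rule depends on the variance rather than on the estimate, the rejection rate stays near the nominal 5% (any small slippage comes mostly from using the normal approximation at modest n):

```python
import numpy as np

rng = np.random.default_rng(0)

def precision_stopping_trial(target_halfwidth=0.4, n_min=10, n_max=200):
    """Add observations (true effect = 0) until the 95% CI half-width
    drops below the target, then test H0: mean = 0 at alpha = .05."""
    data = list(rng.normal(0.0, 1.0, n_min))
    while len(data) < n_max:
        se = np.std(data, ddof=1) / np.sqrt(len(data))
        if 1.96 * se < target_halfwidth:
            break
        data.append(rng.normal(0.0, 1.0))
    data = np.asarray(data)
    z = data.mean() / (data.std(ddof=1) / np.sqrt(len(data)))
    return len(data), abs(z) > 1.96

results = [precision_stopping_trial() for _ in range(5000)]
ns, rejects = zip(*results)
print("mean n:", np.mean(ns))
print("rejection rate under H0:", np.mean(rejects))  # close to .05
```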

Also, I think it’s very good that they mentioned it at all. Fully honest and open reporting is the only way we can really assess results.

]]>> So it’s fine to keep searching until you get what appears to be an effect with large magnitude?

If you keep searching, you might eventually get what appears to be a significant effect with a small magnitude (the smaller, the longer you keep searching). Getting what appears to be an effect with a large magnitude becomes less and less likely as the sample size increases (assuming there is no large effect to be found!).
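A quick sketch of that point (purely illustrative: true effect fixed at zero, and a test is run after every added observation): runs that stop at “significance” early show large apparent effects, while runs that drag on end up with small estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

def chase_significance(n_start=10, n_max=500):
    """True effect is zero. Test after every added observation; stop at
    the first |z| > 1.96 and return (n, |estimate|) at that point."""
    data = list(rng.normal(0.0, 1.0, n_start))
    while len(data) < n_max:
        n = len(data)
        z = np.mean(data) / (np.std(data, ddof=1) / np.sqrt(n))
        if abs(z) > 1.96:
            return n, abs(np.mean(data))
        data.append(rng.normal(0.0, 1.0))
    return n_max, abs(np.mean(data))

runs = [chase_significance() for _ in range(2000)]
early = [est for n, est in runs if n < 50]
late = [est for n, est in runs if n >= 200]
print("median |estimate|, stopped early:", np.median(early))
print("median |estimate|, kept searching:", np.median(late))
```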

]]>Naked Stat: Fisher surely cared about error probabilities–N-P followed him. He didn’t want type II error probabilities to be explicitly required in all cases. Of course, his antipathy for Neyman after 1935 obscures how much of what he said was just anti-power rhetoric; Neyman saw that without considering the alternative, it’s easy to finagle rejections of the null. The central problems people find with tests today stem from failing to be explicit about the alternative. Whether or not the alternative is made explicit in formulating the test (it typically isn’t in the case of simple significance tests), one still needs to be explicit about what inference one is entitled to draw upon finding significance–this holds in testing assumptions as well.

]]>They were quite relevant then, and will continue to be, insofar as we care to distinguish results from expected chance variability and want to block being misled with high probability.

]]>“If you keep spending resources on it, you’ll eventually find something “significant” there! (who cares if you could possibly spend your resources on something more valid, important, useful, predictive, explanatory, etc.)”

Maybe it doesn’t even matter so much if something “significant” is there, or how big/important it is.

As long as the construct gets talked about, and researched in some way or form, the machine keeps running.

As long as the machine keeps running, all kinds of things can be achieved like financial gain, political influence, media attention, etc.

Perhaps it’s all about keeping the machine running as long as possible, so you have lots of time to do, and achieve, and receive, lots of things.

]]>I see what you mean. It’s just not something I would care about, so I failed to read your post like that.

]]>I mean I don’t disagree, but the topic I’m addressing is combining sampling-theory-based methods in general with the idea that “it’s *completely fine* to add participants after looking at the result”.

Actually, is there a case where the confidence interval and credible interval* differ substantially but the confidence interval makes more sense? All the examples I’m finding only seem to show cases where the confidence interval includes impossible values, etc.

*Using a uniform or near uniform prior, which I should have mentioned above as well.

]]>I was thinking of just modelling the process I believed generated the data and getting a credible interval. I use confidence intervals only because they are often a computationally cheap way to approximate the corresponding credible interval. The credible interval would also approximate the coverage of the confidence interval in those cases.

If that doesn’t happen in this case then I don’t see what purpose the confidence interval would serve.
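For the simplest case of the correspondence I have in mind, normal data with known variance and a flat prior on the mean, the two intervals coincide, which is the sense in which the confidence interval is a cheap approximation. A toy check (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n = 1.0, 25
x = rng.normal(0.3, sigma, n)

# Confidence interval: xbar +/- 1.96 * sigma / sqrt(n)
half = 1.96 * sigma / np.sqrt(n)
ci = (x.mean() - half, x.mean() + half)

# Credible interval under a flat prior on the mean (known sigma):
# the posterior is Normal(xbar, sigma^2 / n), so the central 95%
# credible interval matches the CI above up to Monte Carlo error.
draws = rng.normal(x.mean(), sigma / np.sqrt(n), 200_000)
cred = (np.quantile(draws, 0.025), np.quantile(draws, 0.975))

print("confidence:", ci)
print("credible  :", cred)
```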

]]>Okay, simplest case. The model is normal, mean unknown, variance 1. The stopping rule: if the first datum is less than 1.6, stop; otherwise collect a second datum. The problem: specify a 95% one-sided upper confidence procedure. You need to decide, for each possible value of the mean, where you will put the upper bound for each of the two possible realized sample sizes. The coverage constraint says that for each possible value of the mean, the probability of exceeding the upper bounds is 5%. But you have two upper bounds, one for each possible sample size, so unlike in the fixed sample size design, the coverage constraint alone does not determine where the upper bounds are.
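A quick simulation of this setup (illustrative, not a full treatment): if you naively apply the fixed-n bound xbar + 1.645/sqrt(n) at whichever sample size you end up with, the miss rate is no longer 5% for every value of the mean, so the naive bounds fail the coverage constraint, and the constraint alone doesn’t tell you how to repair them.

```python
import numpy as np

rng = np.random.default_rng(3)

def naive_miss_rate(mu, reps=200_000):
    """Two-stage rule: stop after x1 if x1 < 1.6, else also take x2.
    At whichever n is realized, apply the fixed-n 95% upper bound
    xbar + 1.645 / sqrt(n), and count how often it falls below mu."""
    x1 = rng.normal(mu, 1.0, reps)
    x2 = rng.normal(mu, 1.0, reps)
    stop = x1 < 1.6
    xbar = np.where(stop, x1, (x1 + x2) / 2)
    n = np.where(stop, 1, 2)
    upper = xbar + 1.645 / np.sqrt(n)
    return np.mean(upper < mu)

for mu in (0.0, 1.6, 3.0):
    print(f"mu = {mu}: miss rate = {naive_miss_rate(mu):.3f}")
```

For means near zero the naive bounds hold close to 5%, but for larger means the miss rate drifts well above it, which is exactly why the designer must decide how to allocate coverage across the two realized sample sizes.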

]]>The sample size should be a parameter in the simulation; I’m not sure what you mean.

]]>So obviously you need to simulate the publication bias. It isn’t hard to do that, but some of the parameters would be unknown. In that case the correct answer is that we are just very uncertain as to what a given collection of literature means. I suspect that will be the case in general: most of these studies were a waste of time.
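A sketch of such a simulation (every number here, especially the chance that a null result gets published, is exactly the kind of unknown parameter I mean; these values are assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

def published_estimates(true_effect, n_studies=2000, n_per=30,
                        p_publish_null=0.1):
    """Each study reports a mean with se = 1/sqrt(n_per). 'Significant'
    results are always published; null results only with probability
    p_publish_null (unknown in real life)."""
    se = 1.0 / np.sqrt(n_per)
    est = rng.normal(true_effect, se, n_studies)
    signif = np.abs(est / se) > 1.96
    published = signif | (rng.random(n_studies) < p_publish_null)
    return est[published]

pub = published_estimates(true_effect=0.1)
print("true effect: 0.1, naive mean of published estimates:",
      round(pub.mean(), 3))
```

Averaging only the published estimates inflates the apparent effect well above the true 0.1, and by an amount that depends on the unknown publication probabilities, which is why the honest conclusion is often just uncertainty.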

]]>+1

]]>The problem with actual meta-analyses is that there is a lot of uncertainty about whether what is reported in studies represents what actually was done and happened, as well as the problem of completely unreported studies.

]]>It is easy enough to write a simulation that incorporates assumptions about how the data was collected into the null model. You don’t really need “exact” intervals.

If we’re getting down to this level of concreteness, then the issue becomes how and why you will trade off the coverage probabilities of the possible intervals at the different possible sample sizes. This is a procedure design question; simulations alone won’t answer it.

]]>I think this problem could be partly overcome by executing, and emphasizing, more direct comparison tests of hypotheses and/or constructs to see which ones “explain” or “predict” more or better.

I consider this to be a necessary aspect of science. You will get all sorts of protestations that “it is too complicated”, etc., if you bring this up. That is just laziness and a lack of training in the basic skills required to model dynamic systems, such as calculus or programming.

]]>Exact coverage of interval procedures is no longer easy to achieve.

It is easy enough to write a simulation that incorporates assumptions about how the data was collected into the null model. You don’t really need “exact” intervals.

From the abstract:

Dependencies introduce bias — Accumulation Bias — and invalidate the sampling distribution assumed for p-value tests, thus inflating type-I errors. But dependencies are also inevitable, since for science to accumulate efficiently, new research needs to be informed by past results.

The way science “accumulates” is by synthesizing the evidence into a theory that explains the phenomenon and can be used to derive predictions about future/other phenomena. No cumulative progress is made by checking whether this p-value is less than 0.05, then that one, etc. That method is actually guaranteed to impede cumulative progress by generating conflicting results.

]]>This is of course because you are spoilt for choice in how you attack such ideas that are so irrelevant to statistical inference.

They were irrelevant when they were first proposed by Neyman and Pearson in the 1930s, they are irrelevant today, and they were irrelevant in the time in between.

If you want the probability of an error, you should want the probability of making an error of ANY kind (which of course is NOT equal to adding together the probabilities of the Type I and Type II errors).
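To spell that out with illustrative numbers (the base rate of true nulls is an assumption here, and is exactly the quantity one rarely knows in practice): the overall error probability is a weighted average of the two error rates, not their sum.

```python
# Overall error probability = weighted average of alpha and beta,
# weighted by how often each hypothesis is actually true.
# All numbers are illustrative assumptions, not from the thread.
alpha, beta = 0.05, 0.20   # type I and type II error rates
p_h0 = 0.5                 # assumed share of cases where the null is true
p_error = p_h0 * alpha + (1 - p_h0) * beta
print(round(p_error, 3))   # 0.125, versus alpha + beta = 0.25
```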

Please note that the use of Type I and Type II errors was not advocated or endorsed by Fisher (because he was actually a statistician, unlike Neyman, who was basically just a mathematician).

]]>And a large proportion of the data collection that occurs in science is contingent on past results! So what happens when one tries to account for contingent data collection in the accumulation of knowledge? Here’s a quote from the abstract of the recent paper Accumulation Bias in Meta-Analysis: The Need to Consider Time in Error Control:

We introduce an Accumulation Bias Framework that allows us to model a wide variety of practically occurring dependencies, including study series accumulation, meta-analysis timing, and approaches to multiple testing in living systematic reviews. The strength of this framework is that it shows how all dependencies affect p-value-based tests in a similar manner. This leads to two main conclusions. First, Accumulation Bias is inevitable, and even if it can be approximated and accounted for, no valid p-value tests can be constructed. Second, tests based on likelihood ratios withstand Accumulation Bias: they provide bounds on error probabilities that remain valid despite the bias. We leave the reader with a choice between two proposals to consider time in error control: either treat individual (primary) studies and meta-analyses as two separate worlds — each with their own timing — or integrate individual studies in the meta-analysis world. Taking up likelihood ratios in either approach allows for valid tests that relate well to the accumulating nature of scientific knowledge.

The quest for error control leads back to likelihood ratios.

]]>Quote from above: “All statistical significance has been measuring is the collective prior belief that there is an interesting relationship between some variables. A high prior means more resources will be expended to “detect” the relationship (increase sample size, etc., to get significance). A low prior leads to few resources and often failure to get significance.”

Interesting comment in relation to proposed large-scale “collaborative” efforts that could (or will?) “revisit” all the social science classics of the last decades, like the IAT stuff, stereotype threat, etc. If you keep spending resources on it, you’ll eventually find something “significant” there! (who cares if you could possibly spend your resources on something more valid, important, useful, predictive, explanatory, etc.)

Perhaps in line with (the gist of) your comment, I have recently also started to wonder if repeatedly focusing on the same few specific constructs in your experiments, conclusions, and causal mapping of things might be some form of this problematic issue as well.

If “everything is somewhat correlated with everything” (see Meehl, 1990), then the specific construct you (aim to) measure in your experiments will necessarily determine large parts of the conclusions, I reason. But if that specific construct is (probably?) related to a whole lot of other constructs which are not measured, how is it possible to draw any really valid conclusions this way?

I think this problem could be partly overcome by executing, and emphasizing, more direct comparison tests of hypotheses and/or constructs to see which ones “explain” or “predict” more or better. I think this could be a crucial part of the (optimal) scientific process in social science, but I am not sure. This is also because I haven’t heard or read much about it in all the recent discussions about how to improve matters, which I find very strange…

]]>All statistical significance has been measuring is the collective prior belief that there is an interesting relationship between some variables. A high prior means more resources will be expended to “detect” the relationship (increase sample size, etc., to get significance). A low prior leads to few resources and often failure to get significance.

]]>Thanks.