Also, I think it’s very good that they mentioned that they did it at all. Full, honest, and open reporting is the only way we can really assess results.

If you keep searching, you might eventually get what appears to be a significant effect, but with a small magnitude (the smaller, the longer you have kept searching). Getting what appears to be an effect with large magnitude is less and less likely as the sample size increases (assuming there is no large effect to be found!).
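A minimal simulation makes this concrete. Everything here is an illustrative assumption (a z-test with known sd, peeking from n = 10 onward, 20 runs), not anything from the original discussion — but it shows both halves of the point: on pure-noise data, many runs eventually reach "significance," and the apparent effect at that moment is pinned near 1.96/√n, i.e. it shrinks the longer the search took.

```python
import math
import random

def first_significant(max_n=10_000, alpha=0.05, seed=0):
    """Add one null (N(0,1)) observation at a time, running a two-sided
    z-test after each; return (n, |mean|) the first time p < alpha."""
    rng = random.Random(seed)
    total, n = 0.0, 0
    while n < max_n:
        total += rng.gauss(0.0, 1.0)
        n += 1
        if n >= 10:
            z = (total / n) * math.sqrt(n)
            if math.erfc(abs(z) / math.sqrt(2)) < alpha:  # two-sided p
                return n, abs(total / n)
    return None

# The null is true in every run, yet many seeds eventually hit
# "significance" -- with an apparent effect of roughly 1.96 / sqrt(n).
hits = [h for h in (first_significant(seed=s) for s in range(20)) if h]
for n, effect in hits:
    print(f"'significant' at n={n}, apparent effect {effect:.3f}")
```

Note that the apparent effect at the stopping time is mechanically bounded below by the significance threshold itself, which is why late "discoveries" are always small ones.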

Maybe it doesn’t even matter so much if something “significant” is there, or how big/important it is.

As long as the construct gets talked about, and researched in some way or form, the machine keeps running.

As long as the machine keeps running, all kinds of things can be achieved, like financial gain, political influence, media attention, etc.

Perhaps it’s all about keeping the machine running as long as possible, so you have lots of time to do, and achieve, and receive, lots of things.

*Using a uniform or near-uniform prior, which I should have mentioned above as well.

If that doesn’t happen in this case then I don’t see what purpose the confidence interval would serve.

It is easy enough to write a simulation that incorporates assumptions about how the data was collected into the null model. You don’t really need “exact” intervals.

If we’re getting down to this level of concreteness then the issue becomes how and why will you trade off coverage probabilities of possible intervals at the different possible sample sizes. This is a procedure design question; simulations alone won’t answer it.
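A sketch of the kind of simulation being discussed, under entirely made-up assumptions: a toy data-collection rule (stop early once the running mean looks large, a stand-in for whatever rule was really used) and the textbook 95% interval with known sd. The simulation reports the interval's actual coverage under that collection rule — which is exactly the quantity whose trade-off across sample sizes is the design question raised above.

```python
import math
import random

def adaptive_sample(rng, mu, n_min=10, n_max=100, stop_at=0.5):
    """Hypothetical collection rule: gather n_min points, then keep adding
    points until |running mean| > stop_at or n_max is reached."""
    data = [rng.gauss(mu, 1.0) for _ in range(n_min)]
    total = sum(data)
    while len(data) < n_max and abs(total / len(data)) <= stop_at:
        x = rng.gauss(mu, 1.0)
        data.append(x)
        total += x
    return data

def ci_covers(data, mu):
    """Does the nominal 95% interval (known sd = 1) contain mu?"""
    m = sum(data) / len(data)
    half = 1.96 / math.sqrt(len(data))
    return m - half <= mu <= m + half

rng = random.Random(1)
reps = 2000
coverage = sum(ci_covers(adaptive_sample(rng, mu=0.0), mu=0.0)
               for _ in range(reps)) / reps
print(f"simulated coverage of the nominal 95% interval: {coverage:.3f}")
```

Under this particular stopping rule the nominal interval undercovers, because stopping is triggered precisely by extreme running means; how much undercoverage to tolerate at which sample sizes is the procedure-design question, which the simulation measures but does not decide.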

I think this problem could be partly overcome by executing, and emphasizing, more direct comparison tests of hypotheses and/or constructs to see which ones “explain” or “predict” more or better.

I consider this to be a necessary aspect of science. You will get all sorts of protestations that “it is too complicated,” etc., if you bring this up. That is just laziness and a lack of training in the basic skills required to model dynamic systems, like calculus or programming.

Exact coverage of interval procedures is no longer easy to achieve.

It is easy enough to write a simulation that incorporates assumptions about how the data was collected into the null model. You don’t really need “exact” intervals.

From the abstract:

Dependencies introduce bias — Accumulation Bias — and invalidate the sampling distribution assumed for p-value tests, thus inflating type-I errors. But dependencies are also inevitable, since for science to accumulate efficiently, new research needs to be informed by past results.

The way science “accumulates” is by synthesizing the evidence into a theory that explains the phenomenon and can be used to derive predictions about future/other phenomena. No cumulative progress is made by checking whether this p-value is less than 0.05, then that one, etc. That method is actually guaranteed to impede cumulative progress by generating conflicting results.
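The "conflicting results" point can be made concrete with a toy simulation (the effect size 0.3 and the sample sizes 30 and 200 are arbitrary illustrative choices): two studies of the *same* true effect, run at different sample sizes, routinely land on opposite sides of the 0.05 threshold.

```python
import math
import random

def p_value(sample):
    """Two-sided z-test of 'mean = 0', known sd = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(0)
mu = 0.3                     # the SAME modest true effect for both studies
n_small, n_large = 30, 200   # two labs with different resources

def study_significant(n):
    return p_value([rng.gauss(mu, 1.0) for _ in range(n)]) < 0.05

reps = 2000
conflicts = sum(study_significant(n_small) != study_significant(n_large)
                for _ in range(reps))
print(f"fraction of study pairs that 'conflict': {conflicts / reps:.2f}")
```

With these numbers the two studies disagree on "significance" most of the time, even though they measure an identical effect — the disagreement is a property of the threshold, not of the phenomenon.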

This is of course because you are spoilt for choice in how you attack such ideas, which are so irrelevant to statistical inference.

They were irrelevant when they were first proposed by Neyman and Pearson in the 1930s, they are irrelevant today, and they were irrelevant in the time in between.

If you want the probability of an error, you should want the probability of making an error of ANY kind (which of course is NOT equal to adding together the probabilities of the Type I and Type II errors).
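The arithmetic behind that parenthetical is worth spelling out. The two error types occur under mutually exclusive states of nature (H0 is either true or it isn't), so the overall error probability is a weighted average, not a sum. The numbers below are purely illustrative assumptions:

```python
# Hypothetical numbers for illustration only.
alpha = 0.05   # P(reject H0 | H0 true)       -- Type I error rate
beta  = 0.20   # P(fail to reject | H1 true)  -- Type II error rate
p_h0  = 0.5    # assumed probability that H0 is actually true

# The two errors can never occur on the same dataset, so the overall
# probability of making an error of ANY kind is a weighted average:
p_any_error = p_h0 * alpha + (1 - p_h0) * beta
print(p_any_error)   # 0.125 -- not alpha + beta = 0.25
```

Note that computing this at all requires a probability on the hypotheses themselves, which the Neyman–Pearson framework does not supply — which is arguably part of the complaint here.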

Please note that the use of Type I and Type II errors was not advocated or endorsed by Fisher (because he was actually a statistician, unlike Neyman, who was basically just a mathematician).

And a large proportion of the data collection that occurs in science is contingent on past results! So what happens when one tries to account for contingent data collection in the accumulation of knowledge? Here’s a quote from the abstract of the recent paper Accumulation Bias in Meta-Analysis: The Need to Consider Time in Error Control:

We introduce an Accumulation Bias Framework that allows us to model a wide variety of practically occurring dependencies, including study series accumulation, meta-analysis timing, and approaches to multiple testing in living systematic reviews. The strength of this framework is that it shows how all dependencies affect p-value-based tests in a similar manner. This leads to two main conclusions. First, Accumulation Bias is inevitable, and even if it can be approximated and accounted for, no valid p-value tests can be constructed. Second, tests based on likelihood ratios withstand Accumulation Bias: they provide bounds on error probabilities that remain valid despite the bias. We leave the reader with a choice between two proposals to consider time in error control: either treat individual (primary) studies and meta-analyses as two separate worlds — each with their own timing — or integrate individual studies in the meta-analysis world. Taking up likelihood ratios in either approach allows for valid tests that relate well to the accumulating nature of scientific knowledge.

The quest for error control leads back to likelihood ratios.
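The error bound the abstract refers to can be demonstrated in a few lines. Under H0 the accumulating likelihood ratio is a nonnegative martingale with expectation 1, so by Markov's inequality the chance it *ever* reaches 1/α is at most α — no matter how often you peek or when you stop. A sketch, with an arbitrary alternative (mean 0.3) and threshold (20), simulated on purely null data:

```python
import math
import random

def lr_ever_crosses(rng, mu1=0.3, threshold=20.0, n_max=2000):
    """Accumulate the likelihood ratio of N(mu1, 1) vs. N(0, 1) on data that
    are actually null, peeking after every single observation; report
    whether the LR ever reaches the threshold."""
    log_lr = 0.0
    log_thresh = math.log(threshold)
    for _ in range(n_max):
        x = rng.gauss(0.0, 1.0)               # truth: H0
        log_lr += mu1 * x - mu1 * mu1 / 2.0   # log of per-observation LR
        if log_lr >= log_thresh:
            return True
    return False

rng = random.Random(2)
reps = 2000
rate = sum(lr_ever_crosses(rng) for _ in range(reps)) / reps
print(f"LR >= 20 reached on null data in {rate:.3f} of runs (bound: 1/20 = 0.05)")
```

Contrast this with the p-value simulation earlier in the thread: continual peeking eventually drives a p-value below any threshold on null data, while the likelihood-ratio crossing rate stays below 1/threshold regardless of the monitoring scheme.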

Interesting comment in relation to proposed large-scale “collaborative” efforts that could (or will?) “revisit” all the social-science classics of the last decades, like the IAT stuff, stereotype threat, etc. If you keep spending resources on it, you’ll eventually find something “significant” there! (Who cares if you could possibly spend your resources on something more valid, important, useful, predictive, explanatory, etc.)

Perhaps in line with (the gist of) your comment, I have recently also started to wonder whether repeatedly focusing on the same few specific constructs in your experiments, conclusions, and causal mapping of things might be some form of this problematic issue as well.

If “everything is somewhat correlated with everything” (see Meehl, 1990), the specific construct you (aim to) measure in your experiments will necessarily determine large parts of the conclusions you reach. But if that specific construct is (probably?) related to a whole lot of other constructs which are not measured, how is it possible to draw any truly valid conclusions this way?

I think this problem could be partly overcome by executing, and emphasizing, more direct comparison tests of hypotheses and/or constructs to see which ones “explain” or “predict” more or better. I think this could be a crucial part of the (optimal) scientific process in social science, but I am not sure. This is also because I haven’t heard or read much about it in all the recent discussions about how to improve matters, which I find very strange…
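One simple form such a direct comparison could take (everything here is an invented toy setup, not a claim about how it should be done): pit two candidate constructs against each other on out-of-sample prediction of the same outcome, rather than asking whether each one is "significant" in isolation.

```python
import random

rng = random.Random(3)

# Toy world: outcome y is driven mostly by construct a, weakly by b.
n = 500
a = [rng.gauss(0, 1) for _ in range(n)]
b = [rng.gauss(0, 1) for _ in range(n)]
y = [0.8 * ai + 0.1 * bi + rng.gauss(0, 1) for ai, bi in zip(a, b)]

def oos_error(x, y, split=250):
    """Fit a no-intercept slope on the first half of the data,
    score mean squared prediction error on the held-out second half."""
    slope = (sum(xi * yi for xi, yi in zip(x[:split], y[:split]))
             / sum(xi * xi for xi in x[:split]))
    test = list(zip(x[split:], y[split:]))
    return sum((yi - slope * xi) ** 2 for xi, yi in test) / len(test)

err_a, err_b = oos_error(a, y), oos_error(b, y)
print(f"out-of-sample MSE using a: {err_a:.2f}, using b: {err_b:.2f}")
```

Both constructs would correlate "significantly" with y at a large enough sample size; the head-to-head predictive comparison is what actually separates them.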

All that statistical significance has been measuring is the collective prior belief that there is an interesting relationship between some variables. A high prior means more resources will be expended to “detect” the relationship (increase sample size, etc., to get significance). A low prior leads to few resources and often failure to get significance.

Thanks,