We should all realise that empirical studies, no matter how well done and analysed, will mislead us a certain percentage of the time. We will never know when (at least given a single study of something really unknown, such as the fairness of an ordinary coin flip), nor how often. As Mosteller and Tukey once put it, with a single study you simply cannot assess the real uncertainty; it is beyond observation. With multiple studies there is some real access to it (the how-often-wrong _should_ be less), but it's still only a single set of studies.

Being certain you are not wrong about the sample size being adequate is not possible. All you can do is form your best judgement and make a bet. As Oliver Wendell Holmes put it, we can never be more than bettabilitarians.

]]>How about this: I flip an ordinary coin 1000 times and get heads 540 times, vs. only 460 tails, obviously. The difference is easily ‘statistically significant by standard measures’; indeed, it’s significant at roughly the 0.5% level for a one-tailed test or the 1% level for a two-tailed test. Taken on its own, this is very strong evidence that the coin is biased towards heads.
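The quoted significance levels can be checked with an exact binomial tail probability. This is a stdlib-only sketch; the 540-of-1000 numbers come from the comment above:

```python
from math import comb

def upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 540 heads out of 1000 flips of a fair coin
one_tailed = upper_tail(540, 1000)   # P(heads >= 540) under p = 0.5
two_tailed = 2 * one_tailed          # distribution is symmetric at p = 0.5

print(f"one-tailed p = {one_tailed:.4f}")
print(f"two-tailed p = {two_tailed:.4f}")
```

Running this gives a one-tailed p-value in the neighborhood of 0.6% and a two-tailed value around 1.2%, i.e. roughly the levels quoted above.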

But I’ve already told you it’s an ordinary coin. As Andrew has discussed somewhere, an ordinary coin can be a little teeny bit biased, but there’s no way to make one that will give you 54/46. I don’t even think you could do 51/49 without doing something very non-ordinary like an extremely beveled edge that will favor heads when the coin bounces, and maybe not even then.

So haven’t I just contradicted myself? I’ve got an experimental result indicating that the coin is strongly biased, statistically significant at the 1% level, but I’ve just told you that it is pretty much impossible for this result to be real. Is this a completely artificial example? The answer is no. A result this extreme will happen about 1% of the time, and 1% is not zero. If you flip an ordinary-seeming coin 1000 times and get 540 heads, you have _not_ learned that your coin is strongly biased towards heads: 1000 flips isn’t nearly enough to quantify the bias of an ordinary-seeming coin.
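Both points can be made concrete with a small stdlib-only sketch: a fair coin clears the 540-head bar a small-but-nonzero fraction of the time, and even a (physically implausible) 51/49 coin is not dramatically more likely to do so, which is why one such result can't pin down the bias:

```python
# Sketch: with a fair coin, how often do 1000 flips give >= 540 heads?
# And how much more often would a slightly biased (51/49) coin do it?
from math import comb
import random

def upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Monte Carlo check for the fair coin: count the 1-bits (heads)
# among 1000 random bits per trial.
random.seed(0)
trials = 200_000
hits = sum(bin(random.getrandbits(1000)).count("1") >= 540
           for _ in range(trials))

print(f"simulated fair-coin rate: {hits / trials:.4f}")
print(f"exact fair-coin rate:     {upper_tail(540, 1000, 0.50):.4f}")
print(f"exact 51/49-coin rate:    {upper_tail(540, 1000, 0.51):.4f}")
```

The fair coin hits 540+ heads roughly 0.6% of the time and the 51/49 coin only about 3% of the time, so a single 540-head result shifts the odds between those two hypotheses by a factor of five or so. It comes nowhere near distinguishing "fair" from "very slightly biased".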

]]> Freedman’s Rabbit Axioms

1. For the number of rabbits in a closed system to increase the system must contain at least 2 rabbits.

2. You cannot pull a rabbit from a hat unless at least one rabbit has previously been placed in the hat.

3. Corollary: You cannot “borrow” a rabbit from an empty hat, even with a binding promise to return the rabbit later. NO NEGATIVE RABBITS

No, it’s not true, by definition or otherwise, that finding a statistically significant comparison in a small sample is evidence that the sample size was sufficient to detect the effect. Not at all.

Here’s an example, one that I’ve used before: Suppose someone looks at data from a survey of 3000 people and estimates the difference in proportion of girl births, comparing children of beautiful to non-beautiful parents, and he finds a difference of 8 percentage points with a standard error of 3 percentage points, which is statistically significant at the conventional level. In real life, though, any population difference in these proportions cannot realistically be larger than 0.1 percentage points or so. N=3000 is simply not enough data to learn anything useful here. It’s the kangaroo problem. In this example, the sample size was not sufficient to detect the underlying comparison of interest, and that’s the case whether or not the particular sample happens to be statistically significant.
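The "not enough data" point can be made concrete with a small simulation of the exaggeration ("type M") problem, assuming the numbers above: a true difference of 0.1 percentage points and a standard error of 3 percentage points (the seed and simulation count are arbitrary choices for illustration):

```python
import random

random.seed(1)
true_diff, se = 0.001, 0.03   # 0.1 pp true effect, 3 pp standard error
n_sims = 100_000

# Simulate repeated surveys: each estimate is true effect plus noise.
estimates = [random.gauss(true_diff, se) for _ in range(n_sims)]
# Keep only the "statistically significant" ones at the 5% level.
signif = [est for est in estimates if abs(est) > 1.96 * se]

share = len(signif) / n_sims
exaggeration = sum(abs(e) for e in signif) / len(signif) / true_diff

print(f"share of estimates reaching significance: {share:.3f}")
print(f"average |significant estimate| / true effect: {exaggeration:.0f}x")
```

Any estimate that clears the significance bar here must exceed 5.9 percentage points in magnitude, so conditional on significance the estimate overstates the true 0.1-point effect by a factor of sixty or more: statistical significance guarantees a wildly exaggerated estimate, not a detected effect.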

]]>Would you not agree that there is some truth to the claim that if you find a stat. sig. comparison in a small sample, this is evidence that the sample size was sufficient to detect the effect? Indeed, this is true by definition, no? I understand that you think this is harmful thinking because of the way research such as this is typically conducted (e.g. garden of forking paths). But if we assume that the research was done exactly as somebody would like if they were setting out to test a specific hypothesis (i.e. a preregistered analysis), then I don’t really see the issue with this claim. As long as there is no selective reporting or some form of p-hacking, then I don’t see why it’s incorrect to say that statistical significance *means more* in small samples. Of course, all else equal, I’d prefer to have an estimate from a larger sample than a smaller one, as it conveys more information.

]]>“Certainly we knew before any data were collected that the null hypotheses being tested were false … The only question was whether or not the sample size was sufficient to detect the difference.”

Tables 1 and 2 are also relevant.
