A graduate student in public health writes:

I have been asked to do the statistical analysis for a medical unit that is delivering a pilot study of a program to [details redacted to prevent identification]. They are using a prospective, nonrandomized, cohort-controlled trial study design.

The investigator thinks they can recruit only a small number of treatment and control cases, maybe fewer than 30 in total. After I told the investigator that I cannot do anything statistically with a sample size that small, he responded that small sample sizes are common in this field, and he sent me an example of an analysis that someone had done on a similar study.

So he still wants me to come up with a statistical plan. Is it unethical for me to do anything other than descriptive statistics? I think he should just stick to qualitative research. But the study he mentions above has 40 subjects and apparently had enough power to detect some effects. This is a pilot study after all, so the n does not have to be large. It’s not randomized, though, so I would think it would need a larger n because of the weak design.

My reply:

My first, general, recommendation is that it always makes sense to talk with any person as if he is completely ethical. If he is ethical, this is a good idea, and if he is not, you don’t want him to think you think badly of him. If you are worried about a serious ethical problem, you can ask about it by saying something like, “From the outside, this could look pretty bad. An outsider, seeing this plan, might think we are being dishonest etc. etc.” That way you can express this view without it being personal. And maybe your colleague has a good answer, which he can tell you.

To get to your specific question, there is really no such thing as a minimum acceptable sample size. You can get statistical significance with n=5 if your signal is strong enough.

Generally, though, the purpose of a pilot study is not to get statistical significance but rather to get experience with the intervention and the measurements. It’s ok to do a pilot analysis, recognizing that it probably won’t reach statistical significance. Also, regardless of sample size, qualitative analysis is appropriate and necessary in any pilot study.

Finally, of course they should not imply that they can collect a larger sample than they actually can.

>>>You can get statistical significance with n=5 if your signal is strong enough.

Dear Ann Landers,

I am concerned about the statistical validity of some published medical reports. How can I know when to trust a doctor's advice?

Rahul: I think the early trials of penicillin for septicaemia or tuberculosis would have been statistically significant with an N of something close to 5. (Get it, 100% don't die. Don't get it, 100% die).
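To put a number on that kind of all-or-nothing outcome, here is a minimal sketch of a one-sided Fisher exact test on a 2x2 table. The counts are hypothetical (5 treated patients who all survive, 5 untreated who all die) and are not taken from any actual penicillin trial; the function name is just for illustration.

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: treated / control. Columns: survived / died.
    Returns P(treated survivors >= a) with all margins held fixed.
    """
    row1 = a + b          # number treated
    row2 = c + d          # number of controls
    col1 = a + c          # total survivors
    total = comb(row1 + row2, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        # hypergeometric probability of k survivors among the treated
        p += comb(row1, k) * comb(row2, col1 - k) / total
    return p

# Hypothetical data: 5 of 5 treated survive, 0 of 5 controls survive.
p_value = fisher_exact_one_sided(5, 0, 0, 5)
print(f"one-sided p = {p_value:.4f}")
```

With 5 per arm and a perfect split there is only one table at least as extreme out of 252 possible, so the p-value is 1/252. Note that with 5 subjects in total (say 3 versus 2), the same test cannot get below 0.1, so "N close to 5" only works here if it means roughly 5 per group.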

This relates to something that came up on this blog before: http://www.stat.columbia.edu/~cook/movabletype/ar… – there weren't enough cases for statistical significance, but that doesn't matter, we still got a lot of information.

You could conceivably compute the power of your test as a function of sample size and present the prospective power analysis to others, so that one obtains a quantitative measure of how many subjects will suffice for any desired power. Of course, I don't think a "Bayesian" would do exactly this, but I imagine something similar?
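As a rough sketch of what such a prospective power calculation could look like, the following computes the exact power of a one-sided Fisher exact test for a binary outcome with equal arms. Everything specific here is an assumption for illustration: the binary outcome, the equal allocation, the success rates of 0.8 and 0.3, and the function names. The study in the original letter may call for a different test entirely.

```python
from math import comb

def fisher_p(x_t, x_c, n):
    """One-sided Fisher exact p-value with n subjects per arm.

    x_t, x_c: number of successes in the treated and control arms.
    """
    s = x_t + x_c
    total = comb(2 * n, s)
    tail = sum(comb(n, k) * comb(n, s - k) for k in range(x_t, min(n, s) + 1))
    return tail / total

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def power(n, p_treat, p_control, alpha=0.05):
    """Exact probability that the test rejects at level alpha.

    Sums the joint binomial probability over every (x_t, x_c) outcome
    whose Fisher p-value is at most alpha.
    """
    total = 0.0
    for x_t in range(n + 1):
        for x_c in range(n + 1):
            if fisher_p(x_t, x_c, n) <= alpha:
                total += binom_pmf(x_t, n, p_treat) * binom_pmf(x_c, n, p_control)
    return total

# Assumed (hypothetical) success rates: 0.8 treated, 0.3 control.
for n in (5, 10, 15, 20):
    print(f"n per arm = {n:2d}  power = {power(n, 0.8, 0.3):.3f}")
```

Tabulating power against n per arm like this makes it easy to show a collaborator what a total sample under 30 can and cannot detect, under whatever effect size they are willing to assume.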

I am not sure why making a 'potentially' bad argument would fall into the category of unethical.

Todd:

I'm confused about your statement. The word "potentially" does not appear above at all.

A famous example of a "significant" finding with a sample size of 4 on each arm is Fisher's cuppa tea test.
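For anyone who wants to check the arithmetic, here is a small sketch that enumerates every way of picking 4 cups out of 8 and counts how many picks match the truth. The cup labelling is arbitrary and the function name is made up for this example.

```python
from itertools import combinations

CUPS = range(8)
MILK_FIRST = {0, 1, 2, 3}  # arbitrary labelling of the 4 milk-first cups

def tea_p_value(min_correct):
    """P(at least min_correct milk-first cups identified) under pure guessing.

    The taster names 4 of the 8 cups as milk-first; under the null
    hypothesis every choice of 4 cups is equally likely.
    """
    guesses = list(combinations(CUPS, 4))
    hits = sum(1 for g in guesses if len(MILK_FIRST & set(g)) >= min_correct)
    return hits / len(guesses)

print(f"all 4 correct:      p = {tea_p_value(4):.4f}")
print(f"at least 3 correct: p = {tea_p_value(3):.4f}")
```

There are 70 possible picks and only one is perfect, so a flawless taster gets p = 1/70, below 0.05. Getting 3 of 4 right gives p = 17/70, which is nowhere near significant, so with this design only a perfect score counts.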

I took the comment "are there real-life examples where such small sample sizes (n=5) have yielded statistically significant results" as a challenge.

Let's ignore meta-analysis and cluster randomized trials as there would be some dispute about what the true sample size would be in those cases.

Here's a paper

http://www.biomedcentral.com/content/pdf/1471-212…

with the following statement:

"Western blot analysis followed by densitometry revealed a significant increase in the cleaved form of the autophagy marker myosin associated protein 1 light chain 3 (LC3) in STC2Tg pancreatic tissue (Figure 4A, B; n=4 animals p < 0.01) consistent with increased accumulation of ATF4 in STC2Tg mice"

but also note the later comment in the figure legend which cites an even smaller sample size:

"Representative Western blot analysis for LC3 I and LC3 II accumulation in wild type (WT) and STC2Tg pancreatic extracts revealed increased accumulation of cleaved LC3 II which was (B) quantified by densitometry (n=3 animals; *p<0.05)."

I did not want to take the time to resolve this discrepancy or to decipher whether this was 3/4 paired observations or 3/4 unpaired observations, which would produce a total sample size of 6/8. Still, I hope you take this as evidence that very small sample sizes have produced statistically significant results in the peer-reviewed literature.