## Why “bigger sample size” is not usually where it’s at.

Aidan O’Gara writes:

I realized when reading your JAMA chocolate study post that I don’t understand a very fundamental claim made by people who want better social science: Why do we need bigger sample sizes?

The p-value is always going to be 0.05, so a sample of 10 people is going to turn up a false positive for purely random reasons exactly as often as a sample of 1000: precisely 5% of the time. That’ll be increased if you have forking paths, bad experiment design, etc., but is there any reason to believe that those factors weigh more heavily in a small sample?

Let’s take the JAMA chocolate example. If this study is purely capturing noise, you’d need to run 20 experiments to get a statistically significant result like this. If they studied a million people, they’d also need only 20 experiments to get a false positive from noise alone. Let’s say they’re capturing not only noise but bad/malicious statistical design–degrees of freedom, manipulating the experiment. Is this any less common in studies of a million people? Why?

“We need bigger sample sizes” is something I’ve heard a million times, but I just realized I don’t get it. Thanks in advance for the explanation.

Sure, more data always helps, but I don’t typically argue that larger sample size is the most important thing. What I like to say is that we need better measurement.

If you’re measuring the wrong thing (as in those studies of ovulation and clothing and voting that got the dates of peak fertility wrong) or if your measurements are super noisy, then a large sample size won’t really help you: Increasing N will reduce variance but it won’t do anything about bias.
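Here is a minimal simulation sketch of that last sentence. Everything in it is an assumption made up for illustration, not taken from any of the studies mentioned: the true value is 0, every measurement is off by a fixed bias of 0.5, and the noise has standard deviation 2. The function name `simulate_estimates` is hypothetical.

```python
import random
import statistics


def simulate_estimates(n, bias=0.5, noise_sd=2.0, true_value=0.0, reps=500, seed=1):
    """Return `reps` sample means, each computed from n biased, noisy measurements.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


for n in (10, 100, 1000):
    estimates = simulate_estimates(n)
    center = statistics.fmean(estimates)   # stays near true_value + bias
    spread = statistics.stdev(estimates)   # shrinks roughly like 1 / sqrt(n)
    print(f"N={n:5d}  mean of estimates={center:.2f}  sd of estimates={spread:.2f}")
```

Under these assumptions the spread of the estimates shrinks as N grows, while their center stays near 0.5 rather than the true value of 0.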

Regarding your question above: First, I doubt the study “is purely capturing noise.” There’s lots of variation out there, coming from many sources. My concern is not that these researchers are studying pure noise; rather, my concern is that the effects they’re studying are highly variable and context-dependent, and all this variation will make it hard to find any consistent patterns.

Also, in statistics we often talk about estimating the average treatment effect, but if the treatment effect depends on context, then there’s no universally-defined average to be estimated.

Finally, you write, “they’d also need only 20 experiments to get a false positive from noise alone.” Sure, but I don’t think anybody does 20 experiments and just publishes one of these. What you should really do is publish all 20 experiments, or, better still, analyze the data from all 20 together. But, again, if your measurements are too variable, it won’t matter anyway.

1. RJB says:

A little off topic, but not much: from https://www.nber.org/papers/w26480.pdf

Estimates of teacher “value-added” suggest teachers vary substantially in their ability to promote student learning. Prompted by this finding, many states and school districts have adopted value-added measures as indicators of teacher job performance. In this paper, we conduct a new test of the validity of value-added models. Using administrative student data from New York City, we apply commonly estimated value-added models to an outcome teachers cannot plausibly affect: student height. We find the standard deviation of teacher effects on height is nearly as large as that for math and reading achievement, raising obvious questions about validity. Subsequent analysis finds these “effects” are largely spurious variation (noise), rather than bias resulting from sorting on unobserved factors related to achievement. Given the difficulty of differentiating signal from noise in real-world teacher effect estimates, this paper serves as a cautionary tale for their use in practice.

• Ben says:

That’s a fun blurb

• jim says:

That’s a great comment, not off topic at all. Relates exactly to what Andrew, I believe, was implying about noise and measurement:

“the effects they’re studying are highly variable and context-dependent, and all this variation will make it hard to find any consistent patterns.”

That applies exactly to measuring teacher impact, because it’s at least plausible and probably true that “teacher value added” is one of the smaller magnitude effects in student learning. So unless one controls for the larger magnitude effects – e.g., parental help, student raw intelligence, student effort – there’s no way in hell that a consistent pattern of “teacher value added” will emerge from studying student achievement.

• Carlos Ungil says:

See also “Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction”

2. Jonathan (another one) says:

The smaller the sample size, the larger the estimated effect needs to be in order to be “statistically significant.” Spurious *large* effects are more noteworthy than spurious *small* effects. The desire for large sample sizes comes not from some effect of sample size on Type I error; it comes from an attempt to reduce the presence of large effects that are spurious, even as you raise the possibility of “statistically significant” small effects. If the effect stays large as the sample size increases, “statistical significance” becomes essentially irrelevant. If it falls as the sample size increases, it becomes more probable that you just got lucky in the early, smaller studies (i.e., that you’re chasing noise).
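To put rough numbers on the first sentence, here is a small sketch. It assumes a simple two-group comparison of means with a known standard deviation of 1 and a two-sided z-test at the 0.05 level; the setup and the function name `min_significant_difference` are illustrative assumptions, not anything from the studies discussed.

```python
from math import sqrt
from statistics import NormalDist


def min_significant_difference(n_per_group, sd=1.0, alpha=0.05):
    """Smallest |difference in means| that reaches two-sided significance.

    Assumes a z-test with known sd and equal group sizes (an idealization).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sd * sqrt(2.0 / n_per_group)
    return z_crit * standard_error


for n in (10, 100, 1000, 10000):
    threshold = min_significant_difference(n)
    print(f"n per group={n:6d}  smallest significant difference={threshold:.3f} sd")
```

In this idealized setting the bar for significance falls like 1/sqrt(n), so any estimate that clears it in a small study has to be large, whether or not the underlying effect is.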

• Sort of. Most of these studies act as if they were random sampling from a population. They aren’t. For example, the Census Bureau does the ACS with a random sample of a population. The population is well defined… housing units in the US. On the other hand, take, say, the ovulation and voting example… the population of interest is premenopausal women eligible to vote. Almost all of these women have zero chance of being included in the study… so there is no random sample of the population. Continuing to make bigger and bigger samples of an odd, poorly defined, and even time-varying subpopulation will not cause any of the nice random sampling asymptotics to come into play… you’re still likely to get a spurious signal that is meaningless.

• Rick G says:

Yeah, but getting rid of estimates that are spuriously large for random reasons (even if estimates that are spuriously large for systematic reasons remain) would still help a lot. And increasing sample size requirements would help get rid of those randomly spurious inflated estimates.

3. Matt Skaggs says:

“‘We need bigger sample sizes’ is something I’ve heard a million times, but I just realized I don’t get it. Thanks in advance for the explanation.”

I think the answer is buried somewhere in the rambling response:

“…the effects they’re studying are highly variable and context-dependent, and all this variation will make it hard to find any consistent patterns”. [So by increasing N, you can] “reduce variance” [and context-dependency] even if “it won’t do anything about bias.”

4. Carlos Ungil says:

> If this study is purely capturing noise, you’d need to run 20 experiments to get a statistically significant result like this.

If you know that there is only noise you’re right: there is no point in increasing the sample size. There is no point in doing the study at all.

But if there is something else, increasing the sample size improves the signal-to-noise ratio. If the sample size is too small, even when there is an interesting effect to be found, you may need to run 19 experiments to get a statistically significant result. That’s why you design experiments to have an acceptable level of “statistical power”.
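A rough sketch of the power calculation being described, assuming a two-group comparison with known standard deviation 1, a two-sided z-test at 0.05, and a true difference of 0.2 standard deviations. All of these numbers, and the function name `power_two_sample`, are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(effect, n_per_group, sd=1.0, alpha=0.05):
    """Probability of a two-sided significant result (z-test, known sd)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = effect / (sd * sqrt(2.0 / n_per_group))  # true effect in standard-error units
    # Chance the z-statistic lands above +z_crit or below -z_crit.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


for n in (10, 50, 200, 1000):
    power = power_two_sample(0.2, n)
    # With independent, identical experiments, the expected number of runs
    # until the first significant result is 1 / power.
    print(f"n per group={n:5d}  power={power:.2f}  expected runs={1 / power:.1f}")
```

With a true effect of zero the function returns alpha itself, which is the "1 in 20" from the original question; with a real effect, larger samples push the power up and the expected number of experiments down.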

• This is true provided you are randomly sampling from a well-defined population. But in many of these social science or medical examples this is far from true. Or, to the extent it’s true, it’s random sampling from a small sub-population whose relevance is not clear (for example, college students at a particular university, or patients who happen to come to a particular clinic).

• Carlos Ungil says:

I think that’s a different issue, given that the question seems to be just why a larger sample size is better in principle. But I agree that if we care about dog metabolism, adding mice to the study doesn’t help much.

An addition to my previous comment: even if there is just noise, a larger sample size is better. We still get 1/20 significant results, but the estimated effects will be smaller, so there may be less risk of getting carried away… if we look at the results and not just at the fact that p<0.05. But I agree that the only way to improve some studies is to just not do them.
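A quick simulation sketch of the pure-noise case, assuming unit-variance normal data with a true mean of exactly zero and a z-test with known standard deviation. The sample sizes, the seed, and the function name `null_significant_estimates` are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def null_significant_estimates(n, reps=2000, alpha=0.05, seed=7):
    """Simulate `reps` pure-noise studies of size n.

    Returns (share of studies that are significant,
             mean |estimate| among the significant ones).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = 1.0 / sqrt(n)
    significant = []
    for _ in range(reps):
        estimate = fmean([rng.gauss(0.0, 1.0) for _ in range(n)])
        if abs(estimate) > z_crit * standard_error:
            significant.append(abs(estimate))
    return len(significant) / reps, fmean(significant)


for n in (10, 400):
    rate, typical_size = null_significant_estimates(n)
    print(f"n={n:4d}  share significant={rate:.3f}  mean |estimate| when significant={typical_size:.3f}")
```

The share of significant results should hover around 0.05 at both sample sizes, while the size of the "significant" estimates is much smaller in the larger studies.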

• My point is that a larger sample size isn’t better even in principle unless you are doing relatively high-quality random sampling from a population that’s actually of interest. So, to take an extreme example, suppose you want to treat children in Africa for worms, and you start by studying children in the state of Georgia in the US, because you’re a researcher in Georgia and it’s relatively inexpensive and accessible to get children from around your home county.

A big old sample will get you nowhere, because the health challenges faced by children in Georgia are nothing like the health challenges of children in Congo or whatever.

This seems obvious when you put it this way, because everyone can see that being a child in Congo is probably different from being a child in Georgia. But this stuff is done *all the time*. Imagine you’re a surgeon in Southern California looking to do studies of stents… now imagine you’re a surgeon in Missouri… now imagine you’re a surgeon in London… in Sweden… in France… These populations have large differences, and yet it’s routine to see studies done in one locale claiming to generalize far more widely than you could justify by random sampling theory.

• Carlos Ungil says:

> from a population that’s actually of interest

I get it. Dogs, mice.

• Yes, of course I expect *you* to get it Carlos ;-) I guess I’m extrapolating to additional readers, and wanted to clarify that it’s not just about extrapolating animal studies to different species. You often hear about people bringing up the fact that a study was done in mice and might not extrapolate to humans that well for example.

That’s a true issue, but even if you’re trying to find out how mice behave, there are big differences between strains, between colony conditions, between methods of experimentation… My wife and I once figured out why mice were dying as soon as people took them out of the trailer facility where they were being temporarily housed… After 5 or 6 hours of late night experimentation we discovered that swiss black juvenile mice have fatal audiogenic seizures when exposed to ultrasound coming from the motion detector light switches….

Things vary all over the place, and trivial applications of frequentist mathematics, in which everything is treated as a random sample and more data = smaller standard error, don’t correspond to reality in any way for large swaths of studies.

• matt says:

Most people understand this, Daniel. It’s called external validity. You have this weird habit of articulating very simple problems / claims in a basically uninterpretable fashion. Usually it involves the use of the phrase ‘random number generator’ although I don’t think that came up here, thankfully.

• Matt, your concept of most people is different than mine; perhaps you mean most PhDs in econometrics. It certainly isn’t true for most biomedical researchers, or for most psychologists who use panel after panel of undergrad volunteers, or for most of the research highlighted on this blog time after time.

Andrew has a tendency to use terms like “effects are variable and context dependent,” but he doesn’t define what this means. The average effect on testosterone production of a particular kind of power posing in the population of Americans on this day is a well-defined single number, not a variable at all, albeit one that we have no hope of measuring. I know what he means, but people routinely publish studies on, say, educational methods or health policy interventions, claiming either explicitly or implicitly that their results generalize, and when they have large sample sizes they publish small standard errors and claim their studies are very good and precise… none of which is true, for reasons discussed here all the time.

• It’s especially not true for most consumers of science journalism.

• Matt Skaggs says:

OK, I think the blog can do this. I know, that’s crazy. Ima’ incorporate Carlos’ term “signal to noise.” Here is something that answers the question that was posed:

In many datasets, the effects are highly variable and context-dependent. Increasing N will not reduce bias caused by problems in your model, but if your ability to extract useful information is limited by signal to noise, you can improve the dataset with a larger N.

Maybe this could be improved. Wouldn’t this blog be cool if all these smart people actually addressed the swell topics that are raised?

5. Wonks Anonymous says:

In your original post on the time-reversal heuristic, you noted the small sample size as one thing that made the initial study less reliable than the later replication.

• Andrew says:

Wonks:

Sure, there’s a benefit from increasing sample size. There’s a reason that when we do opinion polls, we do N=1000, not N=100 or N=10.

As I wrote above, more data always helps, but I don’t typically argue that larger sample size is the most important thing. What I like to say is that we need better measurement. If your measurements are basically OK, then, yeah, time to increase the sample size, roughly to the point where the residual standard error is of the same order of magnitude as the bias.
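As a back-of-the-envelope sketch of that rule of thumb, consider a simple-random-sample poll estimating a proportion near 0.5. The bias of 2 percentage points used below is a made-up number for illustration, and both function names are hypothetical.

```python
from math import ceil, sqrt


def poll_standard_error(n, p=0.5):
    """Standard error of a sample proportion under simple random sampling."""
    return sqrt(p * (1 - p) / n)


def n_where_se_matches_bias(bias, p=0.5):
    """Sample size at which the sampling standard error is about equal to `bias`."""
    return ceil(p * (1 - p) / bias ** 2)


for n in (10, 100, 1000):
    print(f"N={n:5d}  standard error={poll_standard_error(n):.3f}")

# Assumed (hypothetical) nonsampling bias of 2 percentage points.
print("N where standard error is about 0.02:", n_where_se_matches_bias(0.02))
```

Past that point, extra respondents mostly buy precision around the wrong number, which is why better measurement comes first.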

7. Thomas says:

“Significant” results from small studies can only be large, too large to be true. They could be type 1 errors, or overestimates of real effects (M errors?), but they are not to be trusted either way. E.g., treatment effects from trials that have been stopped after a planned interim analysis for effectiveness overestimate effects obtained from trials of the same interventions that went to completion (the ratio of relative hazards is something like 0.7).

Although occasionally one can hit the unexpected jackpot (e.g., Vitamin A supplements and their effect on child mortality).
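The overestimation of real effects mentioned above (the "M errors") can be sketched with a short simulation. It assumes the estimate is normally distributed around a true effect of 0.1 with a known standard error, and it compares a noisy study (standard error 0.3) with a precise one (standard error 0.03). These values and the function name `exaggeration_ratio` are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean


def exaggeration_ratio(true_effect, standard_error, alpha=0.05, reps=20000, seed=3):
    """Average |estimate| among significant results, divided by |true effect|."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(reps):
        estimate = rng.gauss(true_effect, standard_error)
        if abs(estimate) > z_crit * standard_error:
            significant.append(abs(estimate))
    return fmean(significant) / abs(true_effect)


for se in (0.3, 0.03):
    ratio = exaggeration_ratio(0.1, se)
    print(f"standard error={se:.2f}  significant estimates overstate the effect by about {ratio:.1f}x")
```

In the noisy case the significant estimates overstate the assumed true effect several times over; in the precise case they land close to it.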

8. Austin Fournier says:

Sure, more sample size isn’t going to change the percentage of false hypotheses that turn into false positives, but it is expected to change the percentage of positive results that are actually false positives (plug statistical power and your chosen alpha for the significance test into Bayes Theorem and the difference should be apparent). I would hazard a guess that most researchers are more concerned with Prob(False Positive | Positive) than Prob(False Positive | False). For this reason, the preoccupation with sample size makes sense to me from a hypothesis-testing standpoint.
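The Bayes Theorem calculation can be sketched in a few lines. It needs one extra ingredient not stated above: the share of tested hypotheses that are actually true, set here to 10% purely as an assumption. The function name `false_discovery_share` is hypothetical.

```python
def false_discovery_share(power, alpha=0.05, prior_true=0.1):
    """Prob(hypothesis is false | result is positive), by Bayes' theorem.

    `prior_true` is the assumed share of tested hypotheses that are true.
    """
    false_positives = alpha * (1 - prior_true)  # false hypotheses that test positive
    true_positives = power * prior_true         # true hypotheses that test positive
    return false_positives / (false_positives + true_positives)


for power in (0.1, 0.2, 0.5, 0.8):
    share = false_discovery_share(power)
    print(f"power={power:.1f}  share of positives that are false={share:.2f}")
```

Alpha is the same in every row; only the power changes, and with it the share of positive results that are false.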

• Andrew says:

Austin:

As I wrote above: My concern is that the effects they’re studying are highly variable and context-dependent, and all this variation will make it hard to find any consistent patterns. Also, in statistics we often talk about estimating the average treatment effect, but if the treatment effect depends on context, then there’s no universally-defined average to be estimated.

Given all that, I don’t find the false-positive, false-negative framework to be helpful.

• In most situations there doesn’t even exist a meaningful definition of “false positive”. Andrew’s concept of type S and type M errors is much more reasonable. Larger sample sizes make the magnitude estimates more precise, but only for the group being measured, since as Andrew says lots of effects are highly variable and context-dependent… the relevance of a more precisely estimated effect that only exists in a particular sub-population that can’t even be well identified, and in a context that no-one even knows how to reproduce… is basically zilch.

In the example I mentioned above, my wife and I found that reliably 100% of the time that you removed swiss black juvenile mice from their trailer, they would run around like crazy, flop around on the bottom of the cage, and suffocate from lack of breathing and die…

That’s a pretty damn strong effect. However, it required *swiss black* mice of a *certain age* being taken out of a trailer into a vestibule with an *ultrasound based motion detector* using a particular range of ultrasound frequencies, with the light switch in a particular position, power turned on, in a cage that wasn’t soundproofed… etc etc etc. A very, very specific context and population.

If we were typical scientists of the type featured on this blog we could have run a PNAS article claiming “transporting mice through doorways causes instant death p = 2.4e-17” making bold generalized claims supported by tiny p values and whatnot.

Sure, people could have countered with lack of replication (“we walked through a door carrying a mouse and found no effect whatsoever”), but since our paper was published first… no one would ever be able to unseat it, and decades from now we’d be talking about the dangers of transporting mice through doorways.

Much of the stuff featured here looks basically like that, except way less reliable or dramatic.

9. jim says:

Bob Dylan foresaw this problem way back in ’68:

You thought you’d increase your sample size and that
would fix your problems but your study didn’t repli-cat
Ain’t it hard when you discover that
bigger n really isn’t where it’s at
now you got a paper that you gotta repeal,

Oh how does it feel?
How does it feel?
it’s too late to bemoan
You got too many unknowns
you’re out of the publication zone
like a strawberry scone

10. Andre says:

Yo prof Gelman, the language generator is on point here. But really, you have to be careful about who your target audience is. You’re better off typing Simplified Chinese. The group you’re appealing to are not likely statistics students.

11. Harlan Campbell says:

In 2014-2015 a number of psychology journals adopted policies prioritizing studies with larger sample sizes in order to incentivize researchers to conduct higher-powered studies. While this strategy did motivate researchers to adopt larger sample sizes, there were also unintended consequences. Sassenberg et al. (2019) conclude that in the years following the policy change “the demand for higher statistical power […] evoked strategic responses among researchers. […] Researchers used less costly means of data collection, namely, more online studies and less effortful measures.”