## “The sample size is huge, so a p-value of 0.007 is not that impressive”

The above remark, which came in the midst of my discussion of an analysis of Iranian voting data, illustrates a gap, nay, a gulf, in understanding between statisticians and (many) nonstatisticians, one of whom commented that my quote “makes it sound that [I] have not a shred of a clue what a p-value is.”

Perhaps it’s worth a few sentences of explanation.

It’s a commonplace among statisticians that a chi-squared test (and, really, any p-value) can be viewed as a crude measure of sample size: When sample size is small, it’s very difficult to get a rejection (that is, a p-value below 0.05), whereas when sample size is huge, just about anything will bag you a rejection. With large n, a smaller signal can be found amid the noise.

In general: small n, unlikely to get small p-values. Large n, likely to find something. Huge n, almost certain to find lots of small p-values.
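A rough sketch of this pattern, under simplifying assumptions that are not from the Iranian analysis: a one-sample z-test with known standard deviation, a hypothetical true effect of 0.01 standard deviations, and a sample mean that lands exactly on the true effect. The function names are made up for illustration.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic z."""
    # erfc keeps precision in the far tail, where 1 - cdf(z) would round to 0.
    return math.erfc(abs(z) / math.sqrt(2))


def expected_z(effect_sd: float, n: int) -> float:
    """z-statistic if the sample mean equals the true effect (in SD units)."""
    return effect_sd * math.sqrt(n)


effect = 0.01  # hypothetical tiny discrepancy from the null, in SD units
for n in (100, 10_000, 1_000_000):
    z = expected_z(effect, n)
    print(f"n={n:>9,}  z={z:5.2f}  p={two_sided_p(z):.3g}")
```

The discrepancy is the same in every row; only n changes, and the p-value goes from unremarkable to astronomically small.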

The other piece of the story is that our models are just about always wrong. Rejection via a low p-value (with no other information) tells us that the model is wrong, which we already knew. The real question is how large the discrepancies are. P-values can be useful in giving a sense of uncertainty.

The only situation I can think of where the model holds up even when sample sizes are huge is the sex ratio. Under normal conditions, the sexes of births really are statistically independent with pretty much constant probabilities; that is, the binomial model holds. Even for n in the millions. See here for some calculations showing the good model fit, for N’s ranging from 4000, to 400,000, to 3 million. And it’s news when the data don’t fit the model. But the sex ratio is an unusual example in that way. Usually when you have large N, you’ll reject the model as a matter of course.

P.S. In the actual example, as Eduardo Leoni pointed out, I made a mistake, saying the sample size was “huge” when it was only 366 (more of a “large” or a “moderate,” I’d say). So my argument about the p-value doesn’t apply so perfectly to the Iranian election data. But this mistake doesn’t really affect my general point above.

1. James says:

Hi,

Amen on the P-value clarification!

The difference between Bayesians and Frequentists is that the former know that P-values (and significance tests in general) don't give you a whole pile of information about how important the effect is.

A good article for the technically inclined lay reader is linked below.

2. BK says:

Not all tests are sensitive to N. Anything in the ANOVA family is; there exist other setups that conclude with a lookup on the Chi-squared table that are not. t-tests are not N-sensitive: you can look up the formula and verify that as N gets large, numerator and denominator scale together.

You approximately say that here, but it's about the structure of the statistic, not the subject matter.

3. ChristianK says:

Running the same algorithm on a few other western elections that we assume to be fair would give us a fine test.

4. ZBicyclist says:

@BK: t-tests are a charter member of the ANOVA family. F=t^2 for comparable degrees of freedom.

The separation is between tests of significance which are sensitive to sample size, and measures of relation (r-squared, eta, and so on) which aren't. I can't see that this divides Bayesians and Frequentists.
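The F = t^2 relation can be checked numerically. This is a minimal sketch assuming the pooled-variance two-sample t-test and a one-way ANOVA on the same two groups; the data are invented and the function names are illustrative only.

```python
import math
from statistics import mean


def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    # The standard error shrinks as the group sizes grow.
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups."""
    values = [x for g in groups for x in g]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up measurements for two groups.
group_a = [5.1, 4.9, 6.2, 5.8, 5.5]
group_b = [4.2, 4.8, 5.0, 4.4, 4.9, 4.6]

t = pooled_t(group_a, group_b)
f = anova_f([group_a, group_b])
print(f"t^2 = {t * t:.6f}, F = {f:.6f}")
```

Note the `1 / na + 1 / nb` term in the denominator: for a fixed difference in means and a fixed spread, the t-statistic grows roughly with the square root of the sample size.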

5. Juliet says:

Consider testing that a regression coefficient equals 0. If there is truly no effect (beta=0), does having a large n still mean one is likely to find something (beta != 0)? Thanks!

6. Jim P says:

One way I like to convey this is with small multiples of a generic y vs. x chart, varying p and r^2. A few data points pretty close to a straight line, high p, high r^2. A bunch of points not all that close to a line but with a definite trend, low p, low r^2.

7. Phil says:

Juliet: the true beta is never zero, for any coefficient you would bother to put in your model in the first place. It might be very, very small, but it is never zero. So having a large enough n means you will reject the beta=0 hypothesis.

8. Bill Jefferys says:

An extreme example of this is found in the following paper:

for which a (one-sided) p-value was computed of 0.00015 on over 100 million Bernoulli trials. The authors should have reported a two-sided p-value since a negative result would have been just as interesting as a positive one.

In my opinion the whole effort was an enormous waste of time and resources. As Andrew points out, it's almost certain that their model is wrong in some slight way.

I wrote a paper commenting on this one. Principally it was as an example of the Jeffreys-Lindley "Paradox", because looking at the same data from a Bayesian point of view one gets an entirely different result. Jim Berger has used my example in elementary lectures.
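For readers who have not seen the Jeffreys-Lindley effect, here is a small sketch with hypothetical counts (not the data from the paper above). It assumes a point null of theta = 0.5 and, as the alternative, a uniform prior on theta over (0, 1); a different prior would give a different Bayes factor.

```python
import math


def two_sided_p_normal(k: int, n: int) -> float:
    """Two-sided p-value for k successes in n fair-coin trials (normal approx.)."""
    z = (k - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))


def bayes_factor_null(k: int, n: int) -> float:
    """Bayes factor for H0: theta = 0.5 against H1: theta ~ Uniform(0, 1).

    Under the uniform prior the marginal likelihood of k is 1 / (n + 1),
    so BF01 = (n + 1) * C(n, k) * 0.5**n. Computed in logs to avoid overflow.
    """
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_bf = math.log(n + 1) + log_binom + n * math.log(0.5)
    return math.exp(log_bf)


# Hypothetical counts chosen so that z is about 3.3.
n_trials, n_success = 1_000_000, 501_650
p_value = two_sided_p_normal(n_success, n_trials)
bf01 = bayes_factor_null(n_success, n_trials)
print(f"p-value = {p_value:.2g}, Bayes factor in favor of the null = {bf01:.2f}")
```

With these made-up numbers the p-value is around 0.001 while the Bayes factor mildly favors the null, which is the kind of disagreement the "paradox" refers to.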

9. Juliet says:

Thanks for your input. I see this discussion occur across fields. I am not in social science, but it does seem that many/most effects are not zero, so significance testing is not as useful. In genetics, we often test if a genetic variant influences a trait. Most variants do not influence a trait, so significance testing is very common in this field because we just want to know if there is anything there. Would this influence your response to my original question? Thanks!

10. Amanda Owen says:

So can you then comment on this in the context of effect sizes (and relative measures of the value of large/medium/small effect sizes?)

Is there a way to talk about effect sizes relative to beta coefficients?

11. Anonymous Coward says:

Juliet: You're referring to the situation in which the null hypothesis really is true.

In this case, the null hypothesis is not true, but the question is what is the "real" significance of the result. In economics, we make the distinction between statistical significance and economic significance.

An example would be the effect of an increase in the minimum wage on unemployment. With a million observations, you are likely to find that an increase in the minimum wage has a statistically significant effect. But perhaps a \$1 increase in the minimum wage causes employment to fall by 0.0001%. The finding would be statistically significant, but not economically significant, as a 0.0001% decline in employment is trivial.

12. Dave says:

Any further refinement on the term "small n"?

n

13. Mike Rulle says:

Question

Am I safe to assume that what is being discussed here is the difference between the "statistical significance" of a low "p-value" and the importance and/or "magnitude" of the difference measured?

For example, when testing the "null hypothesis" that a coin is fair, 57 heads out of 100 gets a p-value of .1; but 570 out of 1000 gets a p-value of .00001 in rejecting the null.

If one had that knowledge in a gambling situation, I assume the low p-value is a good thing, correct?
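The coin figures above can be checked with an exact binomial calculation. This is a sketch using only the standard library; it reports two-sided p-values, so the numbers will differ somewhat from one-sided or normal-approximation values.

```python
import math


def fair_coin_two_sided_p(heads: int, n: int) -> float:
    """Exact two-sided p-value for `heads` out of n tosses of a fair coin."""
    # The fair-coin distribution is symmetric, so double the upper tail
    # at whichever of heads/tails is larger.
    extreme = max(heads, n - heads)
    upper_tail = sum(math.comb(n, i) for i in range(extreme, n + 1))
    return min(1.0, 2 * upper_tail / 2 ** n)


p_100 = fair_coin_two_sided_p(57, 100)
p_1000 = fair_coin_two_sided_p(570, 1000)
print(f"57/100:   p = {p_100:.3g}")
print(f"570/1000: p = {p_1000:.3g}")
```

The observed proportion is 0.57 in both cases; the difference in p-values comes entirely from the sample size.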

14. PaulB says:

All things remaining constant (variance, effect size, distribution), a larger N increases the power to reject the null hypothesis in any statistical test. As in the economics example, in randomized clinical trials one must be certain that the expected effect size leads to a "clinically meaningful" difference between treated and placebo populations. If not, one can design a clinical trial with 80 or 90% power (depending upon how much money you have to spend) to demonstrate a statistically-significant, but completely unimportant effect.
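A back-of-the-envelope version of that power argument, assuming a two-sided one-sample z-test with known variance and a hypothetical effect of 0.05 standard deviations (real trial designs use more elaborate calculations):

```python
import math
from statistics import NormalDist


def power_two_sided_z(effect_sd: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test for a true effect given in SD units."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_sd * math.sqrt(n)  # mean of the z-statistic under the alternative
    # Probability of landing in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


tiny_effect = 0.05  # hypothetical, possibly far below anything "clinically meaningful"
for n in (100, 1_000, 3_140, 100_000):
    print(f"n={n:>7,}  power={power_two_sided_z(tiny_effect, n):.3f}")
```

Under these assumptions roughly 3,000 observations are enough for 80% power, whether or not an effect of that size matters to anyone.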

The other, more disturbing misunderstanding amongst scientists is the notion that you can test all the comparisons in sight and declare a statistical difference thereby found as "significant" in any statistical OR meaningful way.

15. jonathan says:

16. jonathan says:

Forgot to mention how much I love the relation of this material to set theory – types, etc.

17. J Smith says:

@Juliet

By the very nature of the regression, if beta truly equals 0 then the chance of incorrectly finding an effect is the alpha level regardless of sample size. However, as the sample size gets very large, negligible differences from 0 (e.g. 0.001) could result in very small p-values.
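The first half of that statement can be illustrated by simulation. As a stand-in for the regression case, this sketch uses a z-test on the mean of standard normal draws with the null exactly true; the seed, sample sizes, and number of simulations are arbitrary choices.

```python
import math
import random


def z_test_p(sample, sigma: float = 1.0) -> float:
    """Two-sided p-value for H0: mean = 0, with known standard deviation."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n: int, sims: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> float:
    """Share of simulated null datasets of size n that are rejected at alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true mean is exactly 0
        if z_test_p(sample) < alpha:
            rejections += 1
    return rejections / sims


rates = {n: false_positive_rate(n) for n in (10, 400)}
for n, rate in rates.items():
    print(f"n={n:>4}  rejection rate under a true null = {rate:.3f}")
```

Both rates should hover near 0.05: a larger n does not create false positives when the null is exactly true, it only makes the test more sensitive to whatever discrepancy actually exists.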

18. Bill Jefferys says:

I've noticed that the article that I linked to at the Journal of Scientific Exploration is badly scanned.

Thanks, jonathan, for getting me to read the archived articles after almost 20 years. Alphas were turned into a's, and there are lots of other errors. I don't know how they did this, but just scanning the pages for images would have been better.

My article as submitted (and basically published) is here:

Also, an article I published in response to a response to that article is here:

Bill

19. jonathan says:

Thank you for the additional paper. Using psychic claims to discuss Bayesian methods is really cool; it's rare to find such a perfect example.