Masanao sends this one in, under the heading, “another incident of misunderstood p-value”:
Warren Davies, a positive psychology MSc student at UEL, provides the latest in our ongoing series of guest features for students. Warren has just released a Psychology Study Guide, which covers information on statistics, research methods and study skills for psychology students.
Despite the myriad rules and procedures of science, some research findings are pure flukes. Perhaps you’re testing a new drug, and by chance alone, a large number of people spontaneously get better. The better your study is conducted, the lower the chance that your result was a fluke – but still, there is always a certain probability that it was.
Statistical significance testing gives you an idea of what this probability is.
In science we’re always testing hypotheses. We never conduct a study to ‘see what happens’, because there’s always at least one way to make any useless set of data look important. We take a risk; we put our idea on the line and expose it to potential refutation. Therefore, all statistical tests in psychology test the possibility that the hypothesis is correct, versus the possibility that it isn’t.
I like the BPS Research Digest, but one more item like this and I’ll have to take them off the blogroll. This is ridiculous! I don’t blame Warren Davies–it’s all-too-common for someone teaching statistics to (a) make a mistake and (b) not realize it. But I do blame the editors of the website for getting a non-expert to emit wrong information. One thing that any research psychologist should know is that statistics is tricky. I hate to see this sort of mistake (saying that statistical significance is a measure of the probability the null hypothesis is true) being given the official endorsement of the British Psychological Society.
P.S. To any confused readers out there: The p-value is the probability of seeing something as extreme as the data or more so, if the null hypothesis were true. In social science (and I think in psychology as well), the null hypothesis is almost certainly false, false, false, and you don’t need a p-value to tell you this. The p-value tells you the extent to which a certain aspect of your data is consistent with the null hypothesis. A lack of rejection doesn’t tell you that the null hyp is likely true; rather, it tells you that you don’t have enough data to reject the null hyp. For more more more on this, see for example this paper with David Weakliem which was written for a nontechnical audience.
P.P.S. This “zombies” category is really coming in handy, huh?
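For readers who want to see the definition in the P.S. in action, here is a minimal sketch. It assumes a test statistic that is standard normal under the null hypothesis; the observed value `z_obs = 2.0` and the helper names are made up for illustration. It computes the two-sided p-value exactly and then approximates it by simulating data under the null and counting how often the simulated statistic is at least as extreme as the observed one.

```python
import math
import random

def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def simulated_p_value(z_obs, n_sims=200_000, seed=1):
    """Share of draws made UNDER THE NULL that are at least as extreme as z_obs."""
    rng = random.Random(seed)
    hits = sum(abs(rng.gauss(0.0, 1.0)) >= abs(z_obs) for _ in range(n_sims))
    return hits / n_sims

z_obs = 2.0  # hypothetical observed z-statistic (an assumption, not from the post)
p_exact = 2.0 * normal_sf(abs(z_obs))  # two-sided tail area under the null
p_sim = simulated_p_value(z_obs)
print(f"exact two-sided p = {p_exact:.4f}, simulated = {p_sim:.4f}")
```

Note that every simulated draw comes from the null distribution. Nothing in the calculation says how probable the null itself is.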
I cannot believe they published that. I teach intro stats at a community college and not ONE of my students would make that mistake. They would make a whole raft of OTHER mistakes, but at least they understand what statistical significance does and doesn't tell them.
I am baffled; just what about the quoted text (I haven't read the piece, but you quoted some of the text, so the offending statement(s) must be in there) do you object to? I see nowhere that the author claims what you say he claims. Indeed, what is claimed is, albeit loosely stated, pretty standard fare, even for you, and it makes a point: all hypotheses are necessarily subject to test. Now, granted, the distinction between substantive and statistical "hypotheses" is glossed, but so what? That is routine in this domain (and certainly on your blog). But we all know what is usually meant. So, just what is your objection here?
What's the history of confusion over p-values — does it go right back to the beginning? And what about the all-out assault on this confusion — does it also go way back?
It seems like the human mind, even one with high IQ, is not meant to understand what a p-value is. Our modern minds want so bad for social science to tell us whether our view of how human nature works is right or wrong.
It doesn't seem like our minds are merely mistaken when they learn about p-values — they seem biased in the direction of "p-value tells me how right my view is." We crave more certainty about whether we're right or wrong.
Looks like this confusion won't go away anytime soon.
FYI, cutting off half the 'o's and '!'s would cause this post to be much, much more readable in RSS readers. (Currently it spans two pages, meaning one must scroll left to right to read each line…)
"It seems like the human mind, even one with high IQ, is not meant to understand what a p-value is."
I think that this might be more an issue of training. In physics, my students all start with an Aristotelian theory of gravity (big objects fall faster) but eventually get quite good at reasoning correctly.
Perhaps the p-value is the same way, except that we don't put the effort into carefully teaching it?
After all, it wasn't easy to get the proper definition of gravity accepted; maybe one day we'll see p-values in the same light?
John:
The following statement from the above quote is false: "all statistical tests in psychology test the possibility that the hypothesis is correct, versus the possibility that it isn't."
No. The p-value is not the probability (or "the possibility") that the hypothesis is correct.
This is a well-known mistake so I'm hardly shocked to see it expressed, but it's disappointing to see it on a reputable website that some people might treat as an authority.
Agreed, if that was what was meant: but a more plausible reading is that the hypothesis at issue is not the statistical "hypothesis", but the substantive one.
Couldn't they at least look in a book?
Say, Everitt's Dictionary of Statistics.
They could even (gasp) look at Wikipedia, which gets it right, including that tricky bit about IF the null hypothesis is true.
http://en.wikipedia.org/wiki/P-value
John:
Nope. The p-value doesn't give you the probability that the substantive hypothesis is true, either.
The quoted passage nowhere mentions p-values. So it's unfair to try and then convict this guy (in public) of misunderstanding them on the basis of it. He may or may not understand them.
As for the statement "all tests.." it may be a bit sloppy but I'm guessing he means that hypothesis testing is about testing a hypothesis against an alternative hypothesis. It's really no big deal.
Kevin:
When he writes "statistical significance testing," I think it's safe to assume that he's talking about p-values.
It's written by a master's student, for heaven's sake. So, he wrote 'possibility' instead of writing 'hypothesis'. Big deal. Note that he did not write 'test the probability the hypothesis is correct'; substituting probability for possibility gives a very different meaning to that sentence. If you read the rest of the post, he provides a fairly good (if pointless) example that makes it clear he understands the meaning.
Agnostic: yes, it probably goes all the way back to Fisher. After all, it was he that coined the term likelihood in its non-intuitive sense: the likelihood for the parameters being the probability of the data given the parameters. Try explaining that likelihood is not probability to non-statisticians: they'll think you're a bampot.
Alex:
I understand the inclination to root for the underdog and to defend some innocent dude who I'm blasting on the blog. But the above passage is wrong, and it's presented as a helpful explanation! There are various correct explanations out there in lots of textbooks, so no need to muddy the waters with something wrong. I'm not saying that the author of the quoted passage is a bad guy, just that he was put in the position to do something he's not qualified to do.
As I wrote above, I don't blame the author–a lot of people make this sort of mistake! I blame the British Psychological Society for thinking that they can outsource statistical explanations to non-experts. I don't see the point in someone posting an explanatory note on an authoritative website and getting it wrong. In this case, they'd be better to just link to Wikipedia. The posted discussion of significance testing adds negative value.
Again, I don't blame the author–he didn't know better, and it's common enough for people to get overconfident about their statistical knowledge after teaching a couple courses. It's the fault of BPS for putting his piece out there.
@agnostic – I've read (and I forget where) that in Fisher's writings, he was clear what a p-value was, but he didn't emphasize the difference between what it is, and what we think it is. Guilford wrote the first book that was used by psychologists (Fundamentals of statistics in psychology and education) sometime in the 1940s, and got this wrong. Most other authors and instructors read Guilford, not Fisher, and also got it wrong.
I think it's confounded by the fact that psychology students are taught statistics by psychologists. And they tend to do their own statistics. Medics (by way of contrast) are taught by statisticians, and stats in medicine (to a larger extent) is done by statisticians, whose major was statistics, not medicine.
I've heard psychologists described (by other psychologists) as "good statisticians", presumably because they took a couple of undergraduate courses and a couple of graduate courses, taught by other psychologists. I wonder what they'd think of a statistician who'd taken a similar number of psychology courses being described as "a good psychologist".
As Andrew says further up, statistics is hard. And people need to realize that.
Jeremy
Disclaimer: I'm a (quantitative) psychologist. Sometimes people call me a statistician. I don't always correct them.
Andrew: yes, I do like underdogs, but I don't see the problem with the statement you single out. 'Therefore, all statistical tests in psychology test the possibility that the hypothesis is correct, versus the possibility that it isn't.'
I think these would have been wrong variants of what was written:
statistical tests in psychology test the probability that the hypothesis is correct (meaningless)
statistical tests in psychology calculate the probability that the hypothesis is correct (incorrect)
I think these are correct and identical in meaning to what was written:
statistical tests in psychology test the statement/notion/postulate that the hypothesis is correct
Alex:
I disagree with his statement, "Statistical significance testing gives you an idea of what this probability [that your result was a fluke] is." I think this is a classic error. Statistical significance testing tells you the probability that, under the null hypothesis, you'd see something at least as extreme as the data. This is not the same–or even close to the same–as the probability that your result was a fluke (however that is interpreted).
Again, I don't mean to pick on this grad student. The problem is with the organization that presented him as a statistical expert.
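To make the distinction above concrete, here is a rough sketch. It reads "fluke" as "the null is exactly true", which is only one interpretation and one the post itself doubts is realistic in social science. The share of true nulls (`frac_null`) and the size of the real effects in z units (`effect_z`) are illustrative assumptions, not estimates of anything. Under those assumptions, Bayes' rule gives the share of significant results that come from true nulls.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def prob_null_given_significant(frac_null, effect_z, z_crit=1.96):
    """P(null true | |z| > z_crit) when a share frac_null of studies are true nulls.

    effect_z is the assumed mean of the z-statistic when the effect is real.
    """
    alpha = 2.0 * (1.0 - normal_cdf(z_crit))  # P(significant | null)
    power = (1.0 - normal_cdf(z_crit - effect_z)) + normal_cdf(-z_crit - effect_z)
    sig_and_null = frac_null * alpha
    sig_and_real = (1.0 - frac_null) * power
    return sig_and_null / (sig_and_null + sig_and_real)

# Hypothetical scenarios: same significance threshold, different shares of true nulls.
for frac_null in (0.1, 0.5, 0.9):
    share = prob_null_given_significant(frac_null, effect_z=2.8)
    print(f"share of true nulls = {frac_null:.1f} -> P(null | significant) = {share:.3f}")
```

The threshold is the same in every scenario, yet the probability that a significant result is a "fluke" in this sense moves with the assumed inputs. That is why the p-value alone cannot give it.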
Dear All,
I'm Christian Jarrett, editor of the BPS Research Digest. I've edited the post and added a note of clarification at the bottom: http://bps-research-digest.blogspot.com/2010/08/s…
I hope it's correct now, please let me know if not.
I should have taken more trouble to ensure this post was accurate – please accept my apologies, and let me assure you that guest posts of this nature will be more stringently reviewed in the future.
I realize that they get the definition of p-value incorrect, but if I were doing a tail-area test using a Bayesian posterior for the same problem then wouldn't this "p-value" be the probability of the null being true?
p(mu>0|data) gives you the same numerical value as the equivalent p-value, given a uniform prior for mu, yes?
The text was probably not referring to this, however, so there is a problem with their wording.
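A small numerical check of the question above, under assumptions that are mine rather than the commenter's: normal data with known standard deviation `sigma`, a flat prior on mu, and made-up values for the sample mean and sample size. In that setup the posterior for mu is normal, centered at the sample mean, with standard deviation sigma/sqrt(n). The posterior tail area on the null side of zero, P(mu <= 0 | data), which is one minus p(mu>0|data), then matches the one-sided p-value for testing mu = 0 against mu > 0.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def one_sided_p_value(ybar, sigma, n):
    """P(sample mean >= ybar) if mu = 0, for known sigma."""
    z = ybar / (sigma / math.sqrt(n))
    return 1.0 - normal_cdf(z)

def posterior_prob_mu_nonpositive(ybar, sigma, n):
    """P(mu <= 0 | data) under a flat prior: posterior is N(ybar, sigma^2 / n)."""
    post_mean = ybar
    post_sd = sigma / math.sqrt(n)
    return normal_cdf((0.0 - post_mean) / post_sd)

# Hypothetical data summary (assumed values for illustration only).
ybar, sigma, n = 0.4, 1.0, 25
p_val = one_sided_p_value(ybar, sigma, n)
post = posterior_prob_mu_nonpositive(ybar, sigma, n)
print(f"one-sided p-value = {p_val:.5f}, P(mu <= 0 | data) = {post:.5f}")
```

The two numbers agree, but they answer different questions: one is a statement about data given mu = 0, the other a statement about mu given the data and the flat prior.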
Andrew: I agree with your disagreement about the 'fluke' statement!
Now it reads
Therefore, all statistical tests in psychology test the probability of obtaining your given set of results (and all those that are even more extreme) if the hypothesis were incorrect – i.e. the null hypothesis were true.
I'm not sure this helps! Statistical tests in psychology (and elsewhere) test hypotheses. The means by which they do this testing can (but need not) involve the probability of seeing extreme data, under the null.
Yes, but it is trying to answer a different question. The usual question that users of hypothesis testing are trying to answer is whether the effect size is zero or non-zero, not whether the effect is greater than or less than some value. p(mu>0|data) gives you the latter, and for normal data the number that pops out of the Bayesian analysis will be the same as the one that pops out of the frequentist analysis (but the interpretation of the numbers is different). But there is no analogous Bayesian test for a frequentist two-sided test of a supposed point null hypothesis. That such tests are almost always silly (because point nulls are rarely, if ever, exactly "true") is another question. As Herman Rubin has said many times, he doesn't need any data to know that a point null hypothesis is false. Jim Berger and his collaborators have discussed how such tests should be conducted; the idea is that you test an approximate null (Berger and Delampady say in their paper, which is on jstor.org, what their meaning of an approximate null is). You put a finite amount of prior probability on that approximate null and distribute the remaining prior probability on the alternatives. This leads to phenomena such as the Jeffreys-Lindley "paradox," where the result of the Bayesian test and the frequentist test can differ greatly.
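Here is a rough numerical sketch of the Jeffreys-Lindley phenomenon just mentioned. It simplifies the setup: an exact point null mu = 0 stands in for the approximate null, with prior probability 1/2, and the remaining prior mass is spread as a normal N(0, tau^2) over the alternatives. The values of `sigma`, `tau`, and the sample sizes are illustrative assumptions. The z-statistic is held at 1.96 (two-sided p of about 0.05) while n grows.

```python
import math

def posterior_prob_null(z, n, sigma=1.0, tau=1.0, prior_null=0.5):
    """P(mu = 0 | data) for normal data with known sigma.

    Alternative prior: mu ~ N(0, tau^2). z is the observed z-statistic,
    i.e. sample mean divided by sigma / sqrt(n).
    """
    r = n * tau**2 / sigma**2
    # Bayes factor for the null: ratio of the marginal densities of the sample mean.
    bf01 = math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))
    prior_odds = prior_null / (1.0 - prior_null)
    post_odds = prior_odds * bf01
    return post_odds / (1.0 + post_odds)

z = 1.96  # the frequentist test "just rejects" at the 5% level for every n
for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9}: P(null | data) = {posterior_prob_null(z, n):.3f}")
```

With the z-statistic fixed, the posterior probability of the null rises toward 1 as n increases, while the p-value stays put at about 0.05.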
Probably the best way to go is to turn to decision theory. Supposedly the reason you do hypothesis testing is to make decisions, which involves choosing the best action to take under the circumstances. I don't see how you can do this unless you have a loss function that says how bad the results might be if the action you take is not appropriate to the (unknown) state of nature that actually prevails.
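As a toy illustration of the decision-theoretic framing, the sketch below picks the action with the smallest expected loss. Everything in it is hypothetical: the two states of nature, their probabilities, the two actions, and the loss values are invented to show the mechanics, not taken from any real problem.

```python
def best_action(state_probs, loss):
    """Return the action with the smallest expected loss, plus all expected losses.

    state_probs: dict mapping state -> probability (should sum to 1).
    loss: dict mapping action -> dict mapping state -> loss if that state holds.
    """
    expected = {
        action: sum(state_probs[state] * row[state] for state in state_probs)
        for action, row in loss.items()
    }
    return min(expected, key=expected.get), expected

# Hypothetical beliefs about the unknown state of nature.
state_probs = {"negligible_effect": 0.7, "worthwhile_effect": 0.3}

# Hypothetical loss function: skipping a worthwhile treatment is assumed to be
# five times as costly as adopting a negligible one.
loss = {
    "adopt": {"negligible_effect": 1.0, "worthwhile_effect": 0.0},
    "skip": {"negligible_effect": 0.0, "worthwhile_effect": 5.0},
}

action, expected = best_action(state_probs, loss)
print(action, expected)
```

With these made-up numbers the preferred action is to adopt, even though a negligible effect is the more probable state. The loss function, not a significance threshold, drives the choice.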