# Misunderstanding the p-value

The New York Times has a feature in its Tuesday science section, Take a Number, to which I occasionally contribute (see here and here).

Today’s column, by Nicholas Bakalar, is in error. The column begins:

When medical researchers report their findings, they need to know whether their result is a real effect of what they are testing, or just a random occurrence. To figure this out, they most commonly use the p-value.

This is wrong on two counts. First, whatever researchers might feel, this is something they’ll never know. Second, results are a combination of real effects and chance; it’s not either/or.

Perhaps the above is a forgivable simplification, but I don’t think so; I think it’s a simplification that destroys the reason for writing the article in the first place. But in any case I think there’s no excuse for this, later on:

By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were probably due only to chance.

This is the old, old error of confusing p(A|B) with p(B|A). I’m too rushed right now to explain this one, but it’s in just about every introductory statistics textbook ever written. For more on the topic, I recommend my recent paper, P Values and Statistical Practice, which begins:

The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings (as discussed, for example, by Greenland in 2011). The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations). . . .

I can’t get too annoyed at science writer Bakalar for garbling the point—it confuses lots and lots of people—but, still, I hate to see this error in the newspaper.

On the plus side, if a newspaper column runs 20 times, I guess it’s ok for it to be wrong once—we still have 95% confidence in it, right?

P.S. Various commenters remark that it’s not so easy to define p-values accurately. I agree, and I think it’s for reasons described in my quote immediately above: the formal view of the p-value is mathematically correct but typically irrelevant to research goals.

P.P.S. Phil nails it:

The p-value does not tell you if the result was due to chance. It tells you whether the results are consistent with being due to chance. That is not the same thing at all.

## 162 thoughts on “Misunderstanding the p-value”

• Agreed. I think Bakalar did an OK job, at least looking at his target audience.

Andrew’s criticisms are valid for a more sophisticated audience with some grounding in statistics. In this particular context though, Andrew’s criticism comes across as nitpicking.

Can anyone (Andrew?) come up with a better way to communicate this to a lay audience in the ~60 or so words that Bakalar devoted to it?

Ironically, the alternative quote (from Andrew’s paper) would be absolutely incomprehensible to the average NYT reader IMHO.

• Rahul:

Of course my article is not intended for the average reader of the NYT.

It’s not at all “ironic” that a paper I wrote for the journal Epidemiology would not be clear to the average NYT reader. I posted it here because it’s relevant for the readers of this blog.

And, no, I don’t think it’s “nitpicking.” I don’t think there’s any point for an expository article to contain statements that are flat-out false and are also misleading. If a given writer can’t do it in the allocated space, I’d suggest running a different story. There’s lots of science news to be reported, and I don’t think anything is gained by passing around an error.

Again, I’m not blaming anybody here, exactly. It’s a common misconception that’s being reported. I just think the point of such an article should be to clarify the misconception if possible, not to repeat it.

• A reader who has never heard of the p-value before needs a gentle introduction. Sometimes a statement is deliberately left with a flaw or imprecision in the interest of ease of understanding.

e.g. If a Middle School Physics Teacher does not qualify his simplistic explanations of “mass” to include relativistic variations he can hardly be accused of “passing around an error” or uttering “flat-out-falsehoods”.

Abstractions are built in stages. “Lying a little” and then chipping away at the lie at the right time is not a bad pedagogical strategy.

IMHO, giving lay-audiences a substantially (though not entirely) correct picture of p-values is a better outcome than not covering it at all (in the fear that one couldn’t do it exactly right).

• Rahul:

I agree with you in general. But in this case I don’t see the article as giving much of a useful picture. Nor is a 400-word article in the newspaper necessarily the right place for such a tutorial. I think it works better to put something more newsy in the newspaper.

• If I were to attempt to explain p-values to a lay audience in a one-off article, I would be vague — appropriately so, I hope. Something like this:

A p-value indicates how unusual a result would be, if it were only a chance occurrence.

I would not even try to relate p-values to results that did not happen. Not in a one-off article. It appears that Bakalar did not do that, either. (The link is dead, BTW.)

• give me a break
there are two possibilities
p values aren’t really that hard; it is just that the statistics teachers are doing a lousy job (one can blame the student if one student doesn’t get it; if all the students don’t get it, then it is the fault of the teachers)
or
p values are, like quantum mechanics or general relativity, really hard
In which case, why on earth do we teach them to people?
Would we really blame the avg non specialist if they didn’t get the nuances of general relativity after one or two semesters?

1. I’m confused about your objection. Certainly you would give the author a good deal of leeway when he starts off the sentence “By convention…” since it certainly is the convention that a researcher accepts the null hypothesis if p >0.05. You can disagree with the method, but would you disagree that that is the convention?

Or do you disagree with his gloss “the results of the study, however good or bad, were probably due only to chance”? That is a certainly garbled description, but it doesn’t seem to be getting the conditional probabilities wrong.

I have seen much worse descriptions of p-values.

• I think Andrew is irked by the second part, and it does get the conditioning wrong:

compare:

“the results of the study, however good or bad, were probably due only to chance”

to:

“the results of the study, however good or bad, might have been caused only by random chance”

From a conditioning perspective:

P(*only* chance is operating | p > 0.05)

(I take it that in *most* areas of research this is a very small number, when p > 0.05 it indicates that whatever effect you’re expecting is smaller than the overall effect of chance, but it’s not zero)

vs

P(p > 0.05 | *only* chance is operating)

(if you KNEW that *only* chance was operating, and you have the correct random model, then this number should be 0.95)
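To make the difference between the two conditionals concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption chosen for illustration, not something taken from the column or the comment above: the share of studies in which only chance is operating (`prior_null`), the size of the real effect in the remaining studies (`shift`, in standard-error units), and the use of a two-sided z-test.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a test statistic that is N(0, 1) under the null."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_studies=200_000, prior_null=0.1, shift=1.0, seed=0):
    """Return (P(p > 0.05 | only chance), P(only chance | p > 0.05)).

    prior_null and shift are hypothetical values picked for illustration.
    """
    rng = random.Random(seed)
    null_total = 0     # studies in which only chance is operating
    nonsig_total = 0   # studies with p > 0.05
    both = 0           # studies that are both
    for _ in range(n_studies):
        is_null = rng.random() < prior_null
        z = rng.gauss(0.0 if is_null else shift, 1.0)
        nonsig = two_sided_p(z) > 0.05
        null_total += is_null
        nonsig_total += nonsig
        both += is_null and nonsig
    return both / null_total, both / nonsig_total


p_nonsig_given_null, p_null_given_nonsig = simulate()
print("P(p > 0.05 | only chance) =", round(p_nonsig_given_null, 3))
print("P(only chance | p > 0.05) =", round(p_null_given_nonsig, 3))
```

The first number sits near 0.95 by construction. The second depends entirely on the assumed mix of studies and effect sizes, which is exactly the information a p-value does not carry.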

• DC:

What Daniel said. “By convention, a p-value higher than 0.05 usually indicates that the researcher rejects the null hypothesis” would be much much better than “By convention, a p-value higher than 0.05 usually indicates that the results of the study were probably due only to chance.”

“Probably due only to chance” is just wrong. “Rejects the null hypothesis” is a description of behavior.

Also, you write, “I have seen much worse descriptions of p-values.” But “I have seen much worse” is a pretty low standard. This is the New York Times we’re talking about. They’re the best newspaper! When I write my columns for them, I do my best, I’m not satisfied with “it could’ve been worse.”

• Not only is that incomprehensible, as Rahul says, but it is also incorrect. IP has the more correct (and more incomprehensible) statement below.

Perhaps it would be better to say: By convention, journal editors reject papers unless they report a p-value less than 0.05.

• Roger:

Yes, your last sentence sounds reasonable. But I think my earlier statement, “By convention, a p-value higher than 0.05 usually indicates that the researcher rejects the null hypothesis” is correct, and I have no idea why you would think otherwise. The inserted word “usually” confuses things a bit; I just threw it in to be parallel with the sentence from the original article.

• Andrew:

Let me try to rephrase: “By convention, a p-value higher than 0.05 usually indicates that the researcher has to go on specification searching until p<=0.05. If these efforts fail then the manuscript is left in the drawer and forgotten.”

• The null hypothesis is usually what the researcher wants to disprove, so he wants a p-value less than 0.05 to reject it. So if the p-value is greater than 0.05 then he fails to reject the null hypothesis. The data could be a chance deviation from the null hypothesis, but he does not necessarily accept the null hypothesis. I was just agreeing with what IP said below. But “fail to reject the null” is a confusing triple-negative. Maybe it is better to just say that the p-value test is just a rule-of-thumb that is very popular in some fields.

• When p > the selected level of significance, you fail to reject the null that the data is due to chance (if this is the null), but you do not accept it.

• Hilary:

You wrote: “in a clinical trial for a new drug, the drug must be assumed ineffective until ‘proven’ effective.” I understand what you mean from a sociological perspective, but that sentence of yours bothers me in that I see your “must” as having normative content, and I certainly don’t believe it’s appropriate to think that a new drug must be assumed ineffective until proven otherwise. That seems like a bad attitude to me. Again, I recognize that this is a standard attitude embodied in many procedures and regulations, but I still don’t like it!

• That’s fair enough. I was trying to describe the status quo in clinical research, but I didn’t consider that it might sound like I was taking a stand on current clinical trial standards.

2. He could have gone a long way towards correctness without seeming pedantic if he’d said:

“By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were not reliably different enough from random chance.”

3. So your complaint is that it is better to say “the casual view” than to say “by convention”.

• Roger:

See my comment above.

4. Isn’t Andrew’s point more fundamental (i.e. not just about clear language)? The problem lies in applying p-values to the explanation, rather than to the data.

The p-value does not tell you if the result was due to chance. It tells you whether the results are _consistent_ with being due to chance. That is not the same thing at all.

I have a two-headed (actually, two-tailed) coin. If I flip it three times, it comes up tails all three times. That is consistent with the hypothesis that it is a normal coin. But it is not a normal coin. “The result is consistent with chance” is not the same as “the result is due to chance.” It just isn’t. But people often act as if it is, and that’s bad.
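The arithmetic behind the coin example can be sketched in a few lines of Python. The function name `binom_tail_p` is made up for this illustration, and the choice of a one-sided test (counting only results with at least as many tails) is an assumption; the comment above does not specify one.

```python
from math import comb


def binom_tail_p(k, n, p=0.5):
    """One-sided p-value: P(at least k tails in n flips) if P(tails) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Null hypothesis: a normal (fair) coin. Observed data: 3 tails in 3 flips.
print(binom_tail_p(3, 3))    # 0.125, well above 0.05

# The same coin flipped 20 times still shows all tails.
print(binom_tail_p(20, 20))  # 0.5 ** 20, far below 0.05
```

With three flips the fair-coin hypothesis is not rejected, even though the coin lands tails with probability 1. The p-value describes how the data would behave under the null; it says nothing directly about which coin is in hand.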

• Part of the confusion also stems from referring to “by chance” as if there’s a single definition of what “by chance” means when there’s not. Some medical researchers mistakenly think that the whole point of statistics is to tell you the right definition of “by chance” based on whether your data is continuous/discrete/categorical/etc. Then they mistakenly partition reality into 2 possibilities 1) either the mechanism implied by their comparison populations is correct or 2) the null structure they’ve imposed is true. This leads to a lot of downstream misconceptions.

• Yes. It’s like when trials say “not guilty” not “innocent”

• @Peter… huh? “Not guilty” is the same as “innocent”. When found not guilty you are acquitted and, at least formally, leave the courtroom without a stain on your character, etc etc.

Use of the p-value (alone) to determine significance/non-significance is more like deciding between “guilty” – with the added statistical idea that we don’t say this wrongly in more than proportion alpha of cases – versus declaring a mistrial. The “guilty” verdict is analogous to rejecting the null. In the mistrial verdict, no-one’s declared guilty, or not guilty, or innocent; we don’t say anything about hypotheses being true or not.

• In terms of law, yes, “not guilty” == “innocent”. But not in terms of logic; you cannot prove a negative. What you’re really saying is that you’ve failed to prove a positive (beyond reasonable doubt, even). Maybe this is obscured by all the “double jeopardy”-restraints etc.

• Andreas; actually I was thinking of it the other way round. Guilty=positive result=null hypothesis rejected; Mistrial=no positive result and no negative result either = nothing said about hypotheses.

I realize this leaves no space for “not guilty”=negative result=accept null hypothesis. But in statistical work, to go supporting null hypotheses (sensibly) we need information other than just knowing a valid p-value, so this is okay.

Couple of caveats; by “valid” here I mean just one that is uniform(0,1) under the null. Also, it may or may not be sensible to insist on Type I error rate control, depending on the situation. In many situations I encounter, testing is not particularly relevant.
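The "uniform(0,1) under the null" property can be checked with a small simulation. This is only a sketch under assumed conditions: the test statistic is taken to be exactly standard normal under the null and the p-value is two-sided. The function names are hypothetical.

```python
import math
import random


def null_p_values(n_sims=100_000, seed=1):
    """Two-sided p-values from z-statistics drawn under the null, z ~ N(0, 1)."""
    rng = random.Random(seed)
    return [math.erfc(abs(rng.gauss(0.0, 1.0)) / math.sqrt(2)) for _ in range(n_sims)]


def fraction_at_or_below(pvals, alpha):
    """Share of p-values that are <= alpha."""
    return sum(p <= alpha for p in pvals) / len(pvals)


pvals = null_p_values()
for alpha in (0.01, 0.05, 0.5):
    # For a valid p-value each fraction should be close to alpha itself.
    print(alpha, round(fraction_at_or_below(pvals, alpha), 3))
```

If the p-value is valid in this sense, rejecting when p <= alpha gives a Type I error rate of about alpha, which is all the procedure promises on its own.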

• The criminal verdict analogy is helpful in some ways, but I have two problems with it. First, it neglects Andrew’s point about confusing p(A|B) with p(B|A). Second, significance tests and confidence intervals measure only the uncertainty due to random sampling or random assignment, but in practice, we usually have many other reasons for uncertainty (e.g., possible biases in observational studies). I think of statistical significance not as a guilty verdict against H0, but as a *minimal* standard for evidence against H0 to be presented to a jury. Lack of statistical significance is a good reason to say “Don’t jump up and down,” but statistical significance by itself is not enough reason to say “Jump up and down.”

I used this analogy in an intro stats class: The prosecution could call a witness who says, “So-and-so told me Winston’s a serial killer.” (One student let out a rich, hearty laugh, but no one else did.) But that would be hearsay, which is not admissible in a court of law. So the judge would say “Strike that from the record” and instruct the jury to pretend they never heard that testimony. That’s failure to reject H0. Rejecting H0 should just mean the evidence is allowed in court. A verdict should only be reached after weighing different pieces of evidence.

(Of course, using statistical significance as a threshold for evidence may not always be best for discovering the truth. But the same can be said of our legal system.)

The value of significance tests, to the extent that there is any, is as a restraining device. David Aldous puts it pithily in his mini-review of Ziliak and McCloskey:

http://www.stat.berkeley.edu/~aldous/157/Books/stat.html

• Re: confusing p(A|B) and p(B|A) – yes it neglects this but it’s not hard to extend the example. In trials where, actually, only innocent people were tried, the proportion of actually-innocent among those we convicted is 1. Flip the example around and you get proportion zero. We see that, if you don’t know which is the case (or are unwilling to say) the procedure gives you no indication of what that proportion is. (nb less extreme examples might be better in a class, of course)

Re: the analogy. I don’t think being struck from the record is so helpful. Evidence should be (and hopefully is) presented that supports you not being a killer, not just that’s unusual if you are a killer. I’m also uncomfortable about statistical significance being a threshold for presenting results, versus not presenting them; if one must do testing it hopefully comes at the end of the analysis, not at some interim step – e.g. it’s not a great idea to test your way to the “right” analysis.

• Thanks for your comment, George. I agree with everything in your 2nd paragraph. Whether a study gets published shouldn’t depend on whether the result is statistically significant.

The “strike that from the record” part of my analogy only means that when people ignore published results that aren’t stat sig, that’s like instructing the jury to disregard a witness’s testimony. I didn’t mean to promote this as the ideal paradigm; I meant to reveal what a weak paradigm it is. My larger point is: “Lack of statistical significance is a good reason to say ‘Don’t jump up and down,’ but statistical significance by itself is not enough reason to say ‘Jump up and down.'”

By “weighing different pieces of evidence”, I meant weighing different studies.

• Phil, that’s a nice summary and a good example. But to play devil’s advocate, someone could say: Sure, but if you flipped your coins many more times it would not be consistent with it being a normal (fair) coin. The p-value would be extremely low, it would not be consistent with the null random model, and thus I conclude naively that the p-value being only consistent with the null model is correlated enough with the posterior probability of the null model in most cases that we can gloss over this technicality. Do you have simple, compelling counterexamples to this? There are nice conceptual ones in Cohen’s classic http://ist-socrates.berkeley.edu/~maccoun/PP279_Cohen1.pdf but they often come off as non-realistic/contrived to my non-statistical colleagues. And they are difficult to explain.

• Indeed, since the “result”, whatever it is, is due to random processes, it is *always* due to chance. That probability is 1, regardless of the observed p-value.

Better to say with Daniel Lakeland (above) that *only* chance is involved (and the null hypothesis is actually true). I like his comment about conditioning. I think it makes it very clear. Here’s the link again:

http://statmodeling.stat.columbia.edu/2013/03/12/misunderstanding-the-p-value/#comment-143462

6. I think your PS is spot-on. I actually think this is a much bigger problem than bad science journalism is. I think it is a flawed inferential process in multiple branches of science (including the two I’ve called home). Pretty hard to get great journalism out of bad primary source documents. Not impossible, mind, but with the state of journalism in general now, it might as well be.

7. I don’t think it’s too hard to communicate what the p-value is.

VERY nontechnical: Well, you know the old saying that if you fling enough poop at a wall, some will stick? The p value is the proportion that we won’t clean off.

Somewhat nontechnical: If there is really no effect, how likely are results like these? (Not perfectly correct, I know, but in the ballpark)

Not too technical: If, in the population from which this sample was drawn, there really was no effect at all, how likely is it, in a sample of the size of this one, that we will get a test statistic at least as far from 0 as we got?
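One way to turn the "not too technical" version into a calculation is an exact permutation test, sketched below. This swaps in a slightly different null than the one in the sentence above: instead of sampling from a population with no effect, it assumes the group labels were assigned at random and that the treatment changes nothing. The data values and the function name are hypothetical.

```python
from itertools import combinations


def permutation_p_value(treated, control):
    """Share of all relabelings whose |difference in means| is at least as
    large as the observed one, assuming labels are arbitrary under the null."""
    pooled = list(treated) + list(control)
    n_t = len(treated)
    n_c = len(pooled) - n_t
    total = sum(pooled)

    def abs_mean_diff(treated_sum):
        return abs(treated_sum / n_t - (total - treated_sum) / n_c)

    observed = abs_mean_diff(sum(treated))
    hits = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_t):
        n_splits += 1
        # Small tolerance so floating-point ties count as "at least as far".
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            hits += 1
    return hits / n_splits


# Made-up measurements, for illustration only.
treated = [5.1, 6.0, 6.4, 7.2]
control = [4.0, 4.8, 5.5, 5.9]
print(permutation_p_value(treated, control))
```

The returned number answers only the question "how often would relabeling alone produce a difference this large?" It is computed entirely under the no-effect assumption, so it cannot by itself be read as the probability that there is no effect.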

8. One of the main sources of confusion about the p-value is the supposition that it is a conditional probability in the first place. It is not! It is the probability of D > d for a relevant distance measure D, computed under the assumption that the data arose according to the distribution in the null hypothesis. If computed correctly, and not just as an “isolated record” the generation of small p-values warrants inferring a genuine discrepancy from the null (the type and magnitude of which may be determined). Since “H: there is a genuine discrepancy” passes with severity, one may infer H. What I think happens next involves an informal use of the word “probability” in English, which is NOT akin to the formal Bayesian computation. One may say of a claim that has passed with high severity or one that is highly corroborated that “it’s probably true”. Then the Bayesians jump up and down declaring misinterpretation!

• … so that’s p=Pr(D > d | null hypothesis holds for the process generating D). How then is p not a conditional probability?

Perhaps, in some non-standard way, you are reserving the term “conditional probability” to mean something more restricted than just Pr(event|condition)? (If so, what?)

NB your notion of being “computed correctly” is not obvious either. Difficulties of misinterpretation occur when the p-values we use in practice have exactly their advertised properties.

• Mayo:

You write, “One of the main sources of confusion about the p-value is the supposition that it is a conditional probability in the first place. It is not.”

But the p-value is a conditional probability. It is Pr(T(y.rep)>T(y)|H) or, in words, the probability that future data will be more extreme than current data, conditional on the hypothesis being true.

• No, it’s not, the required joint and prior are absent. That is why people like David Freedman introduced the “||” double bar. Can’t send a reference now, traveling.

• I don’t see the need to introduce new notation, given that the existing notation is unambiguous.

• Steffen Lauritzen introduced a double bar to represent intervention conditioning, equivalent to the do() notation of Pearl… which can’t be what you mean. Look forward to seeing which new notational overlap statisticians will have to cope with.

• And of course I don’t think it’s appropriate to have a notation for intervention conditioning, given that different interventions will have different effects (as discussed various times on this blog).

• Andrew: could you link to some of the discussions you are referring to? Is this a real disagreement between yourself and Pearl?

I have no idea if I have a real disagreement with Pearl. He and I seem to work on different problems.

• Andrew: Thanks for the link – but your argument in that post is only that a particular model is an oversimplification of reality. Are you making the strong claim that intervention conditioning doesn’t make sense _in principle_?

The point is that “intervening” on a variable x is not in general uniquely defined. Different interventions that affect x can have different effects on other variables in the system. I think that studying interventions is very important; I just don’t think it makes sense in general to take variables in a system and define an (implicitly unique) effect of changing them, if there can be more than one way to make the change.

• Andrew: so you are saying that the notion of an idealized intervention (i.e. one for which any potential effect on other variables in the model can be ignored) cannot be useful? Or are you just saying that the notion is irrelevant to social science because you cannot envisage such interventions in that subset of application domains?

I think the notion of an idealized intervention makes sense in some contexts but not others. It depends on whether the various ways of altering a variable x will have different effects on other variables y. In some settings, I think it can make sense to consider an idealized intervention on x, in other settings not.

• Mayo: Conditional probability is just conditional probability. No joint or prior are needed. This is basic probability theory.

• What Mayo means is this: P(A|B)=P(A,B)/P(B). In the frequentist framework, you cannot condition on something that is not a “random variable” because that would imply willingness to write a joint distribution involving it. Yes, it’s basic probability theory, but there are two basic probability theories out there and despite sharing the same name they’re different.

• I agree with Deborah.
There is no conditioning.
For conditioning, you need to condition on something random.
From the frequentist perspective, H0 is fixed, not, random.
It makes no sense to talk about conditioning.

An analogy: in frequentist inference one writes p(x; theta)
not p(x|theta). The reason is that, in frequentist inference, theta is not
a random variable so you can’t condition on it. There is no prior on theta
and there is no joint distribution for X and theta.

Of course, you could insist that, in a Bayesian framework, H0 (or theta) are random.
But forcing the frequentist calculation into a Bayesian framework is not helpful; it only

So, to repeat: Deborah Mayo is 100 percent correct. There is no conditioning involved in
a pvalue (or a confidence interval)

–Larry

• Larry:

You can say it all you want, but it still seems silly to introduce a new notation “;” which means “conditional on,” given that we already have “|”. But it all depends on context. For me, it’s simpler to use “|” for conditioning, whether or not a probability distribution is assigned to the thing on the right side of the expression. You and Mayo prefer to use two different symbols for the two situations. Really I think either notation is OK because either is unambiguous.

• Larry, H0 may not be random now, but what if the physicists are right about there being multiple universes? Then H0 would be a random variable all of a sudden. Everyone would have to redo the notation in their textbooks.

In all seriousness, if anyone wanted a new notation which would greatly reduce confusion, then there’s much better candidates than this asinine debate over “|” vs “;”. How about using a different notion for frequency distributions and probability distributions?

For example, if we’re working with a sequence of coin flips such as: HHTHTHTT

one can say Fr(H)=.5 or f=.5, which should keep the Frequentists perfectly happy. Then when a Bayesian deals with probabilities they use Pr() or p. For example, Pr(H on first flip)=.99. Pr != Fr in general but so what? Frequentists could refuse to condition their Fr() on anything and Bayesians can always condition their Pr(). Everyone’s happy.

• @Entsophy: a fine suggestion that would prevent a huge amount of confusion, but we all know it will not be implemented. Neither group is prepared to back off and leave the word “probability” to the other group. So both groups will continue to create confusion by using “probability” to refer to completely distinct concepts.

This is understandable: both groups think that _their_ notion of probability is more fundamental, and that it would be a step backwards (for science and also for the public understanding of science) to cede usage of the most central word in statistics to a less fundamental concept. So we have no choice but to fight this battle the old-fashioned way: do another century’s worth of science and see which idea wins out in the end.

• A random variable is still a random variable even if its value is certain. In that case, its probability is 1. Unless you are willing to stipulate that no random variable can ever have probability 1, I don’t see the issue, or the distinction.

Put another way, if you wish to deny H_0 the status of a random variable and thus the ability to put it to the right of |, I suppose you can do that; but what you have is a cramped probability theory that is a subset of the probability theory that I use. But your preference does not refute Andrew’s (correct, from our point of view) definition of a p-value.

• I’m amazed that competent statisticians are even debating this point. Mayo and Wasserman are absolutely and unequivocally correct. The statement “conditional on H0 being true” is the source of confusion in this context. In frequentist statistics it means “evaluating the probability of a certain legitimate event, say {d(X) > d(x0)}, under the scenario that H0 is true”; any other interpretation is illegitimate. In frequentist statistics legitimate events are only those that belong to the sigma-field generated by the sample space, i.e. any Borel function of the sample X. As pointed out by Wasserman, conditioning on assertions pertaining to the unknown parameters, like theta=theta0, makes NO sense in frequentist inference; period!

• That is right Aris Spanos

To me, the main source of confusion is to informally define p-value as:

p = P(T> t; under H_0).

where T is a positive statistic (the greater, the more evidence against H_0). It is not a conditional probability, but this informal definition drives the reader to think it is.

As I explained below:

Here, “under H_0” means a family of probability measures induced by T that are indexed by theta \in \Theta_0 (the null set).

In other terms: in the parametric frequentist context the parameter theta is an indexer of probabilities. Let the family of all possible probability measures be \mathcal{P}, then we can index this family by a finitely dimensional vector \theta \in \Theta, say:

\mathcal{P} = \{ P_\theta; \theta \in \Theta \}

Basic example:

If \Theta = \{1,2,3\}, then \mathcal{P} = \{ P_1, P_2, P_3 \}, in this case we have three possible measures to explain the observed data, we can test if:

H_0: “Among all measures in \mathcal{P}, P_1 better explains the data”.

or, equivalently and shortly,

“H_0: theta = 1”.

Of course that P_1 is not a conditional probability, it is the very same measure listed before in \mathcal{P}. Again, for a classical (read frequentist) statistician, \Theta is just a set of probability indexes.

I think it is not too hard to understand that a classical statistician works on families of probabilities (\mathcal{P}) and under the null hypothesis we are just restricted to a reduced family (\mathcal{P}_0 \subset \mathcal{P}). The measures in \mathcal{P}_0 are not conditional ones, they are just a subset of the main family.

• Aris:

You write of correctness and legitimacy and conclude with “period!” All I can say is I’m glad that people such as yourself aren’t in the position of telling people such as myself what can and can’t be done. I had enough of that at Berkeley during my time there, when my colleagues went around saying that what I was doing was wrong (or went around falsely claiming there was no Bayesian applied statistics), without actually looking at what I was doing!

As I and various other commenters noted above, if it makes you comfortable you could replace “|” with “;” and replace “Pr” with “Fr” all over the place. I don’t see the need, but if it makes things clearer to you, go for it!

• Andrew, Spanos:

Here are some suggestions for what we can call these probabilities instead:

(I) Prince Probabilities (probabilities formally known as conditional)

(II) A.P.B. Probabilities (All-Points Bulletin probabilities, because if someone could put out an APB and find those darned missing priors we could call them conditional again. Where did they go? Alaska?)

(III) Contingent Probabilities (because “conditional” is just so gauche)

(IV) Walmart Probabilities (because they’re a poor man’s Bayesian probabilities)

(V) Hashish Probabilities (because it’s something people got a lot of at Berkeley)

• I suspect the confusion/disagreement here stems not from whether H0 is a legitimate event (or theta a legitimate random variable) but rather from what one means by the term ‘conditional probability’. Two definitions:

1. P(A |; B)=P(A,B)/P(B) (replace this with something related that handles zero-measure B)

2. “X is the conditional probability P(A |; B)” means “If B, then probability of A is X”.

Even if we agree to use only frequency probabilities and that H0 is not a legitimate event, isn’t the p-value still a conditional probability according to definition (2) above? (As you write, “the probability of a certain legitimate event, under the scenario that H0 is true” – this seems equivalent to “If H0 is true, the probability of a certain legitimate event is p”).

It seems to me that prof. Gelman (perhaps implicitly) uses definition (2), whereas the frequentists here insist on using definition (1). Under definition (1) p-value obviously is not a conditional probability, as H0 is not a legitimate event. However, I see no reason to not use definition (2) as
– it is intuitive that ‘conditional’ something means this – ‘if-then’ is a ‘conditional statement’!
– it agrees with (1) in those cases where B is a random event / random variable
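
To make the two definitions concrete, here is a minimal Python sketch on a made-up finite joint distribution. The numbers and the names `joint` and `cond_prob` are illustrative assumptions, not anything from the thread. When B is a genuine random event with positive probability, definition (1) gives the same number as the "if B, then the probability of A is X" reading of definition (2).

```python
from fractions import Fraction

# Hypothetical joint distribution over (A, B); the values are made up.
joint = {
    (True, True): Fraction(1, 10),
    (False, True): Fraction(3, 10),
    (True, False): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

def cond_prob(joint, b_value=True):
    """Definition (1): P(A | B=b) = P(A, B=b) / P(B=b)."""
    p_b = sum(p for (a, b), p in joint.items() if b == b_value)
    if p_b == 0:
        # Definition (1) breaks down here; this is the zero-measure case.
        raise ValueError("P(B) is zero; definition (1) needs special handling")
    return joint[(True, b_value)] / p_b

p_a_given_b = cond_prob(joint)
print(p_a_given_b)  # 1/4
```

The explicit check for P(B) = 0 marks the spot where the two definitions can part ways.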

• Juho: that’s probably the final word on the subject since you made it about as clear as possible.

Hopefully, Dr. Gelman will run a similar post on the times journalists and others have misinterpreted the Bayesian stand-in for p-values:

P( mu > mu_0 : data)

Can anyone point to any examples which could serve as illustrations of how difficult it is to interpret this quantity?

• @Andrew and Juho:

[begin Quotation]

I suspect the confusion/disagreement here stems not from whether H0 is a legitimate event (or theta a legitimate random variable) but rather from what one means by the term ‘conditional probability’. Two definitions:

1. P(A | B) = P(A,B)/P(B) (replace this with something related that handles zero-measure B)

2. “X is the conditional probability P(A | B)” means “If B, then probability of A is X”.

[end quotation]

I think that there is a huge difference between “conditional probability” and “conditional knowledge”. We cannot interchange these concepts.

I can select a measure by using conditional knowledge without using the probability rules. Classical statisticians use possibility rules to select their initial probability measures, but unfortunately they are not aware of it. The knowledge about other types of coherent measures is missing.

1. We set possibility one to all elements of our initial family of probability measures.

2. For the null hypothesis, we just set possibility one to all elements of the null restricted family (under H0) and possibility zero to all elements outside of this null restricted family.

The problem here is that statisticians (mostly Bayesians) think that *all* uncertain events must be modeled by using probability rules. Of course they are wrong and do not want to see this.

I would really appreciate it if you could provide a response to this post.

Best,
Alexandre.

• Alexandre:

You write, “The problem here is that statisticians (mostly Bayesians) think that *all* uncertain events must be modeled by using probability rules. Of course they are wrong and do not want to see this.”

I have no idea what people want to see so I can’t comment on that last bit. But I do agree with you that it is wrong to say that any sort of problem must be modeled using probability rules. In my own work, I have found it extremely helpful to model all uncertain events using probability rules. But I recognize that others do excellent work using other methods. I like Juho’s definition 2 but I respect that other notation might be useful for your purposes.

• I think the core point of my post was not captured:

It is not fair to call a measure that was selected by using *possibility rules* a “conditional probability”. It is even more unfair to make this claim in a frequentist context, because it will feed misunderstandings. The selection of the probability measure, under H0, IS NOT made by using probability rules.

*Again and still very important*: All the probability measures listed in the initial family can indeed be conditional probabilities by definition. HOWEVER, given our initial family, when we select a subfamily to represent our null hypothesis, this selected subfamily does not contain *conditional probabilities on the null hypothesis*; they are the very same probability measures that were listed in our initial family. In a classical (frequentist) context, if we consider *conditional probabilities on the null hypothesis* then we are not really using the definition of “conditional probabilities”.

I think that in the light of Measure Theory, all these issues become clear. On the other hand, it is hard to get out of the probability toolbox… I recognize.

• Patriota: “I think that in the light of Measure Theory, all these issues become clear. On the other hand, it is hard to get out of the probability toolbox… I recognize.”

I think the issues do become clear, but not in the sense that you mean. I can always establish a measure space in which H_0 is the only point. Then all the stuff about P(A|H_0)=P(A,H_0)/P(H_0) goes through trivially.

Even from a frequentist point of view, {A,H_0} is a random variable, since your A is a random variable. And P(H_0)=1 under measure theory if that measure space just contains H_0 as the only member.

Again, I don’t see the problem. You are making everyone jump through hoops that don’t really exist.

• Bill Jefferys,

“I can always establish a measure space in which H_0 is the only point. Then all the stuff about P(A|H_0)=P(A,H_0)/P(H_0) goes through trivially.”

Of course you can always do this; however, if you are a frequentist you will not do that, for the reasons I explain below. A classical statistician uses rules other than the probability ones to choose what measures should be included in the initial and null families. A classical statistician defines a family of probabilities in which each element has full POSSIBILITY (they set only possibilities and no probabilities; if you want to impose rules over the initial family you can, but it is not necessary). That is, there is no preference among the measures in a very broad sense (this is not the same as setting a uniform probability or any other objective prior probability, since they all impose a type of preference order).

OK, let’s see why a frequentist does not define prior P(H0)=1.

Suppose X is a binomial random variable (n,p). Then our family of possible measures is P= {P_p, p in (0,1)}, where P_p is a binomial probability measure for each p in (0,1). We want to test:

“H0: 0.4 <= p <= 0.5 or 0.8 <= p <= 0.9”

then, under H0, our family of possible measures is now restricted to the null family: P0 = {P_p, p in [0.4,0.5] U [0.8,0.9]}

You are saying that we can always define a (probability) measure Q that represents the statistician’s uncertainty such that Q(P0) = 1. We can define it, but it is not necessary; actually it is very restrictive, since:

If Q is a probability measure then we must have that

For A, B in P0 disjoint then Q(A U B) = Q(A)+Q(B).

This rule is irrelevant for a classical statistician; the only measure over the family of probabilities that counts for a classical statistician is a more conservative one:

For any A in P0 then Q(A) = 1. This implies that Q cannot be a probability measure, but it is a possibility measure: Q(A U B) = max(Q(A), Q(B)).
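
The contrast between the two rules can be sketched on a small finite stand-in for the null family. Everything here is an illustrative assumption: the grid of binomial p values, the 0/1 possibility measure, and the uniform probability measure chosen as one example of an additive Q. It is a toy version of the argument above, not the commenter's own construction.

```python
from fractions import Fraction

# Finite stand-in for the null family: a few binomial p values (an assumption).
null_family = [0.40, 0.45, 0.50, 0.80, 0.85, 0.90]

def possibility(subset):
    """Possibility measure: 1 if the subset contains any candidate, else 0."""
    return 1 if any(p in null_family for p in subset) else 0

def uniform_probability(subset):
    """One particular additive probability measure on the same grid."""
    return Fraction(sum(1 for p in subset if p in null_family), len(null_family))

A = {0.40, 0.45}
B = {0.80}  # disjoint from A

# Maxitive rule: Q(A U B) = max(Q(A), Q(B))
poss_union = possibility(A | B)
# Additive rule: Q(A U B) = Q(A) + Q(B) for disjoint A and B
prob_union = uniform_probability(A | B)

print(poss_union, prob_union)  # 1 1/2
```

Under the possibility measure every nonempty subset of the family gets value 1, whereas any additive measure has to split a total mass of 1 among the candidates.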

• Bill,

People think that we can always put probability one on a set that we believe in for sure, for instance P(Theta0)=1; however, in doing so you are imposing the rules of probability on all subsets of Theta0 (and moreover, many subsets of Theta0 cannot be measured properly, because of the Banach–Tarski paradox for additive measures).

Moreover, if we want a fixed point (in [0,1]) that represents full uncertainty for every event whose occurrence I am really ignorant about, then we MUST give up the probability rules. If you cannot see this, I will show you once a submitted paper of mine gets published.

• I agree with Bill but would add

the “hoops that don’t really exist” do exist as abstract representations that are convenient (necessary?) to represent all possibilities but not to represent anything that did happen in this universe (which is finite).

If one defines the subject matter of (applied) statistics as the uncertainties about what happened in this universe, continuous representations are not necessary (though they may be very convenient).

Continuity is a very strange representation (model) which implies many weird and paradoxical things (Axiom of choice, all events individually having probability 0, Borel paradox, etc.,etc.) The sad problem comes when these are projected into reality as problems of or for statistical inference.

• It drives me nuts that I can’t reply to K? O’Rourke or others directly due to nesting issues.

I agree with K? that the paradoxical problems of continuous probability / measure theory are simply not problems of the actual universe, but purely mathematical problems. Here is one reason why:

For every measurement of the macroscopic universe that will ever be performed by humans, that measurement will have some resolving power. Let’s pretend that the highest resolving power will come from an electronic A/D converter of some measurement instrument yet to be devised, and that it has 256 bits of resolving power. Today, very high quality A/D converters have maybe 31 bits.

So under this hypothesis that the best ever measurement instrument will have 256 bits of resolving power, any scientific hypothesis involving sample spaces larger than 2^256 different possible finite outcomes is not a testable scientific hypothesis. PERIOD.

Now, let’s examine some physical reality of the universe: according to Wikipedia current approximate calculations give the number of protons/electrons in the universe as around 10^80
http://en.wikipedia.org/wiki/Observable_universe#Matter_content

This means to have my hypothetical 256 bit A/D converter we would have to accurately count all the electrons in approximately 1/1000 of the entire universe. I assert that this will never occur, so every probability sample space on scientific measurements has less than 2^256 distinct discrete possible outcomes, each distinct outcome has a perfectly ordinary probability associated to it.

Continuous probability distributions are purely a convenience for not having to work with an exactly known quantity of discrete outcomes, and not having to carry around sums that contain 2^15 terms and so forth.
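
The "approximately 1/1000" figure above is easy to check with exact integer arithmetic. This is only a sanity check of the arithmetic, taking the 10^80 particle count quoted from Wikipedia at face value; the variable names are made up for illustration.

```python
from fractions import Fraction

outcomes_256_bit = 2 ** 256    # distinct readings of the hypothetical 256-bit converter
particles_estimate = 10 ** 80  # rough particle count quoted in the comment

ratio = Fraction(outcomes_256_bit, particles_estimate)
print(float(ratio))  # about 1.16e-3, i.e. roughly 1/1000
```

So 2^256 is about 1.16 x 10^77, which is indeed on the order of one thousandth of 10^80.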

• K? O’Rourke,

Do you agree with Bill about the probability one for the null hypothesis to represent “under H0”, that is, “under H0”, our belief that it is true is Q(H0) = 1, where Q is necessarily a probability measure?

As I showed above, Q cannot be a probability measure when it comes to classical statistics. There does exist a measure Q, but it is another type of measure: it is neither a guess nor an opinion. The problem is interpreting from informal definitions without understanding what happens behind those informal definitions. The result is an obvious mess.

Statisticians are well-trained in probability operations, so well-trained that most of them dogmatically think that probability is the unique way of expressing uncertainties. People will say that they are not doing this, but they still are. If you do not know any measures other than the probability one, then the following possibly applies to you: “if the only tool you have is a hammer, you tend to treat everything as if it were a nail”.

I agree with you about continuity, but you do not question the additive property. This axiom is very restrictive; there are many other types of rules that avoid many problems caused by the additive property. Most probabilists justify their dogmas by creating definitions of “coherence”. Of course coherence can be defined in MANY different ways, and I can show you that being coherent under one system you are being incoherent in another.

Best,
Alexandre.

• Just to avoid being misunderstood:

Classical statisticians:

1. Start with a family of possible measures to explain the phenomenon (MEASURE IMPLICITLY USED: POSSIBILITY)

2. Any measure inside this family is a probability measure (MEASURE EXPLICITLY USED: PROBABILITY)

3. Conduct hypothesis tests to reduce the initial family of probabilities to a smaller one (MEASURES IMPLICITLY AND EXPLICITLY USED: PROBABILITY, POSSIBILITY AND QUASI-POSSIBILITY).

The measures used in 3. are:
1) a possibility measure to select a subfamily
2) then a p-value is computed, which is a probability over the sample and a quasi-possibility over the parameter space.

• Patriota: I am commenting here because some of your stuff just can’t be replied to due to nesting restrictions (Andrew, can we allow maybe 2 or 3 more levels of nesting? this problem happens frequently on your blog)

Bill Jefferys says we can always create a brand new measure space which has exactly one element in it H0 and then assign H0 a probability 1. You then claim that this is enormously restrictive because all the subsets of H0 then have to have a probability structure. But there is only ONE element in this probability space, namely H0, so there are no subsets. QED.

The point is that around here, the Bayesians want to say:

p = P(future data will be as or more extreme | H0)

and they want Frequentists to agree to this, and frequentists don’t agree, because they say that H0 is not random, but the reply from the rest of us is: ok, let’s make it random in a way that should be acceptable to Frequentists: random but certain.

We augment the sample space of outcomes by the cross product with a space containing a finite collection of hypotheses which you’d like to test, and we put a probability measure on this sample space of hypotheses, and to satisfy the frequentists in the crowd we say that

the p value is P(future data is more extreme than observed | p(H0) = 1)

I.e., if you don’t want randomness on your hypotheses, then you just require that exactly one element of the space of hypotheses has probability 1, and then all other elements must have probability 0. I think this is insane, but it’s a perfectly reasonable way to deal with the frequentists’ objection that “H0 is true or it’s not true, and it can’t be random”: that is the same as saying that, among the necessarily finite set of hypotheses available, exactly one has probability exactly 1. The subsets problem doesn’t occur. We may not know which hypothesis is true, but we know that the probability measure on the set of hypotheses assigns probability 1 to exactly one hypothesis, which should be perfectly fine for frequentists; we’re just trying to figure out which of those hypotheses is the one that has p=1.

Now, frequentists may prefer not to do this, but they really shouldn’t in principle object to it.

You’ve already agreed that in actual scientific practice everything must be finite sets, and that continua are really just idealizations that prevent us from having to deal with enormous finite sums and things. So Banach-Tarski and all that stuff is not relevant here.

• Daniel,

“Bill Jefferys says we can always create a brand new measure space which has exactly one element in it H0 and then assign H0 a probability 1. You then claim that this is enormously restrictive because all the subsets of H0 then have to have a probability structure. But there is only ONE element in this probability space, namely H0, so there are no subsets. QED.”

OK, but only when the null hypothesis is “theta in {theta0}” (or, equivalently, theta = theta0). When we want to test equality of two population means it does not happen (see also my example on Binomial distribution above), see:

Let X and Y be two independent random variables with normal distributions with variances equal to one (i.e., X ~N(mu1, 1) and Y ~N(mu2, 1)), our vector of parameters is theta = (mu1, mu2) and our parameter space is Theta = R^2 = {(mu1, mu2); mu1,mu2 in R}. We are interested in testing

H0: mu1 = mu2,

therefore, our null set is Theta0 = { (mu1, mu2) in Theta; mu1 = mu2}. There are infinitely many elements there. I can provide many other examples if you want, but I think that the one I presented before and this one are enough to respond to you.

You said:

“the p value is P(future data is more extreme than observed | p(H0) = 1)”

No, it is not. You are right that we can set a measure Q(Theta0) = 1 to represent “under H0”. However, we cannot impose that this measure Q SHOULD be a probability one. Frequentists do not do that at all; they do use an implicit measure Q, but this measure is just a possibility measure rather than a probability one.

I recommend reading my previous notes to avoid repetitions of the same arguments here. Thanks

All the best,
Alexandre.

• The essence of the frequentist complaint against P(A|B) = P(A and B) / P(B) is that when B is a hypothesis it doesn’t have a probability associated to it. It is either a true hypothesis or a false hypothesis, and therefore not random.

Jefferys and I seem to claim that if you give us a finite set of distinct hypotheses and insist that exactly one of them is true forever and always (but we don’t know which one!) then we can simply define a probability space in which one of those hypotheses has p=1 and the others all have p=0. Then when we take the cross product of this probability space with all the other probability spaces in question (related to outcomes labeled A above), we get a new probability space in which the conditional probability statement makes perfectly good sense, and in which ALWAYS the true hypothesis is the only outcome that can occur, just like the Frequentist requested.

The fact that I don’t *know* the measure over the finite set of hypotheses doesn’t mean there can be no measure. I mean after all if I have a slightly biased die I don’t know the measure it generates either.

In other words, using something other than regular conditioning is just a way for Frequentists to avoid being a special case of Bayesians who only admit probabilities = 1 or 0 over things called “hypotheses”.
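
Here is a minimal sketch of the construction just described: a finite set of hypotheses with a degenerate (0 or 1) measure on it, crossed with a space of data outcomes. The two sampling models and their numbers are made up for illustration, and the names `likelihood`, `prior` and `conditional` are hypothetical.

```python
from fractions import Fraction

# Hypothetical sampling models: P(data | hypothesis) for a binary outcome.
likelihood = {
    "H0": {"extreme": Fraction(1, 20), "not_extreme": Fraction(19, 20)},
    "H1": {"extreme": Fraction(1, 2), "not_extreme": Fraction(1, 2)},
}

# Degenerate measure over hypotheses: exactly one has probability 1.
prior = {"H0": Fraction(1), "H1": Fraction(0)}

# Joint measure on the product space (hypothesis, data).
joint = {(h, d): prior[h] * likelihood[h][d]
         for h in likelihood for d in likelihood[h]}

def conditional(data, hyp):
    """P(data | hyp) = P(data, hyp) / P(hyp); undefined when P(hyp) = 0."""
    if prior[hyp] == 0:
        raise ZeroDivisionError("conditioning on a probability-zero hypothesis")
    return joint[(hyp, data)] / prior[hyp]

p_value_like = conditional("extreme", "H0")
print(p_value_like)  # 1/20
```

Ordinary conditioning recovers the tail probability under H0, while conditioning on H1 is undefined in this particular space. To talk about H1 one would have to build a different space with the point mass moved.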

• I will mention though that I see that when P(B) = 0, dividing by it becomes problematic mathematically. On the other hand, when P(B) = 0 then the numerator is 0 as well. Jefferys (I expect) and I will say that if you want to test a hypothesis that H0 is true, simply set up a probability measure where p(H0) = 1 and all others are 0. Later, when you want to test some other hypothesis H1, you set up a DIFFERENT measure with p(H1)=1 and all others 0.

In this sense, the p values calculated as this type of conditioning are incommensurable, they refer to outcomes within different probability spaces. That’s fine with me. All this is just a way of saying that there are some hoops to jump through for Frequentists to understand p values in terms of conditioning. But these hoops are not logically precluded, they just aren’t the preferred interpretation for frequentists.

• Daniel (Bill and others):

You can set a probability for the null hypothesis or equivalently for the null parameter space. I also can plant a tree in my bathroom, but I will not do that, since it will obstruct my way a lot. Got it? If not, let me explain again:

Let Theta be our parameter space, that is, in a classical context, it is a set of indexes for our family of possible measures: F = {P_theta; theta in Theta}. That is, all P_theta, for theta in Theta are possible measures to govern the data behaviour. That is the beginning, OK?

Does this mean that we are giving probability one to the family F? NO, it does not at all!! Let’s see why?

Suppose Q(F) =1, where Q is a probability measure. Then what are the implications of it?

1. Q(F) = 1 and Q(Empty) = 0
2. If F1 and F2 are two disjoint subfamilies of F, then Q(F1 U F2) = Q(F1) + Q(F2)

Supposing that F is a dense set, we have the following implications:

a) We know from the Banach-Tarski paradox that there are many subsets in our family F that cannot be measured by using probability rules. That is, it is not possible to compute a probability for all elements of the power set of F. A possibility measure can measure all elements of the power set!!!!!!

b) For each P_theta in F, we trivially have that Q( P_theta ) = 0. Here we have a problem: I start by saying that P_theta has possibility one, and now it has probability zero. Is that not strange? Of course it is!! Why?? Because we cannot set probabilities if we start with possibilities.

That is, Q cannot be a probability measure if we want to consider possible all elements of F. Got it? If not, I am sorry but I cannot explain it here unless you want to understand what I am trying to say.

• Actually, there are two subsets. The empty set is a subset (with probability measure 0), and the subset consisting of just H0 is also a subset (with probability measure 1). This is a perfectly legitimate probability space, just a very simple one. Since everyone, I think, also agrees that A={a} is a probability space, it follows that the direct product of these two is also a probability space, and it is one within which the definition of conditional probability, namely

P(a|H0)=P(a,H0)/P(H0)

for a, a subset of {a}, makes perfect sense.

• Just to add a comment on one of Patriota’s earlier comments: The two subsets in my trivial space, the empty set and the set consisting of just H0, are disjoint, since their intersection is (trivially) the empty set. And one therefore can compute the sum of the two measures, and it is equal to the measure of the union of those two sets. Which is exactly how a measure space is supposed to behave.

• OK, when Theta0 has only one element it is OK, because in this trivial case probability and possibility measures agree. But, in general, it does not happen, since Theta0 has more than one element: for testing equality of two population means, mu1=mu2; for testing if “0.4 <= p <= 0.5 or 0.8 <= p <= 0.9” in a binomial distribution; and so on.

When statisticians open their minds to study other types of measures, maybe these kinds of misunderstandings will considerably diminish.

I am honestly waiting for a good and deep response to the central idea of my posts (not responses to marginal sentences).

Bill, I am waiting for a response to my comments. If you disagree with my comments, please let me know why.

All the best,
Alexandre.

• Alexandre:

I think you can handle the case where mu1=mu2 by writing diff=mu1-mu2 and conditioning on diff=0. The same argument goes through.

I don’t understand what you are driving at with your binomial example. Can you be more explicit?

• Sorry Bill, but the null parameter space under “mu1 – mu2 = 0” is the very same: Theta0 = {(mu1, mu2): mu1 – mu2 = 0}, that is, infinitely many (mu1,mu2) are in Theta0.

• Sorry, Alexandre, all you have to do is to transform variables. Use mu1 and diff. It’s straightforward. in the new variables mu1 is ignorable.

• Bill,

If your parameter space is Theta = {(mu1,mu2); mu1,mu2 in R} and you want to test H0: mu1 = mu2 or equivalently H0: mu1 – mu2 = 0, then the null parameter set is the same Theta0 = {(mu1,mu2) in Theta; mu1 = mu2}.

(0, 0) is in Theta0
(1, 1) is in Theta0
(2, 2) is in Theta0 and so on

If your hypothesis is “H0: mu1=mu2=0” then your null has only one element Theta0 = {(0,0)}.

Well, there are uncountably many other examples where Theta0 has more than one element or is a dense set (see the binomial example). In the binomial example I just want to test if the proportion is in [0.4,0.5] U [0.6,0.9]. What is your null set? How many elements does it contain?

• Alexandre, you are not getting the point. The ONLY way that mu1 and mu2 enter into the inference is through their difference. Therefore the point that you are making doesn’t make sense.

The easiest way to see this is to write

X1=U1+mu2+diff, X2=U2+mu2,

where diff=mu1-mu2 so that U1 and U2 have expectation 0.

Then take any of the two-sample t statistics for illustration. Make the substitutions for X1 and X2. You will find that the resulting statistic DOES NOT CONTAIN EITHER mu1 OR mu2. It depends on diff, but it doesn’t have either of the others hanging around by itself.

This means that the null parameter space is not as you claim. The ONLY thing you need to worry about is diff.

I still don’t get your other example. You’ll have to be much more explicit.
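
The substitution described above can be checked numerically. This sketch uses the pooled-variance two-sample t statistic as one of "any of the two-sample t statistics"; the sample sizes, seed and noise draws are arbitrary assumptions. It holds the noise terms U1 and U2 fixed, sets diff = 0, and varies the common mean mu2.

```python
import math
import random
import statistics

def pooled_t(x1, x2):
    """Two-sample pooled-variance t statistic for H0: mu1 = mu2."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * statistics.variance(x1)
           + (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    return (statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

rng = random.Random(0)
u1 = [rng.gauss(0, 1) for _ in range(12)]  # mean-zero noise terms U1
u2 = [rng.gauss(0, 1) for _ in range(15)]  # mean-zero noise terms U2
diff = 0.0                                 # the null: mu1 - mu2 = 0

def t_for(mu2):
    x1 = [u + mu2 + diff for u in u1]      # X1 = U1 + mu2 + diff
    x2 = [u + mu2 for u in u2]             # X2 = U2 + mu2
    return pooled_t(x1, x2)

t_values = [t_for(mu2) for mu2 in (-50.0, 0.0, 3.7, 1000.0)]
print(t_values)
```

Up to floating-point rounding, the printed values coincide: the common shift mu2 cancels in the difference of means and does not affect the sample variances. This shows only that the statistic is free of mu2; it does not by itself settle what the null parameter space is.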

• Bill,

“You will find that the resulting statistic DOES NOT CONTAIN EITHER mu1 OR mu2”

In this case, the statistic does not depend on them because it is ancillary to Theta_0 (see all of my other comments on Larry’s blog). That is, the induced measure of the t-statistic is the same for all mu2 in R when diff = 0. You transformed the parameter space, but you still have one parameter mu2 (that varies over the real line) there; it does not vanish from the null parameter space just because of the ancillary property of the statistic.

You cannot conclude that, because the measure induced by the statistic does not depend on the other parameter, the null parameter space does not depend on it either.

I recommend that you read my paper, at least the introduction: http://arxiv.org/abs/1201.0400 (to appear in Fuzzy Sets and Systems)

Best regards,
Alexandre Patriota

• It is disappointing to find that you are not willing to see what I’m pointing out. It is a dogmatic position when someone blindly defends one strict position without trying to understand the other side. In my view, in a discussion, first you should try to understand the other side and demonstrate that you understood (i.e., you must be open-minded); then you can show the problems of the other side.

If you are not able to see that the null parameter space for that specific hypothesis is a dense set, and the problem is just because our statistic is ancillary to this dense set, then: I cannot continue either. This is basically a prerequisite for the understanding.

• Please. Do not insult me.

I understand perfectly well what you are saying.

But if you do not understand why your point is entirely irrelevant to my point, then I could (but won’t) say something similar about your attitude.

The measure p(diff=0)=1, p(diff≠0)=0 has nothing whatsoever to do with whatever measures you want to put on mu1 and mu2. That’s the measure I have assigned, and that’s the measure I am sticking to.

I am using it to demonstrate that p(t,diff) is a legitimate probability since it is, with this definition, defined over legitimate probability measure spaces (t being the t statistic). Therefore, the claim that you have to substitute ‘;’ for ‘|’ is only a matter of preference.

Remember, that’s what this discussion is all about.

• It was not an insult, I just asked you to read the introduction of my paper (because there are notations and other definitions that are very difficult to state here) and you said “It is pointless to continue this conversation” (which seems to be a sort of insult).

I perfectly understand your point, and I showed that I understand it by providing an explanation for your claim: the statistic t is ancillary to the null parameter space, that is to say, the distribution of the statistic “t” does not depend on the parameters in Theta0. But Theta0 is still a dense set (it is all parameters that satisfy mu1=mu2); it does not change just because t is ancillary to it.

A hypothesis testing is not only of the type “H: theta = theta0”, we can have many other types such as

“H: theta in (0,1)” that is: theta lies in the interval (0,1)

“H: theta in (-1,0)” that is: theta lies in the interval (-1,0)

“H: theta in (-1,0) U (6,10)” that is: theta lies in the union of the intervals (-1,0) and (6,10)

in a general way:

“H: theta in Theta0”.

If you are going to define p-values for these general hypotheses you have to be precise. The definitions p = P(T > t; under H) or p = P(T > t | H) are not well defined. Why not, you may ask? Because there are many measures induced by T, one for each parameter in the null parameter space Theta0 (again: we have only one distribution for T if it is ancillary to Theta0). See the introduction of my paper; if you have any questions I am willing to answer.

Well, that is all I have to say. If you did not get the point, that is my fault for sure…

Best regards,
Alexandre.
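
The point that a composite null gives one tail probability per parameter value can be seen in the binomial example from earlier in the thread. This is a sketch only: the sample size, the observed count, the grid spacing and the choice of the upper tail as "more extreme" are all assumptions made for illustration. Taking the supremum over the null is one common textbook convention, not necessarily the definition the commenter's paper proposes.

```python
from math import comb

def upper_tail(n, t, p):
    """P_p(T >= t) for T ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

n, t_obs = 20, 14  # hypothetical sample size and observed count

# Grid over the composite null [0.4, 0.5] U [0.8, 0.9].
null_grid = ([0.40 + 0.01 * i for i in range(11)]
             + [0.80 + 0.01 * i for i in range(11)])

# One tail probability per parameter value in the null.
tails = {round(p, 2): upper_tail(n, t_obs, p) for p in null_grid}

# One common convention for a composite null: the largest tail over the null.
sup_tail = max(tails.values())

for p in (0.4, 0.5, 0.8, 0.9):
    print(p, tails[p])
```

The tail probability changes a great deal across the null set, so "P(T >= t | H)" is not a single number until a rule for combining the values is specified.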

• Bill,

One last thing: you said that you fully understand what I said, but you disagree. It would be very nice to see what part of

“the statistic t is ancillary to the null parameter space: that is to say: the distribution of the statistic “t” does not depend on the parameters in Theta0.”

you disagree with or think is nonsense. Please explain to me where I am going wrong. Since you fully understand what I said, please use the language of measure theory to put me on the right track here.

I am looking forward to hearing from you,

Best regards,
Alexandre.

• @Andrew

My explanations don’t count? I am looking forward to hearing your comments on them.

All the best,
Alexandre

• OK,

I am still waiting for your comments on the central idea of my comments above. Comments on some marginal excerpts do not promote good and honest debate. I made various comments responding to you and Juho. Let me know where you disagree, if you disagree.

Best,
Alexandre.

• Alexandre:

Sorry, but I really do think that Juho said it all. You prefer to use different notations in different settings, we prefer to use the same notation. As far as I see it, there’s no need for disagreement or debate. Notation is not magic. Different notation works for different cases. See section 12.5 of my book with Jennifer for further discussion of this point (in a completely different example). I appreciate your comments, but I don’t think there’s anything I can possibly say here that will satisfy you. You’ll just have to accept that there are people who successfully do applied statistical modeling who use notation different from what you are comfortable with.

• Andrew,

Thank you for your response now. But remember that: people who interpret p-values as “probabilities of H0 being true or false” also successfully do applied statistical modeling and you fight to change this view. I agree with you, but I think we must interpret by using formal definitions. To me, the root of these misunderstandings is to interpret p-value from informal definitions, namely: P(T>t; under H0) or P(T>t | H0). We cannot precisely interpret p-values from them. We are in the same boat…

Moreover, it is quite important to note that it is not only a notational problem; it is mostly a procedural problem. The definitions of probability and possibility measures are different. In a frequentist problem you simply cannot say “conditional probability”, since the procedure to select measures is possibilistic rather than probabilistic.

• No, I think that when people interpret p-values as “probabilities of H0 being true or false,” they can make serious mistakes (as in the notorious Bem and Kanazawa examples and also many others that are less obviously wrong, as discussed in various times on this blog and elsewhere). So, no, I don’t think they are “successfully doing applied statistical modeling.”

Regarding your last point, feel free to use the word “possibility” for “probability” wherever you’d like!

• Andrew,

I can’t use them interchangeably since they are intrinsically different concepts. Like you, but in another context, I do think that it is a mistake.

Maybe in an organized document I can properly show it. Here, posting on a blog with many guessing interventions, I absolutely cannot do it.

Best,
Alexandre.

• “[T]he p-value is … the probability that future data will be more extreme than current data, conditional on the hypothesis being true.”

Seems to me that’s it in a nutshell. An alternative statement? “The p-value is the fraction of future data (in the limit of infinite samples) expected to be more extreme than the associated test statistic, conditional on the hypothesis being true.”

9. I admit, this sounds like nitpicking to me. The article is describing what scientists and journals actually do, not what they might do if they followed ideal statistical methods. How inaccurate is this as a description of the actual publication process?

When medical researchers report their findings, they need a method for assessing whether their result is a real effect of what they are testing, or just a random occurrence. To figure this out, they most commonly use the p-value…. By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, are too likely to have arisen only by chance to be a result reliable enough to publish.

• RJB:

I just don’t see the point of the Times running a wrong description of a statistical method. There’s so much actual science news to report, why waste the space on this? Again, as I wrote in an earlier comment, if he wanted to be a sociologist and describe what (some) scientists actually do, that’s fine, but then state it directly. I’m not arguing against the general idea of simplification in popular science writing, I just don’t think it was done well here. A statement such as “By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were probably due only to chance” is at best close to meaningless and at worst simply wrong.

10. I suggest we should talk about value-p as opposed to p-values, where value-p is shorthand for the value of a registered research protocol.

Without value-p p-values are meaningless.

11. As a non-statistician, I can somewhat follow most of the arguments here, but there are too many big words. Really understanding something is whittling an issue down to smaller words.

So here’s how I understand p-value. It just says that IF the null hypothesis is true, then the results would have come up 100p% of the time. You get a p-value of 2%, which means “chance” would have generated the same result 1 in 50 times. Depending on the experiment, that may be a low enough probability to reject chance creating the results.

To put what the NYT article says in perspective, it’s saying a p-value of 6% means that the outcome is likely due to chance because it didn’t meet the significance level. Putting aside being pedantic about statistical theory, this is downright wrong. A p-value of 6% very likely means the results were NOT due to chance. For example, if you only compared a handful of male and female heights and the alternative hypothesis is that males are taller than females, you could very likely get a small p-value but not a significant one. That does not mean male heights were generally higher than female heights by chance.

Basically, it’s a very important thing to put tight boundaries around what experiments do and don’t say. They can generally just say that it’s unlikely that the results were by chance because if it were by chance, the results would happen less than 1 in 20 times. If the results are more likely than that under chance, then we’re not sure. What you don’t want to do, and what the paper says you should do, is say the alternative hypothesis is rejected. It’s not. It could very well still be true, as in the male and female height example.

• It’s not quite this. If your p-value is 0.02, then chance would have generated this result, *or a result even farther from 0*, exactly 2% of the time. (This assumes, of course, that the hypothesis you’re testing is that some value is different from zero.)
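The difference between "the same result" and "this result or one even farther out" can be made concrete with an exact binomial calculation. This is a minimal sketch; the fair-coin null and the figures (15 heads in 20 flips) are hypothetical choices for illustration, not taken from the thread.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_flips, observed = 20, 15  # hypothetical data

# "Chance would have generated the same result": a single point.
prob_exactly = binom_pmf(observed, n_flips)

# "This result or one even farther out": the upper tail.
p_one_sided = sum(binom_pmf(k, n_flips) for k in range(observed, n_flips + 1))

# Two-sided version: at least as far from n/2 in either direction.
p_two_sided = sum(
    binom_pmf(k, n_flips)
    for k in range(n_flips + 1)
    if abs(2 * k - n_flips) >= abs(2 * observed - n_flips)
)

print(f"P(exactly {observed} heads)  = {prob_exactly:.4f}")
print(f"P(at least {observed} heads) = {p_one_sided:.4f}")
print(f"two-sided p-value        = {p_two_sided:.4f}")
```

Here the point probability is about 0.015 while the one-sided tail is about 0.021, so the two readings give different numbers even in a small example.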

12. The confusion is endemic. It was made a while ago in an article published in a British sociology journal that purported to tell us about the wonders of significance testing (yes, things really are that bad in some disciplines). When I submitted a short note to the journal politely pointing out the error, it was rejected on the grounds that though what I claimed might be true it was disrespectful to the author to point it out and in any case was a trivial point of no consequence. Sometimes one despairs…

13. Did someone say “likelihood-ratio test”?

14. Entsophy:

I agree. In fact I have suggested before that we use

B(A) for belief in A and
F(A) for frequency probability.

Andrew: It’s not notation I am concerned about. It’s the logic.
If you think of a p-value as the prob of the statistic conditional on H0 then it is
confusing. This is at the root of a lot of confusion. It is not a conditional probability.

• I agree completely with Larry and Mayo. It’s not just notation, it gets to the root of what one takes as their meaning of probability. Personally, I align with Popper, I can’t even comprehend what one means when they talk about the probability that a hypothesis (or a model, or a parameter) is true.

• Other people have fewer qualms about assigning probabilities to hypotheses. For instance, there are prediction markets that allow you to place bets on your beliefs, such as the belief that Federer will win the next Wimbledon final (Intrade is the most popular, but apparently it has just shut down) — see http://en.wikipedia.org/wiki/Prediction_market.

If it bothers you to think about the probability of a hypothesis being true, perhaps you can think of it as the credibility or plausibility of a hypothesis (Richard Morey just pointed this out to me). In law, for instance, guilt must be proven beyond reasonable doubt — this is another example of how people apparently have no qualms assigning probability/credibility to a hypothesis (i.e., guilty vs. innocent).

Actually, Andrew, perhaps law is a good example of null hypotheses that are exactly true; in many court cases, the defendant is either guilty or innocent. The argument that everybody is at least a little guilty does not seem very plausible (other than in a religious context I guess :-))

• B(A) for Bayesian probability and
F(A) for frequency probability.

I really like that suggestion. I may start using that notation.

• Although Larry, I might add the following amusing note:

if H=”heads” and f=”number of heads in next 50 coin flips” then I suspect from the way you worded this that you consider:

“F(H) is approximately .5”

as objectively true but,

“B(f) is sharply peaked about .5”

as merely a subjective belief. If so, that’s just too funny.

15. Andrew said: “the old, old error of confusing p(A|B) with p(B|A)…[is] in just about every introductory statistics textbook ever written”
That’s the real tragedy, not its publication in the NYT. If our statistics textbooks can’t get it right, what hope is there for journalists or even professional researchers — all of whom, after all, took that introductory statistics course at an impressionable age.

• I *think* what he meant is that the explanation of the error is in just about every stats textbook ever written (since he didn’t have time to write up an explanation).

• I’m also pretty sure that IS what Andrew meant, and in any case, take a look at almost any introductory statistics textbook. Many of them are marketed to instructors with the explicit promise that they provide easy-to-teach, mindless “cookbooks” for statistical “analysis.” Students love it: none of that pesky ambiguity or thinking required, and some instructors love it too. These are the ones that teach that if p < .05 then you have a finding, otherwise not, simple as that. This kind of teaching is the root of much later evil.

• Well I stand by the fact that Andrew’s sentence was ambiguous. The “it” could refer to either the error or the explanation of the error. The textbooks I used in undergrad, grad school and in teaching have all had the correct interpretation of the pvalue, which is why I had my interpretation. I’m surprised and disheartened to hear that there are so many bad books out there!

• I agree it was ambiguous. I was just expressing agreement with Interpretation I (or II if you prefer). For a brief history of getting it wrong in statistics textbooks, and a very incomplete list of examples, see
http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf

The textbook examples are on page 4.

• This is a great article. Thanks for sharing it. I loved this part:

“The Bayesian posterior probabilities form the id of this hybrid logic. These probabilities of hypotheses are censored by both the frequentist superego and the pragmatic ego. However, they are exactly what the Bayesian id wants, and it gets its way by wishful thinking and blocking the intellect from understanding what a level of significance really is. … The analogy brings the anxiety and guilt, the compulsive behavior, and the intellectual blindness associated with the hybrid logic into the foreground.”

(I’m quoting out of context; it’s not anti-Bayesian at all.)

• I hate to admit it, but Hilary’s interpretation is indeed what I meant! What I was saying was that just about every intro stat book makes a big deal about the p-value being conditional on the null hypothesis, and it not being the probability that the null hypothesis is true.

That said, just about every statistics book says this because it’s a tricky point. As I wrote in my Epidemiology article, the formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations).

That is, people (including the NYT writer) get this wrong all the time because the correct interpretation does not typically answer the questions that people want or should want to be asking.

• When all you professional statisticians are having such a whale of a time debating what a p-value truly is; what chance does a popular science writer stand at getting it right?

• Rahul:

I’m not having “a whale of a time” here. I find it a bit frustrating. In any case, any scientific field, whether it be statistics, chemistry, biology, or whatever, can be studied at any level of depth. Popular science writers should find the level of depth appropriate for themselves and their audiences and, at that level of depth, not make mistakes. You can look at my own New York Times columns to see examples of popular writing about statistics. It can be done.

16. Let “H_0: theta \in \Theta_0” be our null hypothesis and T a positive test statistic (the greater, the more evidence against H_0).

Many statisticians define the p-value as:

p = P(T> t; under H_0).

It is not a conditional probability; this way of defining p-values is very confusing. Here, “under H_0” means a family of probability measures induced by T that are indexed by theta \in \Theta_0.
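One common way to spell out “under H_0” for a family of measures, offered here as a sketch and not necessarily the definition used in the paper cited below, is to take the least favorable member of the family indexed by \Theta_0. The supremum is an assumption of this sketch; for a simple null with a single \theta_0 it reduces to one tail probability.

```latex
p(t) \;=\; \sup_{\theta \in \Theta_0} P_\theta(T > t),
\qquad
\text{and for } \Theta_0 = \{\theta_0\}: \quad p(t) = P_{\theta_0}(T > t).
```

Written this way, no probability is ever assigned to H_0 itself, so nothing is being conditioned on in the Kolmogorov sense.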

I recommend my recently accepted paper, which discusses these issues:

1. Definition of p-values (on pages 2-3)
2. Problems of p-values (on pages 4-7)
3. New measure of evidence (on pages 7-9)
4. Properties of this new measure (on pages 10-13)
5. Connections with p-values (on page 13)
6. Connections with the abstract belief calculus (on pages 14-18)
7. Open problems.

http://arxiv.org/pdf/1201.0400v3.pdf

• Your paper says it is based on the framework by Darwiche and Ginsberg, which I hadn’t heard of before (and I doubt many others have). I had a quick look at their paper, and do not understand why they think making probabilities non-quantitative is a step forward (at least not if the intention is to develop a normative system for reasoning, rather than to provide a description of reasoning in biological brains). The argument they quote (from a paper on expert system construction) to support this is just that quantitative estimates of human degrees of belief are unreliable – this certainly has practical implications for expert system construction, but I don’t see how beliefs held by actual humans are relevant for a normative theory of reasoning (this is why Jaynes framed his development in terms of a robot – it neatly prevents this issue from appearing relevant at any point).

Is the Darwiche-Ginsberg framework a key part of your proposal?

No, it is not a key part of my proposal. I just showed a connection between my proposal and the abstract belief calculus (the ABC is a unified way of modeling uncertainty: probabilities, possibilities and plausibility measures are special cases).

My proposal is a possibility measure over the parameter space (while the p-value is not even a plausibility measure). With this measure we can provide “objective states of belief” for subsets of Theta. It ranks subsets of Theta that are more plausible than others.

• In that case I think it would help to translate it into a more commonly used language, so the rest of us can read it without extra effort. As it stands, I don’t know what you mean by “probabilities” (Bayesian or frequentist?), “possibilities” (is this Zadeh’s framework? if yes, is familiarity and/or agreement with it a requirement for following your work?) and “plausibilities”. I cannot make a decision about whether I am even in a position to read your work until I know what the technical prerequisites are.

Also, it sounds like your proposal is more Bayesian than frequentist in flavour – in that case, it may be worth arguing for advantages compared to the Bayesian approach rather than compared to the frequentist approach (I don’t think the fact that p-values are not plausibility measures is likely to be seen as a problem by frequentists in the first place – so your sales pitch should be directed at a Bayesian audience).

Let Omega be a non-empty set and A and B be subsets of Omega.

The usual definition of a probability measure P (endowed only with finite additivity) is such that:
1. P(Omega) = 1
2. P(Empty) = 0
3. If A and B are disjoint, then P(A U B) = P(A) + P(B)

This is standard for both Bayesians and frequentists; of course the interpretation changes, but the definition is the very same.

The usual definition of a possibility measure P is such that:
1. P(Omega) = 1
2. P(Empty) = 0
3. For any A and B, P(A U B) = max(P(A), P(B))

The usual definition of a plausibility measure P is such that:
1. P(Omega) = 1
2. P(Empty) = 0
3. For any A subset of B, P(A) <= P(B).

Notice that probability and possibility measures are also plausibility measures. That is, the definition of plausibility measure is less restrictive than the first two measures. It is easy to show that p-values are not even plausibility measures (see examples 1.1 and 1.2 of my paper). However, the usual informal definition feeds many controversies and a formal definition is required (see pages 2-3 for this formal definition based on induced measures).
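The three definitions above can be checked by brute force on a small finite set. The following is a minimal sketch with hypothetical weights on a three-element Omega; it only illustrates that an additive measure and a max-based measure both satisfy the weaker plausibility axioms, and says nothing about p-values themselves.

```python
from itertools import combinations

OMEGA = frozenset({"a", "b", "c"})
TOL = 1e-12  # tolerance for floating-point comparisons

def all_subsets(omega):
    items = sorted(omega)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def probability_measure(weights):
    """P(A) = sum of point weights (finitely additive)."""
    return lambda A: sum(weights[x] for x in A)

def possibility_measure(levels):
    """P(A) = largest point level in A, with P(Empty) = 0."""
    return lambda A: max((levels[x] for x in A), default=0.0)

def is_plausibility(P, omega=OMEGA):
    subsets = all_subsets(omega)
    normalized = abs(P(omega) - 1) < TOL and abs(P(frozenset())) < TOL
    monotone = all(P(A) <= P(B) + TOL
                   for A in subsets for B in subsets if A <= B)
    return normalized and monotone

def is_additive(P, omega=OMEGA):
    subsets = all_subsets(omega)
    return all(abs(P(A | B) - (P(A) + P(B))) < TOL
               for A in subsets for B in subsets if not (A & B))

def is_maxitive(P, omega=OMEGA):
    subsets = all_subsets(omega)
    return all(abs(P(A | B) - max(P(A), P(B))) < TOL
               for A in subsets for B in subsets)

# Hypothetical example measures.
prob = probability_measure({"a": 0.5, "b": 0.3, "c": 0.2})
poss = possibility_measure({"a": 1.0, "b": 0.6, "c": 0.2})

print("probability is a plausibility measure:", is_plausibility(prob))
print("possibility is a plausibility measure:", is_plausibility(poss))
print("possibility is additive:", is_additive(poss))
```

Both example measures pass the monotonicity check, while only the first is additive and only the second satisfies the max rule.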

My proposal is not Bayesian at all, since there are no prior probability distributions, either over Theta or for the null hypothesis H_0. All we have are 1) a family of probability measures for the data (that are represented by the likelihood function) and 2) null hypotheses of the type "H_0: theta in Theta_0".

Read the introduction of my paper, it just requires statistical knowledge. The connection with the ABC is made at the end of the paper.

• Thanks, Alexandre. I think that paper is quite helpful, though I should add that I’m a big fan of Royall (1997) so I’m naturally predisposed to like it….

• Jonathan,

Nice to know you find it quite helpful!

There is a theoretical justification based on desiderata for this measure of evidence (and other types of uncertainty measures). This work has been submitted. When it is accepted I can let you know, if you are interested.

Best,
Alexandre.

17. In the parametric frequentist context the parameter theta is an indexer of probabilities. That is, a family of all possible measures \mathcal{P} can be indexed by a finite-dimensional vector \theta \in \Theta, say:

\mathcal{P} = \{ P_\theta; \theta \in \Theta \}

If \Theta = \{1,2,3\}, then \mathcal{P} = \{ P_1, P_2, P_3 \}. In this case we have three possible measures to explain the observed data, and we can test whether:

H_0: “Among all measures in \mathcal{P}, P_1 better explains the data”.

or, equivalently and shortly,

“H_0: theta = 1”.

For a classical (read frequentist) statistician, \Theta is just a set of probability indexes.

18. I like what Cohen had to say on the matter:

“Of course, everyone knows that failure to reject the Fisherian null hypothesis does not warrant the conclusion that it is true. … Yet how often do we read in the discussion and conclusions of articles now appearing in our most prestigious journals that ‘there is no difference’ or ‘no relationship’?”

–Things I Have Learned (So Far) http://www.personal.kent.edu/~dfresco/CRM_Readings/Cohen_1990.pdf

19. @Jordan

This is a controversial subject, and few people try to think it through considering both philosophical and technical issues. Many just repeat what an icon has said, and typically the followers, even well-trained statisticians, do not understand what the statistical quantities really do; they seem to be dogmatically trained to answer questions of this type. In my paper (http://arxiv.org/pdf/1201.0400v3.pdf) I also discuss the issue of accepting or rejecting a hypothesis. My proposed measure allows three types of conclusions:

“Evidence to Reject H_0 for some threshold”, “Evidence to Accept H_0 for some other threshold” or “Inconclusive, more data is needed” (these types of conclusions are related to features of possibility measures and the abstract belief calculus).

Best,
Alexandre

20. Thanks to Larry W and Konrad (maybe others, didn’t read them all) on p-values not being conditional probabilities. The “;” is typically used, but it’s not the notation that matters, it’s the understanding, and there’s a lot of confusion here, surprisingly. Won’t be able to follow up on this just now.

• Correction: what I said was that p-values are not conditional probabilities _in the frequentist framework_. They are of course conditional probabilities in the Bayesian framework.

• >Thanks to Larry W and Konrad (maybe others, didn’t read them all) on p-values not being conditional probabilities.

For better or worse, the internet permits one the opportunity to put forward impertinent as well as potentially foolish questions. I will indulge in that opportunity…

How is a p-value anything but a conditional probability? Perhaps it is not a “Conditional Probability” but it is most certainly a conditional probability. I make that statement without a deep understanding of underlying theory of statistics but with a lot of experience making decisions based on fitting models to data and being able to forecast error rates accurately based on analysis of fit results.

Here’s a typical example: I have two signal hypotheses, H0 (null) and H1 (signal of interest present). For each measurement I must decide H1 or ~H1. (~H1 is probably H0 but it could be something else. The details of ~H1 don’t matter.) Each hypothesis has an associated signal model. I fit each model to each n-dimensional observation. I calculate the sum-of-squared-residuals for each fit, RSS0 and RSS1, respectively. I calculate an F-value based on the number of regressors (fit parameters) in each model and the dimensionality of the data. With the F-value and RSS1 in hand I can make the decision.

Deciding H1 or ~H1 is a two step process. Step 1 = reject or not reject H0 based on the F-value. (If I reject H0 then I move on to Step 2, if not then I decide ~H1 and I’m done.) How do I make that decision? I determine the F-value corresponding to p_crit; p_crit could be 0.05 or higher or lower. Where does that F-value, F_crit, come from? Two options: Option 1 is that I calculate F_crit from first principles based on the presumption that H0 is true, i.e., I take it from an F-distribution with the appropriate numbers of degrees of freedom and a specified value of p_crit. Option 2 is that I look at the actual distribution of F-values for samples where I know that H0 is true and I determine F_crit empirically. Either way, I’m determining the threshold F-value for rejecting H0 from distributions where H0 is presumed to be true.
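Option 2 in Step 1 can be sketched as follows. Everything specific here is an assumption made for illustration and is not from the comment: the H0 model is a constant, the H1 model is a constant plus a linear trend, the noise is Gaussian, and simulated noise stands in for "samples where I know that H0 is true".

```python
import random

def f_statistic(x, y):
    """F-value comparing a constant-only fit (1 regressor) with a
    constant-plus-slope fit (2 regressors) via their residual sums."""
    n = len(x)
    p0, p1 = 1, 2  # number of fit parameters in each model
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    rss0 = sum((yi - y_bar) ** 2 for yi in y)  # residuals of H0 fit
    rss1 = rss0 - sxy ** 2 / sxx               # residuals of H1 fit
    return ((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1))

def simulate_h0_f_values(x, n_sims, rng):
    """F-values for data sets in which H0 holds (pure noise here)."""
    return [f_statistic(x, [rng.gauss(0.0, 1.0) for _ in x])
            for _ in range(n_sims)]

def empirical_f_crit(f_values, p_crit=0.05):
    """Empirical (1 - p_crit) quantile of the H0 F-values."""
    ordered = sorted(f_values)
    index = min(len(ordered) - 1, int((1 - p_crit) * len(ordered)))
    return ordered[index]

rng = random.Random(1)
x = [float(i) for i in range(20)]
f_crit = empirical_f_crit(simulate_h0_f_values(x, 5000, rng))

# Check the threshold on fresh H0 data: the rejection rate should be near p_crit.
fresh = simulate_h0_f_values(x, 5000, rng)
false_alarm_rate = sum(f > f_crit for f in fresh) / len(fresh)
print(f"empirical F_crit: {f_crit:.2f}, false-alarm rate: {false_alarm_rate:.3f}")
```

The threshold is built only from data generated with H0 in force, which is the sense in which it is "determined from distributions where H0 is presumed to be true".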

Moving on… Suppose the F-value for the particular measurement exceeds F_crit? If so then I reject H0 but I don’t necessarily accept H1. Step 2: Having ruled out H0 based on the F-value then I decide H1 or ~H1 based on RSS1. For normally-distributed measurement noise RSS1 will be chi-squared-distributed. (If noise isn’t normally-distributed then I either come up with a different model pdf for RSS1 based on a more appropriate noise model or I determine the pdf empirically based on prior observations.) I decide H1 or ~H1 based on the threshold RSS value which follows from my chosen p_crit and the number of degrees of freedom of the chi-squared distribution. Again, I’m choosing a threshold value which is conditional on the hypothesis being true. In practice, all the thresholds I set are determined from pdfs conditional on particular signal hypotheses being true.

So I repeat my original question: How is a p-value anything but a conditional probability?

PS For what it’s worth, the decision approach above is well-established – see, e.g., Louis Scharf’s work on Matched Subspace Detectors and Adaptive Subspace Detectors.

21. With apologies as I did find this interesting but from Lewis Carroll:

When I use a word,’ Humpty Dumpty said, in rather a scornful tone, ‘it means just what I choose it to mean — neither more nor less.’

22. Holy shit p-values are confusing!

• oh snap — those link bots — I wanted to be “the last word”

23. Pingback: Linkage | An Ergodic Walk

24. I don’t think p-value is that hard to explain to an educated audience, even with a small word count.

A significance test sets up a model, based on the idea that random chance is the reason for the difference we’re seeing. Then we compare what actually happened to this model. If the model says that what actually happened would be unlikely due to chance alone, then we think chance alone doesn’t explain the difference.

The p-value is the probability of getting a difference as large as we saw from random chance alone. If the p-value is low (often we use 0.05 as the cutoff) that means it would be unlikely under the random chance model. Then we think the random chance model is wrong.

• Corey,

This is for you personally, since I think you’re about the only one who would appreciate it.

Random Chance is quite an interesting beast as far as “reasons for the differences we’re seeing go”. In the olden days people had no idea what caused stomach grud and just attributed it to bad spirits, but eventually they discovered E. coli and were able to get a look at the thing. After millennia of dealing with “Random Chance”, though, no one has ever been able to get a good look at it. Maybe the scenario you describe actually isn’t what’s going on at all.

Frequency distributions (or histograms) are very interesting things. They have an extreme non one-to-one nature to them which is best illustrated by Jaynes’s Entropy Concentration Theorem. Almost every frequency distribution which satisfies a given set of constraints looks approximately like the maximum entropy distribution subject to the same constraints. In other words, they overwhelmingly clump close to the maximum entropy distribution.

The chief consequence of this is that frequency distributions (or histograms of data) hide almost all the real causes for the data! The causes of a particular sequence of 20,000 throws of some dice might include practically infinite details about the universe at the time of the tosses, but when you look at the distribution of {1,…,6} outcomes almost every bit of that information is lost through the extreme non one-to-one processing it undergoes on its way to becoming a frequency distribution. About the only information that survives this processing are the constraints which determine the overall shape of the frequency curve.

As a result, once you learn the constraints (or equivalently, understand the shape of the histogram) you can’t learn anything more about the causes. In Jaynes’s famous dice example, he used frequency data from 20,000 throws to learn two or three pieces of information about the dice and was celebrated for it. But once he brought the theoretical entropy down to the actual entropy of the frequency distribution, learning stopped. It’s completely impossible, for example, for Jaynes to have learned anything about Euler’s equations for rigid body motion from that data.

So when you say something could have been due to chance alone, I believe this actually means “we’ve learned as much from the histogram of this data as it’s possible to learn. There surely are lots and lots of interesting things physically going on, but we can’t separate that information out because almost all of those causes would have led to approximately the same histogram.”

That’s also why Randomness can have such different meanings in different contexts. In Finance, when someone says returns are Random, they mean the histogram of return data looks like a Log Normal. For dice, when someone says the throws are Random, they mean the data looks like a Uniform distribution. We say things are non-random when the frequency distribution is the result of constraints we weren’t expecting.

Needless to say, this puts those standard textbook significance tests in a very different light.

• Corey – the problem (as I’m sure you’re aware) is that we don’t think the “random chance model is wrong”, necessarily.

To get these ideas across to a general audience I think you’re better off with signal-to-noise ratios, a.k.a. Z-statistics. If Z is small, we have a signal that is far less than the noise – so the data is consistent with the true parameter being zero. If Z is large, it’s less consistent. Significance thresholds – a useful but not essential idea – are a way of deciding where to put the small/large threshold for Z, such that there’s an alpha*100 percent chance of us crossing the “large” threshold, when the truth is zero. (Similar language can be used to describe confidence intervals.)
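The threshold idea can be checked with a short simulation. This is a sketch under assumptions that are not in the comment: the estimate is a sample mean of Gaussian data with known standard deviation, the test is two-sided, and the sample sizes are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_statistic(sample, sigma):
    """Signal-to-noise ratio: estimate divided by its standard error."""
    n = len(sample)
    estimate = sum(sample) / n
    noise = sigma / sqrt(n)  # standard error of the mean, sigma assumed known
    return estimate / noise

alpha = 0.05
# Two-sided cutoff for "large" |Z| at level alpha.
threshold = NormalDist().inv_cdf(1 - alpha / 2)

rng = random.Random(2)
n_sims, n_obs, sigma = 20_000, 25, 1.0
crossings = 0
for _ in range(n_sims):
    # The truth is zero: data are pure noise around a zero mean.
    sample = [rng.gauss(0.0, sigma) for _ in range(n_obs)]
    if abs(z_statistic(sample, sigma)) > threshold:
        crossings += 1
crossing_rate = crossings / n_sims

print(f"threshold for |Z|: {threshold:.2f}")
print(f"fraction crossing it when the truth is zero: {crossing_rate:.3f}")
```

The crossing rate comes out close to alpha, which is all the threshold is designed to guarantee; it says nothing by itself about how often the truth is zero.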

Yes, you have to define “noise” for this to work, i.e. root mean square error of the estimate. But that’s easier than explaining p.

• Hard or not, I’m afraid I don’t think you have done a very good job of it.

When you say “random chance is the reason for the difference” I think you actually mean something like “there is no genuine causative effect, the only thing going on is randomness”. Maybe that is a rather petty distinction, but your further step of converting from a p-value to a belief about the correctness of the model is precisely the fallacy which has been criticised so many times.

If I pick a coin out of my pocket, toss it 5 times and get 5 heads (p=0.03ish), I am not going to immediately think that it’s biased. Or even that it is likely to be biased. I could say I have a “significant” result, and perhaps “reject the hypothesis that it is a fair coin”, but these are conventional statements with a coded meaning that don’t actually carry the plain English implication that they seem to. Which, in a nutshell, is the problem (at least, part of it).

• I concur with your post but one quibble: “If I pick a coin out of my pocket, toss it 5 times and get 5 heads (p=0.03ish), I am not going to immediately think that it’s biased. Or even that it is likely to be biased.” You get a result which would be a 1 in 32 occurrence for an unbiased coin and you’re not going to immediately suspect that it’s a biased coin? Not that you’d be convinced it’s biased but you wouldn’t at least suspect it? (I’m reminded of the saying, “An intellectual is someone who can listen to the William Tell Overture and not think of the Lone Ranger.”;-) For better or worse, when I encounter a result which is unlikely under Hypothesis A I immediately suspect there’s another explanation.

• Chris:

No, there’s no such thing as a biased coin (except for an extremely bent coin, but you’d notice that, or a 2-headed or 2-tailed coin), and unless you’re a seller of joke items or somesuch, the prior probability of having a 2-headed or 2-tailed coin in my pocket is much, much less than 0.03. So I’m with James on this one.
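The role of the prior can be shown with a short Bayes calculation. This is a simplified sketch: it considers only two hypotheses (a fair coin versus a two-headed coin), and the prior values are hypothetical stand-ins for "much, much less than 0.03".

```python
def posterior_two_headed(prior, n_heads):
    """Posterior probability of a two-headed coin after n_heads heads in a
    row, when the only alternative considered is a fair coin."""
    like_two_headed = 1.0        # a two-headed coin always shows heads
    like_fair = 0.5 ** n_heads   # a fair coin does so with probability 2^-n
    numerator = prior * like_two_headed
    return numerator / (numerator + (1 - prior) * like_fair)

p_value = 0.5 ** 5  # chance of 5 heads in 5 tosses of a fair coin
print(f"p-value for 5 heads out of 5: {p_value:.5f}")

for prior in (1e-6, 1e-4, 1e-2):  # hypothetical prior probabilities
    post = posterior_two_headed(prior, n_heads=5)
    print(f"prior {prior:g} -> posterior {post:.5f}")
```

With a prior of 1 in 10,000, five heads in a row raise the probability of a two-headed coin to only about 0.3%, even though the p-value is about 0.03.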

• When I do coin-tossing in class, the probability that my pocket contains a two-headed and a two-tailed coin is 1!

• If I was bored and looking for something to do, I might consider it worthy of further investigation. But my prior for p(heads) on an arbitrary coin is pretty sharply peaked at 0.5!

• And with good reason. Almost every possible sequence of coin flips for large n will have freq(heads) ~ .5. Since almost any set of effects will lead to that same outcome “freq(heads)~.5”, then coin flip experiments are almost useless for learning about the physics of coin flips.
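The claim that almost every long sequence has a head frequency near .5 can be verified by exact counting. This is a minimal sketch; the tolerance of 5 percentage points and the sequence lengths are arbitrary choices for illustration.

```python
from math import comb

def fraction_near_half(n, tol_percent=5):
    """Fraction of all 2**n head/tail sequences whose head frequency is
    within tol_percent percentage points of 50%."""
    # Integer arithmetic avoids floating-point trouble at the boundaries:
    # |k/n - 1/2| <= tol_percent/100  <=>  |100k - 50n| <= tol_percent * n
    near = sum(comb(n, k) for k in range(n + 1)
               if abs(100 * k - 50 * n) <= tol_percent * n)
    return near / 2 ** n

for n in (10, 100, 1000):
    print(f"n = {n:4d}: fraction of sequences with freq(heads) "
          f"in [0.45, 0.55] = {fraction_near_half(n):.4f}")
```

The fraction climbs toward 1 as n grows, so observing freq(heads) near .5 rules out very few of the possible sequences, whatever produced them.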

Or to put it another way: the big distinction between low P-values and high P-values isn’t that in the former there was “an effect” and the latter was “Random”. Rather the distinction is that in the former case we have a shot at learning something about the effects whereas in the latter it’s pretty much hopeless. An outcome can’t distinguish between competing hypothetical influences if almost every possibility leads to that same outcome.

The moral of this story is that it’s possible to learn a little something from frequency data, but not very much. You can’t even get close to Euler’s equations of rigid body motion by examining the frequency of heads in coin flips, for example. This should give pause to anyone working in the social sciences, life sciences, or big data.

• I could have been briefer with this explanation:

The mapping from the set of effects/influences/causes to an observed frequency distribution is a highly many-to-one mapping. Therefore observing the frequency distribution doesn’t tell you much about the effects/influences/causes present.

• Note to self: Be more careful in picking my analogies.

1) Andrew, I think that’s great you did those experiments. And I won’t ever pose any “biased coin”-based thought experiments again;-)
2) James, good point about the prior. (The peaked prior actually occurred to me shortly after I posted and inspired a, “Doh!”)
3) I got thinking about the physics of tossing coins – lopsided ones, in particular. A Google search turned up a few interesting papers:

“Dynamical Bias in the Coin Toss”
P. Diaconis, S. Holmes, and R. Montgomery
Abstract
We analyze the natural process of flipping a coin which is caught in the hand. We
prove that vigorously-flipped coins are biased to come up the same way they started.
The amount of bias depends on a single parameter, the angle between the normal to
the coin and the angular momentum vector. Measurements of this parameter based
on high-speed photography are reported. For natural flips, the chance of coming up as

“Probability, physics, and the coin toss”
L. Mahadevan and Ee Hou Yong in Physics Today, July 2011.
No abstract to speak of. Key takehome: You can toss a coin so that it appears to be flipping even though it’s only wobbling.

“Dynamics of Coin Tossing is Predictable”
J. Strzalko et al. in Physics Reports, vol. 469, p. 59-92 (2008)
Abstract
The dynamics of the tossed coin can be described by deterministic equations of motion,
but on the other hand it is commonly taken for granted that the toss of a coin is random.
A realistic mechanical model of coin tossing is constructed to examine whether the initial
states leading to heads or tails are distributed uniformly in phase space. We give arguments
supporting the statement that the outcome of the coin tossing is fully determined by
the initial conditions, i.e. no dynamical uncertainties due to the exponential divergence
of initial conditions or fractal basin boundaries occur. We point out that although heads
and tails boundaries in the initial condition space are smooth, the distance of a typical
initial condition from a basin boundary is so small that practically any uncertainty in initial
conditions can lead to the uncertainty of the results of tossing.