## Confusion about continuous probability densities

I had the following email exchange with a reader of Bayesian Data Analysis.

My correspondent wrote: Exercise 1(b) involves evaluating the normal pdf at a single point. But p(Y=y|mu,sigma) = 0 (and is not simply N(y|mu,sigma)), since the normal distribution is continuous. So it seems that part (b) of the exercise is inappropriate.

The solution does actually evaluate the probability as the value of the pdf at the single point, which is wrong. The probabilities should all be 0, so the answer to (b) is undefined.

I replied: The pdf is the probability density function, which for a continuous distribution is defined as the derivative of the cumulative distribution function. The notation in BDA is rigorous but we do not spell out all the details, so I can see how confusion is possible.

My correspondent: I agree that the pdf is the derivative of the cdf. But computing P(a .lt. Y .lt. b) for a continuous distribution (with support on the real line) requires integrating the pdf over the interval [a,b]. For P(Y=y), this means integrating on [y,y], so P(Y=y) = 0. Think about what would happen were this not the case. One could evaluate P(Y=y_i) for all real numbers y_i, and since each P(Y=y_i) > 0, P(a .lt. Y .lt. b) = infinity for all a .lt. b. For discrete distributions, however, P(Y=y_i) can be nonzero.

[I’m using “.lt.” for “less than” to avoid html problems.]

So the probability of drawing exactly y from a normal distribution is 0. The probability of drawing some value in a neighborhood of y, however, is nonzero. Since in this exercise P(Y=y) appears in the denominator of the conditional probability, that probability is undefined. The mistake made in the solution is to substitute N(1|mu, sigma2) for P(Y=1|mu,sigma2), which is incorrect.

Does my argument here make sense? I have seen this mistake many times. Indeed, I once took a final exam (!) where it was made. (I had to state the probability that an individual’s height was, say, 5’0″, with height a normal r.v. I ended up “faking” the answer by evaluating the pdf as done in the solution to the exercise here, even though the probability should have been 0.)

Me: I think you may have simply misread the problem. In this case, theta is a discrete random variable and y is continuous. I am not asking for Pr(y=1), which, as you note, is simply zero. I’m asking for Pr(theta=1|y=1), which is perfectly well defined, even if not in the wikipedia definition. (Wikipedia is fine, but it’s just the product of its contributors, not all of whom know what they’re doing.) If you really want, you could compute the probability that theta=1 given that |y-1| .lt. epsilon, and then take the limit as epsilon approaches 0.
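To make the limiting argument concrete, here is a minimal Python sketch. The setup is an assumption on my part, since the exercise is not restated in this post: theta is 1 or 2 with prior probability 1/2 each, and y given theta is normal with mean theta and standard deviation 2. The function names are hypothetical. The sketch computes Pr(theta=1|y=1) once from the densities and once from the interval probabilities Pr(|y-1| .lt. epsilon), so the two can be compared as epsilon shrinks.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(y, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) at y."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

# Assumed setup (not stated in the post): two-point prior on theta, sigma = 2.
SIGMA = 2.0
PRIOR = {1: 0.5, 2: 0.5}

def posterior_from_density(y):
    """Pr(theta=1 | y) using Bayes' rule with densities p(y | theta)."""
    weights = {t: PRIOR[t] * normal_pdf(y, t, SIGMA) for t in PRIOR}
    return weights[1] / sum(weights.values())

def posterior_from_interval(y, eps):
    """Pr(theta=1 | |Y - y| < eps) using genuine (nonzero) probabilities."""
    weights = {
        t: PRIOR[t] * (normal_cdf(y + eps, t, SIGMA) - normal_cdf(y - eps, t, SIGMA))
        for t in PRIOR
    }
    return weights[1] / sum(weights.values())

if __name__ == "__main__":
    print("density version:", posterior_from_density(1.0))
    for eps in (1.0, 0.1, 0.01, 0.001):
        print("eps =", eps, "interval version:", posterior_from_interval(1.0, eps))
```

Both the numerator and the denominator of the interval version go to zero as epsilon shrinks, but their ratio settles down, which is the sense in which conditioning on y=1 is well defined here.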

My correspondent: It turns out conditioning is a complicated thing. I wanted only to find the solution to your problem and was stopped in my tracks by the fact that P(y=1|theta=1) is 0. Indeed, I still don’t know how to solve the exercise on this account — short of computing the limit, as you suggested. For this reason I think the solution as provided is misleading at best, or wrong at worst.

Me: If you look at the solution, I write p(y=1) etc, not Pr(y=1), so I am in fact referring to probability densities rather than discrete probabilities. I agree that it can be confusing. When writing BDA, I considered whether to write a longer discussion about this point, but I decided that further explanation might just confuse people more than if I didn’t mention it. Maybe that was a mistake…

My correspondent: How you got from Pr(y=1) to p(y=1) is what I don’t understand. (Short of “knowing” that the true solution can be obtained by making this non sequitur substitution.)

Me: p(y=1) is a limiting statement based on Pr(|y-1| .lt. epsilon).
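A small numerical sketch of that limiting statement, in Python. The choice of a standard normal and the function names are assumptions made only for illustration. The interval probability Pr(|y-1| .lt. epsilon) goes to zero with epsilon, while the same probability divided by the interval width 2*epsilon approaches the density at 1.

```python
import math

def normal_pdf(y, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(y, mu=0.0, sigma=1.0):
    """Cumulative distribution function of N(mu, sigma^2) at y."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def interval_prob(center, eps, mu=0.0, sigma=1.0):
    """Pr(|Y - center| < eps): a probability, which shrinks to 0 with eps."""
    return normal_cdf(center + eps, mu, sigma) - normal_cdf(center - eps, mu, sigma)

def density_estimate(center, eps, mu=0.0, sigma=1.0):
    """Probability per unit length on the interval; tends to the pdf as eps -> 0."""
    return interval_prob(center, eps, mu, sigma) / (2.0 * eps)

if __name__ == "__main__":
    print("pdf at 1:", normal_pdf(1.0))
    for eps in (0.5, 0.1, 0.01, 0.001):
        print("eps =", eps,
              "Pr =", interval_prob(1.0, eps),
              "Pr / (2 eps) =", density_estimate(1.0, eps))
```

The density is a rate (probability per unit of y), which is why it can be positive even though the probability of any single point is zero.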

Anyway, if this all confused one person, it might be confusing others too, hence the above blog.

1. Bob Carpenter says:

I think BDA is pretty clear up front that it's intended for people who already know about random variables and densities.

The fact that Pr(X=x) = p_X(x) in the discrete case but not the continuous one is deeply confusing for someone who hasn't studied real analysis.

The intro math stats texts skirt around Kolmogorov's beautiful models of probability spaces, random variables, and densities, because they can't hope to introduce the Riemann-Stieltjes integrals required to unify the discrete and continuous cases.

2. Mike Maltz says:

I would assume that P(h=60") means P(59.5" .lt. h .le. 60.5").

3. derek says:

I don't understand the problem. In the interval from time t to time t+0, my car does not change its position. But as long as you don't read the value "50" on the left of my graph of speeds against time as "50 miles", you are in no danger of thinking my car was fifty miles long at that moment (and therefore occupying a range of locations spread across 50 miles). It was actually occupying as close to zero space as a real object can be. The "50" is "50 miles *per hour*".

4. David C. says:

It seems like DePeri's trouble is more nuanced than not understanding about random variables and densities. There's the (confusing) issue of conditioning on events of probability zero (which seems to be treated fairly well in Wikipedia: http://en.wikipedia.org/wiki/Conditioning_(probab

But I'm not totally sure I follow, since I don't have BDA, and it looks like some of the discussion above got cut off in the copy-and-paste.

5. Louis says:

Great discussion.
Last week I had to explain a very similar issue to a student and it took quite some time. This is an issue which comes naturally if you have studied a bit of statistics and probability theory. Making this accessible on the spot proved to be less easy (similarly to the discussion above).

I was looking for a resource but I haven't found one yet which explains this the way I would like to. Perhaps in your next edition?

[Not that I want to be lazy and not write course materials myself, but it can be really tricky to write correctly, comprehensively and understandably about this. ]

6. Bob Carpenter says:

I agree that it's a subtle point and hard to understand at first and almost impossible to explain to someone who doesn't know calculus. But if you don't understand continuous conditional densities, you can't hope to understand the continuous case of Bayes's rule, which is what we need to do Bayesian inference with continuous parameters.

One thing that's confusing in BDA is the use of the same symbol for both random variables and bound variables. It's easiest to see in Appendix A with the overview of distributions (p. 586). For normals, they use x as both a random variable and bound variable in the same line, writing x ~ Norm(0,1) and p(x) = Norm(x|0,1). A careful probability theory text would write the random variable as X and bound variable as x, as in X ~ Norm(0,1) and p_X(x) = Norm(x|0,1).

While their notation is usually easy enough to sort out in context, it makes writing event probabilities involving random variables tricky, because the usual Pr(X=x) would look like Pr(x=x) if we conflate random variable X and bound variable x.

7. jimmy says:

regarding notation, i recall a lecture where, as an aside, the instructor complained about expectation notation. explaining, he remarked how although it was great if you knew what you were doing, it was just awful notation for a beginner. and then compared that to unix manuals.

8. Bob Carpenter says:

Expectation notation is great if you stick to using it as it was intended, for expected values of functions of random variables. What I had trouble with was getting through my head that the probability space and random variables were fixed once and for all in any discussion not involving the foundations of probability theory (where the object of investigation is the behavior of probability spaces).

Unfortunately, authors are prone to start decorating expectations with densities and then all bets are off as to what they mean. This is often coupled with broken ideas like "taking samples from a random variable" (should be multiple random variables that are i.i.d.).

The problem the original correspondent had was with conditionals, which is also where expectations start to get tricky. I still remember trying to get my head around E[E[X|Y]] = E[X] (the derivation is shown on the Wikipedia page law of total expectation).
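For readers who want to check E[E[X|Y]] = E[X] by hand, here is a minimal Python sketch on a small discrete joint distribution. The joint pmf and the function names are made up for illustration. Exact fractions are used so the two sides can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y); keys are (x, y), values are Pr(X=x, Y=y).
JOINT = {
    (0, 0): Fraction(1, 8),
    (1, 0): Fraction(3, 8),
    (0, 1): Fraction(1, 4),
    (2, 1): Fraction(1, 4),
}

def marginal_y(joint):
    """Pr(Y=y), obtained by summing the joint pmf over x."""
    out = {}
    for (x, y), p in joint.items():
        out[y] = out.get(y, Fraction(0)) + p
    return out

def conditional_expectation(joint):
    """E[X | Y=y] for each y, i.e. the sum over x of x * Pr(X=x | Y=y)."""
    p_y = marginal_y(joint)
    out = {}
    for (x, y), p in joint.items():
        out[y] = out.get(y, Fraction(0)) + x * p / p_y[y]
    return out

def expectation_x(joint):
    """E[X] computed directly from the joint pmf."""
    return sum(x * p for (x, y), p in joint.items())

def iterated_expectation(joint):
    """E[E[X|Y]]: average the conditional expectations over Pr(Y=y)."""
    p_y = marginal_y(joint)
    cond = conditional_expectation(joint)
    return sum(cond[y] * p_y[y] for y in p_y)

if __name__ == "__main__":
    print("E[X | Y=y]:", conditional_expectation(JOINT))
    print("E[X]:", expectation_x(JOINT))
    print("E[E[X|Y]]:", iterated_expectation(JOINT))
```

The inner expectation is a function of Y, and the outer expectation averages that function over the distribution of Y, which is the step that tends to get lost in the notation.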

Like everything else, BDA uses expectation notation consistently, if Unix-man-page like. I love the Unix man page analogy — it's the canonical example of great-for-expert and terrible-for-novice documentation.

9. K? O'Rourke says:

Even experts can "over look" this issue.

I don't think anyone would doubt Radford Neal's technical expertise but from his blog

19. Radford Neal | August 26, 2008 at 2:47 pm
[annoying questioner] if one took account of observations being actually discrete rather than truly continuous and replaced the density with the integral from obs – e to obs + e the inconsistency would go away – if e was big enough?

[Radford] Yes. In that case, the data space would be finite (we can ignore the infinity in the big direction), and with enough data, the probability of each of these possible data values would be very well estimated. Values for the single model parameter map to distributions over these finite number of data values, in a continuous and probably one-to-one fashion, allowing the parameter to be well estimated by maximum likelihood.