## Abuse of expectation notation

I’ve been reading a lot of statistical and computational literature and it seems like expectation notation is abused as shorthand for integrals by decorating the expectation symbol with a subscripted distribution like so:

$\displaystyle \mathbb{E}_{p(\theta)}[f(\theta)] =_{\textrm{def}} \int f(\theta) \cdot p(\theta) \, \textrm{d}\theta.$

This is super confusing, because expectations are properly defined as functions of random variables.

$\displaystyle \mathbb{E}[f(A)] = \int f(a) \cdot p_A(a) \, \textrm{d}a.$

For example, the square bracket convention arises because random variables are functions and square brackets are conventionally used for functionals (that is, functions that apply to functions).

### Expectation is an operator

With the proper notation, expectation is a linear operator on random variables, $\mathbb{E}: (\Omega \rightarrow \mathbb{R}) \rightarrow \mathbb{R}$, where $\Omega$ is the sample space and $\Omega \rightarrow \mathbb{R}$ the type of a random variable. In the abused notation, expectation is not an operator because there’s no argument, just an expression $f(\theta)$ with an unbound variable $\theta.$
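To make the operator view concrete, here is a minimal Python sketch on a finite sample space. The coin-flip sample space and the names `OMEGA`, `PROB`, `A`, and `B` are assumptions chosen purely for illustration. Random variables are ordinary functions on outcomes, and `expectation` is a higher-order function that takes such a function and returns a number.

```python
from fractions import Fraction

# Hypothetical finite sample space: outcomes of two fair coin flips.
OMEGA = ["HH", "HT", "TH", "TT"]
PROB = {w: Fraction(1, 4) for w in OMEGA}  # probability measure on OMEGA


def expectation(rv):
    """E : (Omega -> R) -> R, a probability-weighted sum over outcomes."""
    return sum(rv(w) * PROB[w] for w in OMEGA)


# Random variables are plain functions from outcomes to numbers.
def A(w):
    return w.count("H")  # number of heads


def B(w):
    return 1 if w[0] == "H" else 0  # indicator that the first flip is heads


def A_plus_B(w):
    return A(w) + B(w)  # pointwise sum, itself a random variable


print(expectation(A))         # 1
print(expectation(B))         # 1/2
print(expectation(A_plus_B))  # 3/2, matching E[A] + E[B]
```

Because the argument of `expectation` is a function, there is no unbound variable anywhere, and linearity is a statement about the operator itself.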

In this post (and last week’s), I’ve been following standard notational conventions where capital letters like $A$ are random variables and the corresponding lower-case letters are used as bound variables. Then, rather than using $p(\cdots)$ for every density, densities are subscripted with the random variables from which they were derived, so the density of random variable $A$ is written $p_A$.

### Bayesian Data Analysis notation

Gelman et al.’s Bayesian Data Analysis book overloads notation using lower case $a$ for both $A$ and $a$. This requires the reader to do a lot of sleuthing to figure out which variables are random and which are bound. It led to no end of confusion for me when I was first learning this material. It turns out disambiguating a dense formula with ambiguous notation is easier when you already understand the result.

The overloaded notation from Bayesian Data Analysis is fine in most applied modeling work, but it makes it awkward to talk about random variables and bound variables simultaneously. For example, on page 20 of the third edition, Gelman et al. write (using $\textrm{E}$ for the expectation symbol, round parens instead of brackets, and an italic derivative symbol),

$\displaystyle \textrm{E}(u) = \int u p(u) du.$

Here, the $u$ in $\textrm{E}(u)$ is understood as a random variable and the other $u$ as bound variables. It’s even worse with the covariance definition,

$\displaystyle \textrm{var}(u) = \int (u - \textrm{E}(u))(u - \textrm{E}(u))^{T} p(u) du,$

where the $u$ in $\textrm{var}(u)$ and $\textrm{E}(u)$ are random variables, whereas the remaining uses are bound variables.

Using more explicit notation which distinguishes random and bound variables, includes the multiplication operators, specifies range of integration, disambiguates the density symbol, and sets the derivative symbol in roman rather than italics, these become

$\displaystyle \mathbb{E}[U] = \int_{\mathbb{R}^N} u \cdot p_U(u) \, \textrm{d}u.$

$\displaystyle \textrm{var}[U] = \int_{\mathbb{R}^N} (u - \mathbb{E}[U]) \cdot (u - \mathbb{E}[U])^{\top} \cdot p_U(u) \, \textrm{d}u.$

This lets us write the variance clearly as an expectation,

$\textrm{var}[U] = \mathbb{E}[(U - \mathbb{E}[U]) \cdot (U - \mathbb{E}[U])^{\top} ],$

which would look as follows in Bayesian Data Analysis notation,

$\textrm{var}(u) = \textrm{E}((u - \textrm{E}(u))(u - \textrm{E}(u))^T)$
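The identity $\textrm{var}[U] = \mathbb{E}[(U - \mathbb{E}[U]) \cdot (U - \mathbb{E}[U])^{\top}]$ can be checked numerically. The following is a minimal sketch, assuming a small made-up discrete distribution for a 2-vector $U$; the `OUTCOMES` table is hypothetical and a sum stands in for the integral.

```python
from fractions import Fraction

# Hypothetical discrete distribution for a 2-vector random variable U:
# each (value, probability) pair is one possible value u with mass p_U(u).
OUTCOMES = [
    ((0, 0), Fraction(1, 4)),
    ((1, 0), Fraction(1, 4)),
    ((1, 1), Fraction(1, 2)),
]


def expectation(g):
    """E[g(U)] for a scalar-valued function g of the bound variable u."""
    return sum(g(u) * p for u, p in OUTCOMES)


# E[U], computed one component at a time.
mean = [expectation(lambda u, i=i: u[i]) for i in range(2)]

# var[U] = E[(U - E[U]) (U - E[U])^T], computed one matrix entry at a time.
cov = [
    [
        expectation(lambda u, i=i, j=j: (u[i] - mean[i]) * (u[j] - mean[j]))
        for j in range(2)
    ]
    for i in range(2)
]

print(mean)
print(cov)
```

Inside each `lambda`, `u` plays the role of the bound variable, while `mean` holds the already-evaluated constant $\mathbb{E}[U]$, which is exactly the distinction the overloaded notation hides.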

### Conditional expectations and posteriors

The problem’s even more prevalent with posteriors or other conditional expectations, which I often see written using notation

$\displaystyle \mathbb{E}_{p(\theta \, \mid \, y)}[f(\theta)]$

for what I would write using conditional expectation notation as

$\displaystyle \mathbb{E}[f(\Theta) \mid Y = y].$

As before, this uses random variable notation inside the expectation and bound variable notation outside, with $Y = y$ indicating the random variable $Y$ takes on the value $y$.
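As a rough illustration of $\mathbb{E}[f(\Theta) \mid Y = y]$, here is a sketch that assumes a tiny hypothetical joint pmf over $(\Theta, Y)$ in place of a continuous density. The table `JOINT` and the function `f` are made up for the example.

```python
from fractions import Fraction

# Hypothetical joint pmf p_{Theta,Y}(theta, y) on a 2 x 2 grid.
JOINT = {
    (0, 0): Fraction(1, 8),
    (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
}


def conditional_expectation(f, y):
    """E[f(Theta) | Y = y] = sum over theta of f(theta) * p(theta | y)."""
    # Marginal p_Y(y), obtained by summing the joint over theta.
    p_y = sum(p for (_, y_), p in JOINT.items() if y_ == y)
    # Sum of f(theta) * p(theta, y), then normalize by p_Y(y).
    total = sum(f(theta) * p for (theta, y_), p in JOINT.items() if y_ == y)
    return total / p_y


def f(theta):
    return 10 * theta + 1


print(conditional_expectation(f, 0))  # 23/3
print(conditional_expectation(f, 1))  # 5
```

The observed value `y` is an ordinary argument, and `theta` is bound by the sum, mirroring the random-variable-inside, bound-variable-outside convention.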

1. John Hall says:

Fantastic post! This kind of thing annoys me to no end. Sometimes I see an expectation subscripted by time, as in the expectation taken from the perspective of someone at time T, rather than time T+1. Always confuses me.

2. Justin says:

While I appreciate these criticisms, I don’t think this post recognizes the need that usually underlies this notation, and it’s not clear what other solution one would be interested in in that case.

If you want to be very careful, an expectation is a function of *both* a random variable and a probability space. The typical expectation notation suppresses the probability space, which is fine, assuming it is a constant! However, sometimes we’re interested in what happens when that probability space changes. In such cases, we need some way to denote the probability space an expectation depends on.

I’m pretty sure the notation that’s being criticized here can be made precise if desired: Something like assuming the probability spaces in question both admit densities and are the same except for the densities in question? (Of course this is never done.)

How would you feel about, say, the notation E_P[f(X)] where P is a probability space for X?

• somebody says:

I don’t see a reason for subscripting the distribution function. The random variable X implies a distribution, so you could exchange the probability distribution with $E[f(X)] -> E[f(Y)]$.

• Jackson Monroe says:

I’ve always thought of the notation being required when the distribution of the variable having its expectation taken is in question.

E[X] when I know X is standard normal requires no notation of the probability space because X implies a probability space, it’s a random variable after all. E[X] when X might be normal(mu_1, 1) or normal(mu_2, 1) is less clear, and someone might want to reinforce that discrepancy. Someone correct me if I’m completely mistaken, but it seems like the notation responds to a particular uncertainty.

• Yes, random variables are usually used with the probability space implicit. That’s usually because there’s only ever a single one under discussion. The subscripted notation is commonly used with conditional distributions, but the standard probability theory textbook way to do that is to use conditional expectation notation.

No, there’s nothing special about standard normal distributions. The distribution of X will be defined by the random variable plus probability space no matter what its distribution is.

While I see densities used to subscript expectation notation all the time, I don’t recall ever seeing anyone add notation to an expectation for the probability space.

• That’s right—the probability space is always implicit when we use random variable notation.

My complaint wasn’t lack of precision but rather lack of compositionality. In the subscripted notation, the expectation is no longer an operator that applies to a random variable. It’s just notation for an integral.

The need for mixing probability spaces comes in applications like EM or variational inference where it looks like a random variable gets two different distributions. I would argue those aren’t expectations in the usual sense and it would be clearer to just write the integrals down.

P.S. LaTeX still not working in comments.

• Justin says:

Writing expectations allows you to avoid (explicitly) worrying about lots of measure-theoretic details that are unavoidable if you want to use integrals (e.g., what if some of the variables are discrete?). Readers find expectations muuuuch more readable than integrals.

It also seems a strange argument to say that expectations aren’t allowed to have one of their two inputs changed, and that the *correct* rigorous notation is the one that leaves more stuff implicit, but hey — agree to disagree!

• Good point about discrete parameters.

I’m only saying that there’s a conventional definition of expectation as an operator on random variables. Even the Wikipedia page on expectation follows that convention. That convention was established because it makes reasoning about expectations and conditional expectations easier.

I understand how having the probability space in the notation could help clarify when expectations are being defined over multiple spaces and used together. Usually in applications we’re considering one background probability space, so that’s never come up in things I’ve read.

I do see how subscripting densities can expand the notation to allow shorthand for general integrals. It’s not that it’s unclear notation, just that it breaks some of the nice properties of expectations that hold when defined as functions of random variables over a fixed probability space. For instance, consider linearity, E[A + B] = E[A] + E[B]. I’m not sure how people who like to use densities as subscripts write this, maybe E_{p(a, b)}[a + b] = E_{p(a)}[a] + E_{p(b)}[b]. If A and B are defined over different spaces, then A + B doesn’t even make sense.

Also, how does subscripting a density interact with conditional expectations? I don’t know how to interpret E_{p_X(x)}[X = x | Y = y]. Or really even how to write it. Another commenter suggested subscripting with random variables, but I don’t understand how that helps clarify things. Do I just get E_{A,B}[A + B] = E_A[A] + E_B[B]?

The notion of expectation is defined for convenience. There are general properties like linearity that hold of expectations that make them useful to reason about. There’s no absolute truth for concepts defined by cognitive agents, just what’s most convenient (that’s a bold assertion of pragmatism on my part; some philosophers or Platonist scientists may object). For notations to be useful, it helps if they follow standard conventions. The French might say “correct” where the British would say “proper” for an activity carried out according to tradition, but that’s not what I’m trying to get at here.

3. Sean Raleigh says:

Thank you! When I was new to probability and statistics, these kinds of notational issues gave me all sorts of trouble. I appreciate knowing that I’m not the only one who feels dumb when picking up a new book and struggling to work out the meaning of all its symbols.

4. Bob, if only everyone were required to learn something like scheme or lambda calculus before being allowed to do math ;-)

expectation(x) = lambda(f,x) integrate(lambda(x) x * f(x),-inf,inf)

Seriously though, I often wish a computational notation would be used more than standard math notation. Standard math notation is always ambiguous compared to something that is supposed to run on a computer.

• I remember the moment when I first used Mathematica and realized the dx in an integral or a derivative was just a lambda abstraction. Too bad I learned calc before type theory.

• Yeah, I never learned anything about type theory, but I think the notion of lambda abstraction and keeping track of the type of objects is *essential* to understanding how to use applied math. For example “functionals” vs “functions” is just stupidity, that’s giving two different names to the same concept when they apply to different types (this is me doing intuitive type theory I guess).

Also I realize I made a mistake above…

expectation(f) = lambda(f) integrate(lambda(x) x*f(x),-inf,inf)

expectations take objects of type “probability density function” (or if you like “nonstandard probability density function” for general measures) and return numbers.
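The corrected pseudocode above can be sketched in runnable Python. This is only an illustration under stated assumptions: the infinite range is truncated to a finite interval, the integral is approximated with a simple midpoint rule, and `integrate` and `normal_density` are hypothetical helpers written for this example.

```python
import math


def integrate(g, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(g(lo + (k + 0.5) * h) for k in range(n))


def expectation(density, lo=-50.0, hi=50.0):
    """Density-typed expectation: takes a density, returns a number.

    Assumption: the density has negligible mass outside [lo, hi], so the
    truncated integral stands in for the one over (-inf, inf).
    """
    return integrate(lambda x: x * density(x), lo, hi)


def normal_density(mu, sigma):
    """Return the normal(mu, sigma) density as a function of x."""
    def p(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return p


print(expectation(normal_density(1.5, 2.0)))  # approximately 1.5
```

Here the inner `lambda x` is the bound variable of the integral, and the only argument of `expectation` is the density itself.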

Or if the probability is a measure over something complicated, like vectors or functions, then expectation is a different thing…

suppose F is a probability measure over functions of a single variable (like a gaussian process), then sample(F) is a randomly chosen function of a single variable, and sample(F)(x) is that function evaluated at point x.

expectation(F) is a particular consistent function implied by the measure F, but you can’t write it as an integral without using nonstandard analysis… which is why I like nonstandard analysis…. the thing is sample(F)(x) varies randomly each time you call it, but expectation(F)(x) is always the same number.

• The usual definition of expectation is as a function of a random variable. A random variable implies a density, but it’s fundamentally a different operation. Defined this way, the type of the expectation operator is (SampleSpace -> Real) -> Real. Or if you’re dealing with something multivariate, then (SampleSpace -> Real^N) -> Real^N. As commenters have pointed out, the probability space is implicit. If expectation were to be defined as a function of a density, the type would be (Real -> Real+) -> Real. The reason it’s useful to use random variable notation is that you can apply functions to random variables and thus write things like Pr[A > B] or E[f(Theta)]. In terms of types, an ordinary function f of type (Real -> Real) can always be lifted to a function of type (SampleSpace -> Real) -> (SampleSpace -> Real) by applying it elementwise, so that f(A)(x) = f(A(x)).
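The lifting described here can be sketched in a few lines of Python. This assumes a finite sample space (a fair die) so that everything is computable exactly; `lift`, `probability`, and the random variables `A` and `B` are hypothetical names for the example.

```python
from fractions import Fraction

OMEGA = range(1, 7)  # hypothetical sample space: faces of a fair die
PROB = {w: Fraction(1, 6) for w in OMEGA}


def lift(f):
    """Lift f : Real -> Real to (Omega -> Real) -> (Omega -> Real)."""
    def lifted(rv):
        return lambda w: f(rv(w))  # f(A)(w) = f(A(w))
    return lifted


def expectation(rv):
    """E : (Omega -> Real) -> Real."""
    return sum(rv(w) * PROB[w] for w in OMEGA)


def probability(event):
    """Pr[event], where event : Omega -> bool."""
    return sum(PROB[w] for w in OMEGA if event(w))


def A(w):
    return w  # face value


def B(w):
    return 7 - w  # value on the opposite face


square = lift(lambda x: x * x)

print(expectation(square(A)))              # E[A^2] = 91/6
print(probability(lambda w: A(w) > B(w)))  # Pr[A > B] = 1/2
```

The comparison `A > B` is lifted the same way: it becomes a function from outcomes to booleans, which is what lets `Pr[A > B]` be a well-typed expression.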

• Ooh, type theory… Ok, how about this, in frequentist notions of probability I think of the random variable as a function:

(RNG, N, Real^n -> [0,inf) ) -> Real

and an RNG is (N -> [0,1]), a special kind of sequence of real numbers on [0,1]

The point of random variables is that you can sample them, and sampling them is essentially calling them with successive natural numbers (Kolmogorov being an abstract mathematical person liked the “abstract” sample space… I think that was a mistake in terms of the complications it creates for understanding in applications)

However, one RNG is as good as another in abstract (in computing you need to spend effort to get a good one), so we tend to ignore it… leaving us with a pseudo-type for a random variable:

(N , R^n -> [0,inf)) -> Real

However, when you take an expectation, the natural numbers are irrelevant. The expectation is a property computable entirely from the density… so you’d define expectation of a random variable, by first extracting the density portion, and then integrating it… meaning you could think of Expectation as expectation of a random variable:

((N, R^n -> [0,inf)) -> Real ) -> Real

But it depends critically on the “calculate the centroid” function integrate(x*f(x),dx)

Expectation(x) = calculate_centroid(extract_density(x))

• In nonstandard analysis you could think of the sequence

X = {0 + k*dx} for k from 0…K and where dx is infinitesimal and K*dx = 1

You then get an RNG by filtering the permutations of the integers 0…K, choosing only “high complexity” permutations. So I guess permutations are [0..K] -> [0..K] and composing with indexing into X, and taking the standard part st(x).

RNG(i,n) = st(X(permutation(i,[0..K])(n)))

But assuming the i is fixed as one of the high complexity values, the standardization of the set {0,1,…K} for K a nonstandard integer is the natural numbers… so the standardization of an RNG has type N -> [0,1]

• Leon says:

I agree with the sentiment that computational notation (and “word equations”) should be used much more than they are, but I disagree that they should be used more than math notation. Some reasons:

1. Unambiguity and clarity are not the same thing. It’s morally correct to fudge sometimes.

2. Some apparent “fudges” in math can be made completely precise, in sufficiently expressive programming languages. An example is overloading functions to work on lists; in Haskell you can “lift” functions like this using “do notation”, in such a way that within a given “do block” it looks like you’re working with ordinary functions. Similarly, in probability theory we implicitly “lift” functions like “+” and “-” from the type (Number, Number) -> Number to the type ((Ω -> Number), (Ω -> Number)) -> (Ω -> Number). Under the hood this is about “functoriality”, but the broader point is that mathematical “fudges” can run ahead of programming languages’ expressiveness.

3. There are many excellent conventions in mathematical notation that don’t make it into “pure” computational notation. Superscripts and subscripts, for example, are really just ways of visually “demoting” certain function arguments (i.e., denoting partial application). But they’re super useful—I wish they could be used more in programming.

• How about we force every math book to have an appendix with a “glossary” defining their math notation using unambiguous computational notation 😀

• joshua pritikin says:

“I often wish a computational notation would be used more than standard math notation. Standard math notation is always ambiguous compared to something that is supposed to run on a computer.” — Ah, I thought I was the only one who felt this way!

5. Andrew says:

Bob:

Thanks for the explanation. As always, the best notation depends on context. In many settings, notation such as p(x|y) and p(y|x) is perfectly clear, but if we then write something like p(3.2|2.0), this has no meaning at all! You have to write something like p_{x|y}(3.2|2.0) or p(x=3.2|y=2.0). This reminds me of these linguistics paradoxes that I’m sure you’re much more familiar with than I am, where two words are synonyms but it doesn’t mean that they can be used interchangeably in a sentence.

Related to all of this is confusion about replication. I think a big contribution of our 1996 paper on Bayesian predictive checks was the formal introduction of the replication dataset, y^rep. The whole paper’s all about p(y, y^rep, theta) = p(theta) p(y|theta) p(y^rep|theta). Previous work in the field was confused, I think, in part because there was some attempt to use notation such as y and Y beyond what the mathematics could support. When in doubt, make the notation more explicit.

One bit of notation that we used in BDA that I really like was the formal use of distribution names as functions, so that N(.|.,.) is a well-defined mathematical function of three variables. We’ve carried this over into Stan (improving things with the normal by replacing N(y|mu,sigma^2) with normal(y|mu,sigma)). I’ve always found it difficult to explain to students that N(.|.,.) is not just a way of saying that something has a normal distribution; it’s also a function that can be calculated. With Stan it’s easier to explain this. But even with Stan, as you know, there’s confusion, because users don’t always realize that a ~ statement is nothing more or less than adding a term to the objective function.

• As a testament to how much I agree with things being context dependent, I really like the BDA notation for writing models down. So much so that it’s the convention in both the Stan language and documentation. I’ve always seen normal(y | mu, sigma) used as a function; it drives me crazy when I see an applied paper write out a common density function like the normal or Poisson rather than just using the distribution name as a function.

My point is just that it’s hard to learn expectation notation using the BDA notation because it’s so overloaded.

The ~ thing is so easy from a programming language perspective: y ~ foo(theta); is just syntactic sugar for target += foo_lpdf(y | theta);. That was the main insight driving the way I designed the language. I started with a BUGS-like graphical model that needed to be translated to a differentiable density function. If you look at each ~ statement in a BUGS program, you can break out the contribution to the log density as above. Loops just get translated as loops. It was only after coding the prototype and playing around with it that we realized the language could support more general imperative features like local variables, user-defined functions, etc.
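The translation can be mimicked in plain Python. This is only a sketch of the idea, not how Stan is implemented: the model (mu ~ normal(0, 10) and y ~ normal(mu, sigma)) is hypothetical, and `normal_lpdf` here is a hand-written Python helper that keeps all normalizing constants.

```python
import math


def normal_lpdf(y, mu, sigma):
    """Log density of normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def log_density(mu, sigma, ys):
    """Accumulate a log density one sampling statement at a time."""
    target = 0.0
    target += normal_lpdf(mu, 0.0, 10.0)     # mu ~ normal(0, 10);
    for y in ys:                             # loops just translate to loops
        target += normal_lpdf(y, mu, sigma)  # y ~ normal(mu, sigma);
    return target


print(log_density(mu=0.3, sigma=1.2, ys=[0.1, -0.4, 1.7]))
```

Each commented line shows the sampling statement that the accumulation stands in for; the function as a whole is the log density that a sampler or optimizer would evaluate.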

6. E Holmes says:

I work often with the law of total variance and I find it hard to keep track of what the expectation is over unless the subscript is there:
var[A] = var_B[E[A|B]] + E_B[ var[A|B]]
or
var[A] = var_B[E_{A|b}[A|B]] + E_B[ var_{A|b}[A|B]]
is much easier for me to read than
var[A] = var[E[A|B]] + E[var[A|B]]
I don’t see how the subscript hurts especially when working with conditional expectations and variances.

Now E[a] when E[A] is meant, that’s impossibly confusing, esp if you see later E[b|a]. Is that E[b|A=a], E[B|A=a], or E[B|A]? Who knows.

• I used to find these nested conditional expectations super confusing, and they still require some scrutiny no matter how you notate them. That’s why I’m urging people to stick to the standard definition and notation.

The trick to understanding the law of total variance, which says that the variance is the variance of a conditional expectation plus the expectation of conditional variance,

var[A] = var[E[A | B]] + E[var[A | B]],

is that the A is bound by the E[A | B] notation, so that B is the only thing free. Writing it out in symbols as

E[A | B] = INT a * p(a | B) d.a

you wind up with a free B on the right-hand side. The right-hand side makes it clear that the result is a function of the random variable B, and hence a random variable itself. Thus it’s the kind of thing we can take an expectation of. To keep things simpler, suppose we just want to calculate the law of total expectation, E[E[A | B]] = E[A]. We work from the inside out, which we can do because everything’s properly defined as a function,

E[INT a * p(a | B) d.a] = INT (INT a * p(a | b) d.a) * p(b) d.b
= INT INT a * p(a, b) d.b d.a
= INT a * p(a) d.a
= E[A]

This used to confuse me to the point where I’d just set the books down thinking the statisticians were simply really bad at notation.
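Both laws can be checked exactly on a small example. The sketch below assumes a hypothetical joint pmf p(a, b) over a handful of values, with sums in place of the integrals; `cond_expectation` returns a number for each b, so composing it with B gives the random variable E[A | B].

```python
from fractions import Fraction

# Hypothetical joint pmf p(a, b).
JOINT = {
    (0, 0): Fraction(1, 8),
    (2, 0): Fraction(3, 8),
    (1, 1): Fraction(1, 4),
    (5, 1): Fraction(1, 4),
}


def expectation(g):
    """E[g(A, B)] under the joint pmf."""
    return sum(g(a, b) * p for (a, b), p in JOINT.items())


def cond_expectation(g, b):
    """E[g(A) | B = b]: a is bound by the sum, b is the only free variable."""
    p_b = sum(p for (_, b_), p in JOINT.items() if b_ == b)
    return sum(g(a) * p for (a, b_), p in JOINT.items() if b_ == b) / p_b


def cond_mean(b):
    return cond_expectation(lambda a: a, b)


def cond_var(b):
    m = cond_mean(b)
    return cond_expectation(lambda a: (a - m) ** 2, b)


mean_a = expectation(lambda a, b: a)
var_a = expectation(lambda a, b: (a - mean_a) ** 2)

# Law of total expectation: E[E[A | B]] = E[A].
total_mean = expectation(lambda a, b: cond_mean(b))

# Law of total variance: var[A] = var[E[A | B]] + E[var[A | B]].
var_of_cond_mean = expectation(lambda a, b: (cond_mean(b) - total_mean) ** 2)
mean_of_cond_var = expectation(lambda a, b: cond_var(b))

print(mean_a, total_mean)
print(var_a, var_of_cond_mean + mean_of_cond_var)
```

The outer calls to `expectation` ignore `a` and depend only on `b`, which is the code-level counterpart of E[A | B] being a function of the random variable B.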

7. Eli Holmes says:

I use the same notation E[A|B=b] but personally I’m going to stay with subscripts in some situations like
var[A] = var_B[E[A|B]] + E_B[ var[A|B]] or var[A] = var_B[E_{A|b}[A|B]] + E_B[ var_{A|b}[A|B]]
versus var[A] = var[E[A | B]] + E[var[A | B]] (no subscripts).

I think that with the subscripts, it is clearer to readers what is going on. I have spent so many hours trying to sort out what p() is being integrated over when reading others’ write-ups. And when I’m reading, if I see the subscript, I have a pretty good idea of what the intent is, even if I don’t agree with the notation. Sure
E_{f(theta|y)}(f(theta))
is not the best but at least it is clear what the intent is. I don’t know how many times I’ve come across E(a) or E[A] and what is meant is E[A|B=b] implicitly. And then I spend hours and hours sorting out which E[A] are E[A], E[A|B=b] and E[A|B]. Triple-loading E[A] like this is not that uncommon in a single write-up in my experience. Kind of like the $u$ in var() example you gave in your post.

So I’m not disagreeing about notation, just saying that I think subscripts on E have a place for increasing the clarity of a write-up.

8. Carlos Ungil says:

The correct notation for expectations is obviously the use of angle brackets. Using E is unnecessarily confusing.

(Some people say we live in the best of the possible worlds. That’s because they have not tried to write here a comment including mathematical signs.)

9. Anonymous says:

The blog engine doesn’t like me…

I sent a couple of comments yesterday, one appears on the sidebar but is nowhere to be found:

https://statmodeling.stat.columbia.edu/2020/02/05/abuse-of-expectation-notation/#comment-1240443

(The second was a correction to the first, maybe the parent was removed and only the child remains but being orphan it’s no longer displayed.)

• Andrew manages the spam filters and sometimes legit responses get lost. But I don’t think he’s going to want to whitelist the name “Anonymous” :-)

• Carlos Ungil says:

I forgot to enter my name in the comment you reply to, sorry.

By the way my original comments said, only slightly tongue-in-cheek, that using E for expectations is unnecessarily confusing and the correct notation is to use angle brackets. (In my first attempt the blog ate my less-than and greater-than signs, as usual.)

10. Aki Vehtari says:

Thanks Bob,