The typical set and its relevance to Bayesian computation

[Note: The technical discussion w.r.t. Stan is continuing on the Stan forums.]

tl;dr

The typical set (at some level of coverage) is the set of parameter values for which the log density (the target function) is close to its expected value. As has been much discussed, this is not the same as the posterior mode. In a d-dimensional unit normal distribution with a high value of d, the 99% typical set looks like an annulus or sphere or doughnut, corresponding to the points whose distance from the origin is approximately sqrt(d). This intuition is important because it helps understand the challenges of HMC and similar algorithms, which essentially have to cycle around the sphere rather than directly cutting through the center.

The term “typical set” is a standard term in information theory and is more general than the “typical set” concept discussed in Stan (as in the above paragraph). There are some differences between the official definition of typical set and the use of the term in Stan discussions. We can continue to work with the concept of typical set, and it will help for us to clarify the definition when we use the term.

Background

The concept of the “typical set” comes up a lot when we’re discussing Hamiltonian Monte Carlo and Stan. For example, in his case study, “Typical Sets and the Curse of Dimensionality,” Bob Carpenter writes, “Roughly speaking, the typical set is where draws from a given distribution tend to lie. That is, it covers most of the mass of the distribution. This particular choice of volume not only covers most of the mass of a distribution, its members reflect the properties of draws from that distribution.” He also writes that the typical set “is roughly defined as the central log density band into which almost all random draws from that distribution will fall.”

Bob also quotes Bernhard Schölkopf as saying, “a high-dimensional space is a lonely place.” which reminds me of my own line that I use when teaching this stuff: “In high-dimensional space, no one can hear you scream.” The funny thing is, I never saw that movie—I don’t like horror movies—but I remember the slogan.

In his article, “A Conceptual Introduction to Hamiltonian Monte Carlo,” Michael Betancourt refers to “the neighborhood between these two extremes [the neighborhood immediately around the mode, and the complementary neighborhood far away from the mode] known as the typical set” and writes, “In high-dimensional parameter spaces probability mass . . . and hence the dominant contributions to expectations, concentrates in a neighborhood called the typical set.” Michael also writes, “because probability densities and volumes transform oppositely under any reparameterization, the typical set is an invariant object that does not depend on the irrelevant details of any particular choice of parameters,” which I don’t think is correct (see below).

There’s a lot of consensus that the typical set is important. But what exactly is it?

Notation and definition

For consistency and clarity I’ll first define some notation. Assume we have a d-dimensional continuous probability density p(theta) defined on some space Theta, so that the integral over Theta of p(theta) d theta is 1. In Bayesian notation we would write p(theta|y), but here I’m suppressing the conditioning on y. So it’s just a math problem now. I’ll also assume we can compute p(theta) and that we can use Stan to get random draws theta from the probability distribution with density p(theta). And I’ll define the target function, u(theta) = log p(theta), the log density function. Finally, the entropy is defined as H = – E(log(p(theta))) = – integral log(p(theta)) p(theta) d theta.

I’m being a little sloppy here because typically we can only compute p(theta) up to an arbitrary multiplicative constant, so we can only compute log p(theta) up to an arbitrary additive constant, and we usually define the target function only up to an arbitrary additive constant too. But for our purposes here we are only working with relative values of the target function, so for simplicity I’ll just pretend the normalizing constant is known. It won’t matter.

The official definition of typical set comes from information theory. Here’s what it currently says on wikipedia (no Wegman version available; sorry): “In information theory, the typical set is a set of sequences whose probability is close to two raised to the negative power of the entropy of their source distribution.” The set is defined in the space of sequences theta_1,…,theta_n. Thus, the typical set is a subset of Theta^n. Wikipedia’s definition uses x where we’re using theta, and it uses log_2 where we’re using log_e; otherwise its notation is consistent with ours.

In Stan, however, people seem to be talking about the typical set as a set of values of theta, that is, the typical set is a subset of Theta, not a subset of Theta^n.

But then we can just take the information-theory definition and use the special case n=1. Then the definition of typical set is: all the values of theta for which log p(theta) is within some epsilon of its expectation. That is, the set of values of theta for which | log p(theta) + H | < epsilon. (OK, the wikipedia definition is less than or equal, but that doesn’t matter here because we’re working with continuous distributions.)

10-dimensional normal example

I’ll now illustrate with a 10-dimensional normal example in R:

library("mvtnorm")
n_sims <- 1e4
d <- 10
theta <- rmvnorm(n_sims, rep(0, d), diag(d))
u <- dmvnorm(theta, rep(0, d), diag(d), log=TRUE)
H <- - mean(u)

H is the entropy, which can be calculated analytically for this example as 0.5*d*log(2*pi*e). But just to keep things simple and general, I estimated it based on 10,000 simulations; it’s just the negative of the expectation of the log density, and it comes to 14.18. The mode has log density -(d/2)*log(2*pi) = -9.19. But, as we well know, the mode is not a typical value of the distribution! The maximum of u in my 10,000 simulations is max(u) = -9.62.
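
Just to tie the numbers to the formulas, here is a small check in R (nothing beyond the expressions already given above):

H_exact <- 0.5 * d * log(2 * pi * exp(1))   # 14.19, the exact entropy; close to the simulation estimate
u_mode <- -(d / 2) * log(2 * pi)            # -9.19, the log density at the mode theta = 0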

Using the wikipedia definition, the typical set for any epsilon corresponds to the values of theta for which u(theta) is within epsilon of – H. For example, if epsilon = 1, it’s the values for which u(theta) is in the range [-15.18, -13.18].

We can figure out coverage with a little R function:

coverage <- function(epsilon){
  mean(abs(u + H) < epsilon)   # proportion of draws with log density within epsilon of -H
}

Then we can try some values:

coverage(0)
[1] 0
coverage(1)
[1] 0.3403
coverage(2)
[1] 0.6372
coverage(3)
[1] 0.8475
coverage(4)
[1] 0.9413
coverage(5)
[1] 0.971

As epsilon increases, the coverage approaches 100%.

But this official definition of typical set is not quite what we’ve been talking about with Stan! The problem is that, to get high enough coverage using this interval centered at – H, you’ll end up including the mode after all! Check it out: – H is -14.18, and the maximum value of u, attained at the mode, is -9.19. So if we set epsilon = 5, the typical set will include the mode.
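
Continuing the R session, we can also compute these coverages exactly rather than by simulation (a quick sketch; the helper name is just for illustration). With the exact entropy, the event |u + H| < epsilon is the same as |Q - d| < 2*epsilon, where Q ~ chi-squared(d) is the squared distance of theta from the origin:

band_coverage <- function(eps) pchisq(d + 2 * eps, d) - pchisq(pmax(d - 2 * eps, 0), d)
band_coverage(1:5)   # 0.34, 0.64, 0.85, 0.94, 0.97 -- matching the simulated coverages above
eps_99 <- uniroot(function(e) band_coverage(e) - 0.99, c(1, 20))$root
eps_99               # about 6.6; the mode is only d/2 = 5 above -H, so the 99% band includes it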

Because of the law of large numbers, this problem does not arise with the classical definition of typical set as n gets large. But since we’re working with n=1, it’s a concern. And the definitions implicitly used by the Stan people are clearly working within Theta, not Theta^n. See for example the quotes by Bob and Michael above. Just to be clear, the n in the official definition of the typical set is not the same as the dimension of theta, which is 10 in my example. In my example, Theta is 10-dimensional, and Theta^n, if we were to work with it, would have dimension 10n.

Differences between the official definition of “typical set” and how it’s discussed in the Stan community

1. The first difference is that the official definition is in Theta^n, while in Stan it’s used to refer to a subset of Theta. So the use in Stan is a special case of the general definition.

2. The second difference is that, as noted above, the 99% typical set for the 10-dimensional normal as officially defined includes the mode, which is not quite how we’ve been thinking about things. The Stan intuition is not all wrong: as dimensionality increases, 99% typical sets will ultimately exclude the mode. Everything’s fine if d=100, for example (see the quick check after point 3 below). But for low dimensions, such as d=10, the skew in the posterior distribution of log(p(theta)) makes a difference. 10 is not a high number of dimensions, but 10-dimensional problems do come up in applications.

3. The third difference is that the typical set is not invariant to nonlinear transformations of theta. This can be seen as follows: if you transform theta nonlinearly, you’ll need to divide the density by the absolute value of the determinant of the Jacobian of the transformation, that is, you’re adding -log |det(Jacobian)| to the log density. The typical set of theta is the inverse image of a set defined on the log density (that is, you take the values of theta for which log(p(theta)) is in the range -H +/- epsilon), and changing log(p(theta)) by subtracting a log abs det Jacobian will change the mapping and thus change the inverse image. This is not a big deal, nor is it a problem—if you nonlinearly transform theta, you’re changing the geometry of the problem, so it makes sense that the typical set changes too. We should just be clear on it.

You can also see this by continuing with the above simulation in R. Just define phi = exp(theta), that is, the pointwise exponential, taking the exp of each of the 10 elements of theta. We can then compute the determinant of the Jacobian in R:

phi <- exp(theta)                 # componentwise transformation phi = exp(theta)
jacobian <- apply(phi, 1, prod)   # |det(d phi / d theta)| = prod_j exp(theta_j) = prod_j phi_j
u_phi <- - log(jacobian) + u      # log density of phi under the new parameterization

Under the new parameterization, we can compute typical sets of phi by working with u_phi. The typical sets of phi will be inverse images of ranges of u_phi near its expectation. The point is that the inverse image of a particular value of u_phi will not be the same as the inverse image of a particular value of u. So the typical sets will be different. The issue here is not about cancellation of density and area in computing total probability, nor is it about E(u_phi) being different from E(u). Rather, the issue is that sets of constant u_phi are different from sets of constant u. The rule for being in the typical set of theta is different from the rule for being in the typical set of phi, because the inverse image sets are different.
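
Here is the quick numerical check of point 2 promised above, for general d (a small sketch using the exact entropy rather than the simulation estimate; the function name is just for illustration). Since u + H = (d – Q)/2 with Q ~ chi-squared(d), the mode sits exactly d/2 above – H, so the 99% band centered at – H contains the mode precisely when the band of half-width d/2 covers less than 99% of the distribution:

includes_mode_99 <- function(d) pchisq(2 * d, d) < 0.99   # coverage of the half-width-d/2 band is P(0 < Q < 2d)
includes_mode_99(10)    # TRUE:  for d = 10 the 99% band contains the mode
includes_mode_99(100)   # FALSE: for d = 100 it does not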

Of the three items listed above, I think the one that could be most confusing is point #2.

There are various ways to handle point #2:

– We can just accept that in moderate dimensions such as d=10, the 99% typical set can contain the mode.

– We can talk about 50% typical sets instead of typical sets more generally. (Somehow from the discussions of typical sets in the Stan community, I’ve been picturing something like 99%, but that’s never really been specified.) But that won’t work if we’re using “typical set” to include most of the domain of integration.

– We can work with a different summary of the distribution of log p(theta). Instead of an interval centered at E(log p(theta)), we could use a central interval or shortest interval of log p(theta) that contains specified probability. Take the inverse image of either of such sets and you’ll get an alternative “typical set” that is doughnut-shaped and excludes the mode in that 10-dimensional normal example. Bob’s case study would work fine if you take the typical set to be the central or shortest posterior interval of log p(theta).

– We can talk less about the typical set and more about the distribution of the log density or target function. For example, instead of saying, “If we start our algorithm far from the typical set, then Nuts takes you toward the typical set, and not until then do you want to start sampling,” we can say, “If we start our algorithm with the log density far from its expectation, then Nuts takes the log density toward that expectation, and not until you are close to that expectation do you want to start sampling.”

I kind of like that last solution. Often when we talk about typical sets we can just talk directly about the log density or target function. I think this could be helpful for users (or developers) because this is the function that Stan is computing anyway!

Also, this last solution reinforces the message that the mode is not “typical.” If we want our simulations to be in the areas of Theta where log(p(theta)) is near its expectation, then we don’t particularly want to be at the mode. By definition, the log density is at an extreme value, not at its expectation, when theta is at its mode.
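
As a quick illustration of the central-interval option above, continuing the same R session: even at d = 10, the central 99% interval of the simulated log densities excludes the mode.

quantile(u, c(0.005, 0.995))   # roughly (-21.8, -10.3)
-(d / 2) * log(2 * pi)         # -9.19, the log density at the mode, which lies above the upper endpoint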

Again, I think most of the things that people in the Stan community have been saying about typical sets are vaguely correct. It’s just been hard for me to follow some of it without the precise definition, hence this post.

30 thoughts on “The typical set and its relevance to Bayesian computation”

  1. There is a definition of the typical set for which (3) does not hold, i.e. the typical set is covariant under reparameterisation. See https://stats.stackexchange.com/a/467667/193742 for the relevant formulas. There is then a reasonable question of whether this definition is natural, etc., but it is perhaps worth noting nonetheless. The typical set implied by this definition is covariant under reparametrisation for the same reasons that the KL divergence between two measures is invariant under reparameterisation, which was convincing enough for me.

    • Sp:

      I could well believe there are some subtleties here that I’m missing! I sent a draft of this post around to a few colleagues and they caught a few confusing points which I tried to clarify, but I’m sure there’s more to the story. I wrote the above post in large part because different things that people were saying about the typical set seemed to be contradictory, so this is an attempt to reveal my own confusion as well as that of others.

      • The definition of differential entropy isn’t invariant to changes in parameterization. Jaynes wanted to measure entropy relative to a base measure defined in terms of a “limiting density of discrete points” and I feel certain that other folk have come up with equivalent notions; such a thing would be required in e.g., Wallace’s (fully Bayesian) Minimum Message Length paradigm. The fact that MCMC doesn’t sample actual continuous values needs to be taken into account, which I believe means that you are right: the typical set as the Stan community conceives of it isn’t invariant to parameterization because the IEEE 754 representation for theta isn’t being taken into a transformed space of discrete values; rather the IEEE 754 spec applies untransformed to phi as well, which means there is no volume Jacobian to balance the density Jacobian.

  2. > “In high-dimensional parameter spaces probability mass . . . and hence the dominant contributions to expectations, concentrates in a neighborhood called the typical set.”

    I don’t think “concentrates” is the right word to use there (it evokes the high probability set, not the much more spread out typical set).

    > Because of the law of large numbers, this problem does not arise with the classical definition of typical set as n gets large. But since we’re working with n=1, it’s a concern.

    Do you mean n=10? Or don’t you think that the dimensionality of the problem is what gives the length of the sequences?

    > And the definitions implicitly used by the Stan people are clearly working within Theta, not Theta^n.

    It doesn’t seem so clear to me.

    • Carlos:

      1. The first quote is from Michael Betancourt, not from me. In his defense, there’s a limit as to how far we can convey mathematical concepts in regular language. His point, I think, is that the typical set is concentrated in the space of u, not in the space of theta.

      2. In my example, d=10 is the number of dimensions of the distribution. I don’t have an “n” in my example. The “n” parameter comes from the definition of typical set, which is defined in terms of theta_1,…,theta_n, so that the typical set is defined in the space Theta^n. I’m following the discussion in the Stan community, in which the typical set is defined for a single theta in the space Theta. Thus n=1.

      3. It’s clear to me. In all the Stan documents, whenever people refer to the typical set they’re talking about a subset of Theta. See the quotes by Bob and Michael above for examples. Or if they are referring to the space Theta^n, I think the documentation needs to be rewritten!

      • 1. I know it wasn’t from you. That discussion of neighborhoods is in the space of theta; if “concentrates in a neighborhood” has to be understood in another space, the formulation is a bit confusing (which was my point).

        2. The typical set in your example is defined in terms of theta_1, …, theta_10 as far as I can see.

        You say “Using the wikipedia definition, the typical set for any epsilon corresponds to the values of theta for which u(theta) is within epsilon of – H. For example, if epsilon = 1, it’s the values for which u(theta) is in the range [-15.18, -13.18].”

        Isn’t theta in the previous paragraph a 10-dimensional vector? Isn’t the typical set composed of 10-dimensional sequences?

        • Carlos:

          Regarding point 2, yes, there’s some confusion in the notation. I’ll try to clean that up in the above post. In the official definition of typical set, there are points theta_1, …, theta_n. That is, each theta_i is itself a vector, a vector of length 10 in my example. It’s my bad that I introduced confusion by referring to theta_1, …, theta_10 near the end of my post.

        • “Because of the law of large numbers, this problem does not arise with the classical definition of typical set as n gets large. But since we’re working with n=1, it’s a concern. And the definitions implicitly used by the Stan people are clearly working within Theta, not Theta^n. See for example the quotes by Bob and Michael above. Just to be clear, the n in the official definition of the typical set is not the same as the dimension of theta, which is 10 in my example. In my example, Theta is 10-dimensional, and Theta^n, if we were to work with it, would have dimension 10n.”

          Andrew, I think you added a clarification in that paragraph. But it’s really far from clear that “the Stan people” are working with N=1. At least this discussion from Michael Betancourt suggests otherwise:

          “In information theory compression is defined with respect to messages comprised of words. In isolation each component word is likely to be near the modal word, but as we construct longer and longer messages we provide more and more opportunity for some of the words to deviate. The probability that a long message contains no deviant words or all deviant words both become negligible, with most messages confined to the set of messages with “typical” fluctuations.”

          n is the number of words in a message (the length of the sequence x_1, …, x_n), the elements of the typical set are messages (sequences) of length n.

          “Probabilistically we can interpret a message as a point in our ambient space, with each word interpreted as the projection of that point along each dimension in the space. Any single component is likely to be found near the projection of the mode because of the pull of the target probability density function. The volume, however, induces fluctuations in some of the components which push the constituent point away from the global mode. Exact samples are “typical”, each influenced by fluctuations that push it somewhere within the typical set.”

          Messages with n words => Points with n coordinates

          https://betanalpha.github.io/assets/case_studies/probabilistic_computation.html#231_where_have_all_the_cowboys_gone

        • Carlos:

          It’s hard for me to say. In the document quoted in my above post, Michael seems to be referring to the typical set as a subset of Theta. But I couldn’t find any precise definition of typical set there, which is why I went to wikipedia.

        • Carlos, this is a source of confusion. In the usual information theory construction, there’s some parameter, which is *usually thought of as a univariate thing*, and then we discuss the typical set in the cross product of that univariate thing across n dimensions (ie. replications of the experiment). So you could think of

          theta^10 as sequences of 10 numbers each of which comes independently from the same univariate distribution, which need to be transmitted across a communications medium

          However, when the Stan group discusses it, they’re thinking usually of a single *vector* parameter, and the vector is not just repetitions of independent parameters, in fact what’s going on is

          Theta^1

          where Theta is a 10 dimensional vector with a complex dependent posterior distribution.

          To use the information theoretic definitions, we’d have to think about sequences like

          Theta^10 which would be a sequence of 10 vectors each of which has say 10 numbers, so that there’s 100 numbers total.

          But that’s not what the Stan guys are typically discussing; typically they’re discussing points in the space

          Theta^1

          And I think Andrew is right that there’s confusion in using the words “typical set”. I much prefer Andrew’s idea of saying “the set of typical density” or some such thing.

        • This is what one of the Stan guys writes in the case study linked in the post:

          “For example, Bayesian parameter estimates minimizing expected square error are expectations of the value of a parameter θ
          in the posterior (…) where we integrate over the legal values of θ∈Θ.”

          “Rather than integrating over the support Θ of θ in the posterior, it suffices to integrate over the typical set (…) That is, if Θ
          is the support for a density p(θ), then the typical set T⊆Θ has the property that for “well-behaved” functions, (…)”

          θ is a multi-dimensional parameter {θ_1, …, θ_d}. Θ is the set of all possible values of θ.

          The typical set is a subset of Θ, each element of the typical set is a value of θ, i.e. a multi-dimensional parameter {θ_1, …, θ_d}.

        • What I mean is that it seems that to use the concept of typical set here one has to make the correspondence x_1 => θ_1 , … , x_n => θ_d.

          Which may not make much sense, but I don’t see what alternative interpretation makes more sense.

        • But as Andrew mentioned, that concentration happens in a different space. In the usual example of the multivariate normal (with d going to infinity) there will be concentration of measure in the one-dimensional distribution of radius or density (a new random variable which depends on those many independent component random variables) and not in the original multi-dimensional space where the probability distribution of the multivariate normal is defined.

        • But even in the pure measure theory context concentration of measure is defined in terms of the Lévy mean, not the original multi-dimensional space. I take this to be the general concept and the coding/information theory version to be a specific instance of it where stationarity or independence is assumed, allowing the multi-dimensional measure to be defined in terms of just the transition kernel or marginal distribution.

        • I have no issues with the idea of concentration of measure.

          My point is that including the word *concentrated* in the statement “the region around the mode has low probability and most of the probability mass is *concentrated* somewhere else” creates more confusion than enlightenment, given that this “somewhere else” is a vastly larger region where the density is very low.

          Saying “most of the population of England is outside of London” is fine, saying “most of the population of England is *concentrated* outside of London” would be bizarre.

        • @Carlos: Setting aside this notion of typical set or concentration of measure, we’re trying to get across the point that random draws don’t wind up near the mode (they also don’t wind up near each other). You’re right that we’re defining functions like “distance to the mode” or “log density” to show that they are atypical of random draws. The typical draws are the ones you want to use to calculate expectations with MCMC. If you choose draws near the mode, they have high density, but they throw off expectations like E[log p(Y)] (minus the entropy) or E[d(Y, 0)] (average distance of a point to the origin). If we just take random draws (or any draws with the right stationary distribution like those from MCMC), these and other expectations work out.

          For instance, in a standard multivariate normal, the distribution of squared distances to the origin is chi-squared with degrees of freedom equal to the number of dimensions. Thus we don’t want a bunch of draws near the mode because they’ll throw off expectations like E[log p(Y)] or E[d(Y, 0)].
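
          For instance, continuing the R simulation from the post, a quick sketch of this check:

          sq_dist <- rowSums(theta^2)              # squared distance to the origin for each draw
          c(mean(sq_dist), var(sq_dist))           # close to d = 10 and 2*d = 20
          quantile(sq_dist, c(0.25, 0.5, 0.75))    # close to the chi-squared(10) quartiles below
          qchisq(c(0.25, 0.5, 0.75), df = d)       # 6.74, 9.34, 12.55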

          In HMC, we use the gradient not to literally climb toward the mode, but to exert force toward the mode on our simulated particle that’s balanced with the particle’s momentum to keep the flow in the volume where we want to sample.

        • > The typical draws are the ones you want to use to calculate expectations with MCMC. (…) If we just take random draws (or any draws with the right stationary distribution like those from MCMC), these and other expectations work out.

          It’s a good thing that random draws are not different from typical draws! Of course that’s why we say they are typical in the first place.

          If you want to stress that random draws from a multivariate distribution will be far from the mode, because everything is far in high dimensions, just say that.

          It seems that people get confused when they understand that instead of drawing from the distribution they should be drawing from some mysterious typical set, as if it were a different thing.

          A much clearer message, I think, would be to explain that one has to take draws according to the actual distribution instead of “choosing draws near the mode”, whatever that means.

        • > we’re trying to get across the point that random draws don’t wind up near the mode

          And still, they are more likely to wind up in a unit volume around the mode than in a unit volume anywhere else. It’s true that they are more likely to wind up in a much larger region that excludes the mode. But the points in that region are very different from each other (unlike the points in a small region around the mode).

          In the multivariate standard normal example the points at some fixed distance from the origin look exchangeable given the spherical symmetry of the distribution but if those co-ordinates represent parameters in a model the symmetry disappears. {mu=1, theta=-0.3, rho=0.2} and {mu=0.1, theta=0.6, rho=-0.9} are completely different animals.

          While I’m stating the obvious, I think this may be another source of confusion for those who try to understand the relevance of these ideas.

  3. Hi Andrew,

    Many thanks for writing this blog article!

    As a newcomer to the STAN world (but not to Bayesian statistics), perhaps the most confusing topic I have encountered has been the “typical set”. As an analyst, I understand why I want to use NUTS/HMC, but thus far haven’t managed to appreciate why I need to concern myself with the “typical set”. Here are some observations on your informative notes:

    In your d=10 Multivariate Normal (MVN) example, you mentioned that the mode had log density:

    -(d/2)*log(2*pi) = -9.19

    The expectation of the log density you observed across your 10000 samples (at -14.18) should be (and is) about 5 lower, as we are summing the squares of 10 independent N(0,1) variates, which will be chi-squared on 10 degrees of freedom. Hence this chi-squared distribution will have a mean of 10, and variance of 20. Thus we could describe the distribution of the log-density as something like:

    Log-density = -(d/2)*log(2*pi) – chi-squared (d) / 2

    Various metrics of this simple “typical set” (like expected value, percentiles etc.) could be determined using the above for any d.
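
    For instance, a quick sketch of such metrics in base R (the percentile choices here are just for illustration):

    d <- 10
    -(d / 2) * log(2 * pi) - d / 2                               # expected log density, -14.19 (i.e., -H)
    -(d / 2) * log(2 * pi) - qchisq(c(0.99, 0.5, 0.01), d) / 2   # 1%, 50%, 99% points: about -20.8, -13.9, -10.5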

    However I would ask, what value is this? Why should I care? Consider the following 5 sets of theta1-theta10:

    SET A = (-sqrt(10), 0, 0, 0, 0, 0, 0, 0, 0, 0)
    SET B = (-1, -1, -1, -1, -1, -1, -1, -1, -1, -1)
    SET C = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    SET D = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
    SET E = ( sqrt(10), 0, 0, 0, 0, 0, 0, 0, 0, 0)

    All except SET C have a Log-density of -14.19, since the sum of each term squared adds up to 10. SET C is our mode, and hence has a Log-density of -9.19. I struggle to see how SETS A, B, D and E are notable in any way. Sure, their average contribution to the log-density is “typical”, but what more beyond that? In the joint parameter space, none are particularly interesting. If we are to provide STAN with initial estimates, I cannot see why SET C wouldn’t be reasonable/appropriate, although clearly no random MCMC sample will ever have all 10 theta’s exactly equal to zero (and hence SET C is a bound on the log-density, and hence “atypical” from a log-density perspective, but not from a parameter perspective).

    For me, the goal of the d=10 example is simply to obtain 10000 MCMC samples which yield (approximately) the true means and var-covariance matrix for theta1-theta10. In the real world, these joint distributions of our model parameters will not be MVN, and any summaries (as means, modes, variances, covariances etc.) will always be crude and simplistic representations of potentially much more complex interrelationships.

    I hope this note was useful and error free, and I look forward to reading what others think on this interesting topic, since it would be useful to clarify for new/current users.

    Cheers

    Al

    • Al:

      As noted above, I think the Stan community might be better off abandoning the term “typical set,” at least until we can agree on a definition. Just a few comments:

      1. To be consistent with our notation, you should label A, B, C, D, and E as “points” or values of theta, not as “sets.” We’re using the term “set” to refer to a region within Theta, not a single point.

      2. When we’re talking about typical sets, we’re talking about values of log p(theta), which can be thought of as a one-dimensional aspect of theta.

      3. There’s a reason to focus on p(theta) rather than some other one-dimensional summary. That reason is that Hamiltonian paths preserve the total energy, and the potential energy in the Hamiltonian dynamics is -log p(theta).

    • (As Andrew pointed out, the elements of the typical set are points.)

      > All except POINT C have a Log-density of -14.19, since the sum of each term squared adds up to 10. POINT C is our mode, and hence a Log-density of -9.19. I struggle to see how POINTS A, B, D and E are notable in any way. Sure, their average contribution to the log-density is “typical”, but what more beyond that?

      I think that’s all. What makes the set of points at distance sqrt(10) from the origin kind of notable is that when you take points at random from an N-dimensional standard normal (with N very large) they will all surely be at distance sqrt(N) from the origin (using an appropriate definition involving epsilons and limits and that kind of thing).

      In the same way that when you take a sequence of N coin flips (with N very large) you will surely get the same number of heads and tails. In that sense, a sequence of 10 flips with 5 heads and 5 tails is more typical than a sequence with 6 heads and 4 tails or a sequence with 1 head and 9 tails (even though the probability of getting 5 heads and 5 tails is less than 25%). [The textbook definition of typical set would actually include all the possible sequences for a fair coin, as all have the same probability. I’m using “typical” in a more general sense in this paragraph.]

      You’re right that we are not usually confronted with infinite-dimensional standard normal distributions, though. The discussion about typical sets is interesting but I don’t know about its practical relevance.

      • Hi Carlos,

        Thanks for adding to this discussion. I fully agree with your comments “I think that’s all” and your last sentence “..but I don’t know about its practical relevance”.

        cheers

        Al

  4. Hi Andrew,

    1) Yes, better to label A-E as points A-E. My observation is that points A, B, D and E are unremarkable/uninteresting to me, but happy to hear why I should care (e.g. A and E were deliberately made to be odd/unrepresentative in the parameter space).

    2) I agree that log p(theta) is a univariate summary, but am struggling to understand why it is interesting to anyone!

    3) Indeed the Hamiltonian preserves the total energy, but if Hamiltonian=U(potential) + K(kinetic) energies, and K is chosen randomly at each HMC iteration, why do we care about any particular value of U (=log p (theta))? For example, if H = -5 (e.g. U = -4 and K = -1 OR U = -2 and K = -3), the value of U (-4 or -2) is not particularly of interest. Being provocative, who cares?

    There perhaps are ways the “typical set” is of actual use, so hopefully someone will chip in and defend why and how they use it, which would be very interesting to me!

    Cheers

    Al

    • 2) E[-log p_Y(Y)] is the entropy of Y. It’s a more general way of thinking about uncertainty than variance. When Y is discrete, E[-log_2 p_Y(Y)] is how many bits on average a compressive coder needs to encode a stream of draws from Y. That’s where the notion of typical set comes from—that stream’s average negative log density converges to the entropy.

      3) The standard notation for Hamiltonian mechanics is H(q, p) = V(q) + T(p) where q is the position in parameter space and V(q) = -log p(q) is the potential energy. The kinetic energy is usually taken to be T(p) = p’ * inv(M) * p / 2, where M is the mass matrix. The reason we care about H(q, p) is that’s where we do the equivalent of the Metropolis accept step and that’s what’s preserved when we solve the implied differential equation for the trajectory of a particle with velocity p and position q.
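
      For instance, here’s a minimal leapfrog sketch for a standard normal target with unit mass matrix (so grad V(q) = q and T(p) = sum(p^2)/2); it’s only meant to illustrate that H(q, p) is approximately preserved along a trajectory, not to reproduce Stan’s implementation:

      V <- function(q) 0.5 * sum(q^2)         # potential energy, -log p(q) up to a constant
      T_kin <- function(p) 0.5 * sum(p^2)     # kinetic energy with unit mass matrix
      leapfrog <- function(q, p, eps, L) {
        for (l in 1:L) {
          p <- p - 0.5 * eps * q              # half step for momentum (grad V(q) = q)
          q <- q + eps * p                    # full step for position
          p <- p - 0.5 * eps * q              # half step for momentum
        }
        list(q = q, p = p)
      }
      set.seed(1)
      q0 <- rnorm(10); p0 <- rnorm(10)
      out <- leapfrog(q0, p0, eps = 0.1, L = 50)
      c(V(q0) + T_kin(p0), V(out$q) + T_kin(out$p))   # nearly equal: the Hamiltonian is approximately conserved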

      But back in the world we care about, the velocity variable p is just for convenience and it’s the q variables we care about. Now we can get draws for q from draws for (q, p) just by dropping p, which is what we do. Then we can look at properties of the distribution of q.

      The reason the typical set is useful is that it’s the domain for computing expectations, which is all we ever use MCMC for. We don’t explicitly constrain ourselves to it (in fact, there are infinitely many typical sets depending on coverage), but the effect of preserving the Hamiltonian in the dynamics keeps our draws in the typical set.

      Andrew’s post isn’t saying this isn’t true, just that the bounds of the typical set were a bit more generous than we thought and it’s not the only set for which this construction works, just one that’s convenient for information-theoretic proofs.

      • Hi Bob,

        Thanks for finding the time to reply.

        2) I understand something about the origin of “typical set”, but see no relevance of these ideas in our field (=using HMC for Bayesian modelling). For example, Y is not discrete in our field. I am not encoding anything!

        3) Hopefully I understand HMC. The U and K nomenclature came from Radford Neal’s (brilliant) “MCMC Using Hamiltonian Dynamics” book chapter, so hopefully not completely alien to you :-) For what it’s worth (on the nomenclature generally used for HMC), it might be helpful to consider replacing “q” with theta, since I expect most of the STAN users “think” model parameters = theta (or at least never use “q”). The only advantage I see in using “q” is consistency with early published material (but that shouldn’t completely restrict us…e.g. the H in HMC changing from “Hybrid” to “Hamiltonian” for example).

        You wrote “The reason the typical set is useful is that it’s the domain for computing expectations, which is all we ever use MCMC for. “.

        I am not sure I understand what specific expectations you are taking, and what relevance it has to the “typical set”. For me, I give each point (= vector of parameters (thetas)) of my, say, 10000 samples, 0.01% of my belief. I then construct summaries based on functions of these parameters (e.g. outcome = theta1 * (theta2)^theta3). Nowhere am I using the “typical set” per se, but rather am just assuming my 10000 samples are indeed a (pseudo) random sample from the joint posterior distribution of my model parameters. Perhaps in your applications you are doing something else, hence my ignorance/question here.

        You wrote “We don’t explicitly constrain ourselves to it (in fact, there are infinitely many typical sets depending on coverage), but the effect of preserving the Hamiltonian in the dynamics keeps our draws in the typical set.”

        As you mention, there is no actual “typical set”, but rather it is just a construct of whatever coverage we would want it to be (= I see it has no relevance to anything). I see no special relationship between the “typical set” and HMC. You could get a “typical set” from any MCMC method…it is no more specific to HMC than standard Metropolis-Hastings (MH) (…in Andrew’s nice example, he isn’t using HMC, and we are just looking at the distribution of log density).

        Thus far, I still haven’t found anything useful/relevant about the “typical set”. If we want to explain that our joint posterior distributions are very difficult to effectively explore in high dimensions via, for example, MH, we can, but we do not need to reference any “typical set” to do that. Of note, Radford Neal’s book chapter on HMC makes no reference to the “typical set”.

        Hope this was clear. This has proved very useful to me…thus far I haven’t read anything to make me think I am indeed “missing something” here (as I originally feared).

        cheers

        Al

        • Al:

          I followed up with this on the Stan forums:

          Talking about log p rather than the typical set has been very liberating, and now it is making me think of some other questions.

          One way to think of the typical set is in terms of level sets of log p. The typical set of theta corresponds to the level sets of log p(theta) near its expected value. But then this makes me think of . . . slice sampling! In slice sampling, we’re implicitly reparameterizing the density in terms of level sets of p, which is mathematically equivalent to level sets of log p.
          In particular, slice sampling has two steps: Draw theta from within the level set of constant p(theta), and draw a value of p(theta) given theta. That first step (draw theta given p(theta)) is a lot like HMC. This is no surprise–the original NUTS algorithm uses slice sampling–but at the very least it’s helping me understand what people have been talking about when they say “typical set.”

          Hamiltonian paths have constant energy: that’s potential energy (i.e., – log p) and kinetic energy (from the speed of the particle moving through space). In his famous 2011 article, Radford talked about the challenge of HMC that it stays in a narrow band of potential energy. HMC can move efficiently within the level sets of log p, but it has random walk behavior in the dimension of log p itself.
          But this makes me wonder: are there some HMC diagnostics that would help us understand this? We’re already monitoring log p and have a sense of how it varies. But we also know the distribution of kinetic energy of the Hamiltonian paths. As Bob put it (https://statmodeling.stat.columbia.edu/2020/08/02/the-typical-set-and-its-relevance-to-bayesian-computation/#comment-1399953), “In HMC, we use the gradient not to literally climb toward the mode, but to exert force toward the mode on our simulated particle that’s balanced with the particle’s momentum to keep the flow in the volume where we want to sample.”

          So it seems that we should have internal evidence from our HMC runs (in warmup as well as in sampling) as to how the sampler is moving within and between level sets of log p. I’m wondering if this sort of thinking could move us beyond talking about the typical set?

        • Thanks Andrew…I will check out the comments there.

          Perhaps the thing I am missing is that the “typical set” is useful in the initial optimisation/warm up phase used in STAN. After we have a “good” set of MCMC samples, the relevance of the “typical set” is then not important (…the angle I have been coming from).

          cheers

          Al

        • Andrew: You say, “Radford talked about the challenge of HMC that it stays in a narrow band of potential energy. HMC can move efficiently within the level sets of log p, but it has random walk behavior in the dimension of log p itself.”

          I don’t think I said that, since I don’t think it’s true, at least when comparing HMC with simple alternatives such as random walk Metropolis updates.

          Caracciolo, Pelisseto, and Sokal pointed out that a Metropolis update can change log p by only order 1 on average, and this constrains how fast a pure Metropolis method can be, since it has to explore the range of log p by a random walk with steps of order 1. HMC has a Metropolis step, but it also has a momentum replacement step, which can change log p by an amount that is typically comparable to its range (about half the range, in some simple cases, I think).

          Maybe you’re thinking of a comment I may have made somewhere or other in a different context about how trajectories accepted by HMC can’t typically change H by more than about order 1. This is of relevance when the variation in log p is not just due to high dimensionality, but also to extreme skewness or multimodality. For example, suppose a high-dimensional distribution has two modes, one narrow and high, the other broad and low, with equal probability in each mode. You can’t expect trajectories that move from one mode to the other to be accepted often, because the probability density of a point in the low, broad mode is much, much lower than for a point in the high, narrow mode. (If the difference in modes is extreme enough, variation in the kinetic energy won’t be enough to solve this problem.) This motivates the modifications to HMC discussed towards the end of my 2011 article.

        • Radford:

          Thanks for commenting!

          Regarding your first two paragraphs: I’m starting from the position that HMC is better than simple Metropolis. When I talk about problems with HMC, I’m not saying that Metropolis is better, I’m saying that these are problems that even HMC gets stuck on.

          But I don’t have a good intuition about this. I guess the way for me (and others) to understand your last two paragraphs above is to do some simulations with some simple examples.
