**Ensemble methods**

[cat picture]

By ensemble methods, I (Bob, not Andrew) mean approaches that scatter points in parameter space and then make moves by inteprolating or extrapolating among subsets of them. Two prominent examples are:

There are extensions and computer implementations of these algorithms. For example, the Python package emcee implements Goodman and Weare’s walker algorithm and is popular in astrophysics.

**Typical sets in high dimensions**

If you want to get the background on typical sets, I’d highly recommend Michael Betancourt’s video lectures on MCMC in general and HMC in particular; they both focus on typical sets and their relation to the integrals we use MCMC to calculate:

- StanCon 2017: Everything you
*should*have learned about Markov chain Monte Carlo (starts at 4:40 or so; sorry for crummy video—the conference was DIY on a tight budget)

- Tokyo Stan: Scalable Bayesian Inference with Hamiltonian Monte Carlo

It was Michael who made a doughnut in the air, pointed at the middle of it and said, “It’s obvious ensemble methods won’t work.” This is just fleshing out the details with some code for the rest of us without such sharp geometric intuitions.

MacKay’s information theory book is another excellent source on typical sets. Don’t bother with the Wikipedia on this one.

**Why ensemble methods fail: Executive summary**

- We want to draw a sample from the typical set
- The typical set is a thin shell a fixed radius from the mode in a multivariate normal
- Interpolating or extrapolating two points in this shell is unlikely to fall in this shell
- The only steps that get accepted will be near one of the starting points
- The samplers devolve to a random walk with poorly biased choice of direction

Several years ago, Peter Li built the Goodman and Weare walker methods for Stan (all they need is log density) on a branch for evaluation. They failed in practice exactly the way the theory says they will fail. Which is too bad, because the ensemble methods are very easy to implement and embarassingly parallel.

**Why ensemble methods fail: R simulation**

OK, so let’s see why they fail in practice. I’m going to write some simple R code to do the job for us. Here’s an R function to generate a 100-dimensional standard isotropic normal variate (each element is generated normal(0, 1) independently):

normal_rng <- function(K) rnorm(K);

This function computes the log density of a draw:

normal_lpdf <- function(y) sum(dnorm(y, log=TRUE));

Next, generate two draws from a 100-dimesnional version:

K <- 100; y1 <- normal_rng(K); y2 <- normal_rng(K);

and then interpolate by choosing a point between them:

lambda = 0.5; y12 <- lambda * y1 + (1 - lambda) * y2;

Now let's see what we get:

print(normal_lpdf(y1), digits=1); print(normal_lpdf(y2), digits=1); print(normal_lpdf(y12), digits=1);

[1] -153 [1] -142 [1] -123

Hmm. Why is the log density of the interpolated vector so much higher? Given that it's a multivariate normal, the answer is that it's closer to the mode. That should be a good thing, right? No, it's not. The typical set is defined as an area within "typical" density bounds. When I take a random draw from a 100-dimensional standard normal, I expect log densities that hover between -140 and -160 or so. That interpolated vector y12 with a log density of -123 ** isn't in the typical set**!!! It's a bad draw, even though it's closer to the mode. Still confused? Watch Michael's videos above. Ironically, there's a description in the Goodman and Weare paper in a discussion of why they can use ensemble averages that also explains why their own sampler doesn't scale---the variance of averages is lower than the variance of individual draws; and we want to cover the actual posterior, not get closer to the mode.

So let's put this in a little sharper perspective by simulating thousands of draws from a multivariate normal and thousands of draws interpolating between pairs of draws and plot them in two histograms. First, draw them and print a summary:

lp1 <- vector(); for (n in 1:1000) lp1[n] <- normal_lpdf(normal_rng(K)); print(summary(lp1)); lp2 <- vector() for (n in 1:1000) lp2[n] <- normal_lpdf((normal_rng(K) + normal_rng(K))/2); print(summary(lp2));

from which we get:

Min. 1st Qu. Median Mean 3rd Qu. Max. -177 -146 -141 -142 -136 -121 Min. 1st Qu. Median Mean 3rd Qu. Max. -129 -119 -117 -117 -114 -108

That's looking bad. It's even clearer with a faceted histogram:

library(ggplot2); df <- data.frame(list(log_pdf = c(lp1, lp2), case=c(rep("Y1", 1000), rep("(Y1 + Y2)/2", 1000)))); plot <- ggplot(df, aes(log_pdf)) + geom_histogram(color="grey") + facet_grid(case ~ .);

Here's the plot:

The bottom plot shows the distribution of log densities in independent draws from the standard normal (these are pure Monte Carlo draws). The top plot shows the distribution of the log density of the vector resulting from interpolating two independent draws from the same distribution. Obviously, the log densities of the averaged draws are much higher. In other words, they are atypical of draws from the target standard normal density.

**Exercise**

Check out what happens as (1) the number of dimensions K varies, and (2) as lambda varies within or outside of [0, 1].

*Hint: What you should see is that as lambda approaches 0 or 1, the draws get more and more typical, and more and more like random walk Metropolis with a small step size. As dimensionality increases, the typical set becomes more attenuated and the problem becomes worse (and vice-versa as it decreases).*

**Does Hamiltonian Monte Carlo (HMC) have these problems?**

Not so much. It scales much better with dimension. It'll slow down, but it won't break and devolve to a random walk like ensemble methods do.

Thanks, Bob. Now that you spell it out, the intuition is very clear of why this fails in high dimensions. Do you think there is an area where these methods are useful? It would seem that costly, non-diffable, and fairly low-dimensional models (as used in astrophysics maybe) are actually perfectly suited for emcee.

Also, at what number does “high” start?

Thomas:

Based on my .234 paper, it seems that 5 = infinity.

The main utility of having an algorithm that only requires the density is that you can plug in a black-box. Many of the forward models in fields like astrophysics or energy management involve enormous external packages for which gradients aren’t available. These are easy to plug in because they do give you the density.

As far as size, if I recall correctly, it was higher than 5 dimensions where they just stopped mixing well, but I can’t remember the exact number. Certainly by 100 dimensions they weren’t mixing at all. Peter was probably evaluating with a simple hierarchical regression. And we validated it was getting the right answer in low dimensions to the extent we understood how to do that at the time.

P.S. Here’s a link to the feature request and discussion and the branch on GitHub. I can’t find the evals, as it was ages ago.

The answer is O(10) dimensions, where the coefficient depends on the extract structure of the problem being considered. Any algorithm that devolves to a random walk will end up behaving according to the results in Andy and Gareth and Gilks’ incredibly important paper. Goodman even implicitly acknowledged that in a paper, https://arxiv.org/abs/1509.02230, although he seemed reticent to acknowledge that in person.

It’s funny, in grad school I was consistently frustrated with theorists who would call any number greater than 1 (including 3, I kid you not) infinity. But when it comes to random walk diffusions in statistics, N = 5 is not that far off!

We love Hamiltonian Monte Carlo not because we work on it, rather we work on it because it’s the only algorithm that theoretically should scale to high dimensions!

I don’t actually think your last sentence is right here. HMC is like Bayes: it proposes a clean solution that, if done correctly (and this is the challenge) works in all nice (read: not discrete) situations. That doesn’t mean it’s the only algorithm that can scale to high dimensions. But it may be the only useful structure for building such algorithms.

As I and my colleagues argue in the HMC foundations paper, I’m happy to use any algorithm that constructs a measure-preserving flow on the target space. We just only know how to do that for smooth spaces, which yields HMC. I am eagerly awaiting the day we can do figure out how to do it for tree spaces next.

I have heard whispers that someone has worked that out…. But I’m gagging for the details!

I just ran across this preprint, “Probabilistic Path Hamiltonian Monte Carlo” (http://arxiv.org/abs/1702.07814). May be relevant?

You wrote: ” Goodman even implicitly acknowledged that in a paper, https://arxiv.org/abs/1509.02230, although he seemed reticent to acknowledge that in person.” It seems like you confused Jonathan Goodman here (co-author of the original Ensemble Samplers with Affine Invariance) with Jesse Goodman (co-author of Properties of the Affine Invariant Ensemble Sampler in high dimensions).

I agree on the utility. I currently work with a 16 dimensional space where i use the Goodman&Weare method. I don’t think it is feasible to use HMC. It works but it does take forever to mix. Unfortunately, I don’t think I have any better options…

Aslak:

Can you write your model in Stan?

I would say that the issue is already apparent in low dimensions. If I sample randomly from a segment connecting two points drawn from N(0,1) I won’t get the same distribution but one which is narrower: N(0,0.82). The width of the 95% highest density interval, for example, will be 82% of the original one. For a two-dimensional normal distribution this happens for each axis, so the area of a highest density region will be 67% of the original one. For a three-dimensional normal distribution, the HDR will be a sphere with 54% the volume of the original one. And as the number of dimensions keeps growing, the distributions generated will be more and more concentrated at the center of the original distribution (i.e. the hyper volume of the highest density hypersphere will be a exponentially decreasing fraction of the hyper volume of the original highest density hypersphere).

Sounds like this calls for less naïve ensemble methods that are designed with concentration of measure in mind.

@Corey and @Carlos Ungil:

The ensemble methods will get the correct stationary distribution asymptotically. The proposal from Goodman and Weare’s paper, for example, requires a Jacobian adjustment for change-of-volume along the interpolation path in the Metropolis accept step—it’s a pretty subtle point in the paper that took me a while to understand (Goodman helped by sketching it out on the board after a talk—that was before I understood Jacobians properly; it’s clear now why they have to do that just looking at the geometry of the typical set).

And they allow extrapolation moves, not just interpolation moves, so it’s not like you’re going to be doomed to climb up to the mode. (Though if you want to do that, you wind up with something that resembles Nelder-Mead gradient-free optimization.)

The crux of the problem is that interpolating or extrapolating between pairs of points in the typical set is not likely to wind up in the typical set. So it’s just a bad way to generate proposals. The proposals that will succeed are forced by the geometry of the typical set to be ones that don’t move far from one of the points in the set being interpolated/extrapolated.

Bob, what do you think of my proposal below for SPSA based stuff? as you say above, the advantage is you can plug in the black box type models. If Michael Betancourt applied his magic and figured out a way for SPSA to drive an HMC transition, and Stan could call out to a black box function evaluator using popen or some such thing… you guys would basically own the MCMC world both for analytical models where Stan can calculate the gradients using auto-diff and for black box models where you can use anything you like as a simulator.

I don’t mean simply proposing interpolations/extrapolations along lines connecting pairs of points; I mean something crazy like, I don’t know, iterating between using a variational auto-encoder to learn the typical set sub-manifold from an ensemble state and then using it as an independence sampler to propose a new state.

One that I’ve had simmering in the back of my mind for a while now is using the Simultaneous Perturbation Stochastic Approximation algorithm for the gradient, and then doing essentially HMC using these stochastic ensemble of gradients.

The context where this is useful is where the model is a black box code, like an astrophysics simulator or ODEs or a PDE solver or a molecular dynamics simulator or an agent based model or something like that. The idea with SPSA is in every dimension of the parameter space you either jump forward epsilon or backward epsilon. You then evaluate the function at this new perturbed point, and you use dF/epsilon_i as an estimate of the partial derivative in the direction of parameter i. Of course nothing keeps you from performing this several times and averaging to get a little less noise. But in high dimensions a numerical derivative calculation takes N evaluations of the expensive function whereas SPSA takes O(1) evaluation.

Then, you take an HMC type step and start the process over.

Now, you could imagine this as an HMC simulation in which the hamiltonian is itself randomly perturbed through time, that is, the error in your SPSA gradient calculation is an error with respect to the true posterior F of interest, but is considered as “correct” with respect to a constantly changing F* whose average through time is F. This is kind of like saying you’ve got a thermostat in a molecular dynamics simulator which adds and subtracts heat to give constant temperature, but the actual energy in the system at any given time varies sort of randomly through time.

That’s the intuition at least, the question of whether it could be shown to converge to what you want and soforth is left to the reader as an exercise ;-)

We don’t know exactly where the precision boundary is for which problems. We’ve found in some cases that we don’t have enough precision with existing algorithms and have to improve the iterative algorithms or modify some algebra. I don’t know where you might be able to get away with finite differences or single-precision floating point precision (about the same), or where we might need more than double precision. We’re also not sure about issues with using an approximation for the value of a function and an approximation of the derivative, rather than the derivative of the approximation. Taking derivatives of the approximation is easy and we know we’re differentiating the function we’re using, but taking an approximation of the derivative gives us more precision relative to the true derivative.

A big concern for us would be the instability of finite differences. I imagine you get the same problems with the method you describe or there might be more complicated steps to try to mitigate instability by adjusting step sizes.

Bottom line is that it’s an open question as to how much numerical precision is required under what conditions in the derivatives.

I guess that doesn’t surprise me about the issue in precision. When it comes to SPSA they find that for optimization purposes, the random errors in the evaluation of the gradient don’t matter in the long run, this is probably because you’re trying to find a direction to climb up hill and as long as you do go regularly up hill… you wind up higher than before. For optimization this is all that matters in some sense.

HMC, maybe not so much.

It would be interesting to do some very basic experiments, implement a generic HMC evolver in R for example, then plug in a multivariate normal distribution and an SPSA approximation to the gradient function, and see how it works. Maybe do something similar for an ODE problem, where you calculate the trajectory of a ball through a moving fluid and infer the velocity field using SPSA for the gradient of the posterior for the velocity field parameters… a typical kind of application for a black-box.

I think if someone gives you a tuned up HMC procedure, you could implement it pretty easily in R and see how it works, but the part where you actually need to tune the HMC proposal…. that seems more tricky. I don’t know a thing about how you guys tune up proposals in Stan.

Bob, it looks like these guys had a similar idea to mine:

https://arxiv.org/pdf/1503.01916.pdf

perhaps there’s something there that could be incorporated and generalized in Stan.

No. https://arxiv.org/abs/1502.01510.

The problem is that HMC moves too quickly for the stochastic approximations to average out to the true value. It’s literally too fast for all of the common approximation methods to be worthwhile. But that’s the reality of high-dimensional spaces. You do it right or you fail miserably.

SPSA isn’t a data subsampling method, it’s a local stochastic approximation to the numerical gradient method. See the paper linked above or here:

https://arxiv.org/pdf/1503.01916.pdf

The idea is essentially to do numerical differentiation in O(1) steps instead of O(N)

I could well see that it might be a big lose compared to auto-diff in a model with a well specified exact likelihood… but I don’t necessarily see that it’s obviously a loss when you’re working with an ABC type scheme and there is no well specified exact likelihood and the alternative is basically not to do inference at all.

@Michael

Does that mean there isn’t (or even stronger, they may never be) any way to do HMC for black box models (no autodiff possible)?

Open up the black box and reimplement so you can compute gradients.

In other words no?

Seriously though, imagine an agent based model of the genetic effects on spatial processes of chipmunks foraging for food, hiding nuts in caches, and trying to find mates. The agents operate on a 100×100 square grid with initially randomly assigned locations for trees and cache sites…. You are collecting statistics on total time spent by each chipmunk in each condition (foraging, caching, and mating) the total calories cached, the number of offspring for each genotype. You run the model for 5 seasons from a randomly assigned initial condition.

What does a gradient even mean here?

Very naive question from someone who has only played with Stan for about 5 minutes.

I have some large, mostly black-box PDE models. I also happen to know someone who just wrote code to compute model derivatives for these models.

Can I now use Stan for these models? That is, does Stan just need (external) functions for the model + model derivatives? Or would I need to actually implement the PDE models within Stan?

Ojm:

Yes, you can run Stan supplying your own calculated gradients. I’m not quite sure where in the documentation are the instructions for doing this. You could send a question to stan-users and someone would point you to instructions.

Thanks, will do

@Daniel:

Dumb question: How is SPSA different from just old-school partial numerical differentiation. The forward-backward-epsilon-jump part you described sounded just that.

You simultaneously perturb all the coordinates and evaluate the function once at the new point, and then use the difference between this one evaluation and the start point to calculate the derivatives.

if you start at point x, you perturb all the dimensions at once get a new point x* and then do:

(f(x*) – f(x))/dx_i is the partial for dimension i

so instead of N evaluations of f, one for each dimension, you do just one.

Interesting. So is that a free lunch, in general? i.e. When evaluating a partial derivative numerically, I’d have normally used a one-at-a-time finite difference.

Something like this:

https://mat.iitm.ac.in/home/sryedida/public_html/caimna/pde/Third/third.html

So if a single perturbation can give me all those partial derivatives, does that make the conventional finite difference approximation redundant?

Or is there a downside?

It’s not a free lunch — it’s a way of trading off computing time for accuracy in the finite difference calculation. SPSA is a stochastic approximation to the finite difference calculation; the more time you spend generating random perturbations of your state and averaging, the better your approximation.

That’s right, on average it will produce the right answer, but for any single attempt it will have noise in it. Whether this noise matters depends on what you’re doing with the gradient information.

One thing to consider though is that under some kind of central limit theorem concept, the error should go down like 1/sqrt(N) and for small N this decreases very rapidly, whereas for large N it decreases more slowly. So in something like 3000 dimensions, you could imagine that averaging 10 or 20 SPSA iterations would do pretty well for many purposes where the gradient doesn’t have to be super-precise, and be a factor of 3000/20 faster, but, if you really need to drive your error down to extremely small levels, even doing 3000 iterations might not be enough, so you might as well do the each-dimension version.

However, whether exploring a high dimensional space in an HMC like fashion would work… that’s a different story. Langevin dynamics, where you head in the gradient direction, and then diffuse from there might be a better solution for dealing with problems where you’re simulating stuff like the agent-based model I mentioned elsewhere. In cases where your simulation includes stochastic mechanisms within it (like the decision making of the chipmunks) there’s no sense in which outcomes are necessarily continuous functions of the parameters, and so the gradient doesn’t exist without some kind of regularization scheme.

The situations that I know of where ensemble methods are used commonly are things like weather prediction, where there are an enormous number of unknowns (I’m not a size queen and I hate the “my model is bigger than yours” rubbish that happens, but their model is bigger). But the people who do this don’t actually expect much out of their ensembles: they get a “good enough” measure of centre and no sort of idea of variability. It’s not that they’re ok with that, it’s more that nothing else scales to the gigantic and massively nonlinear/dependent setting yet, so they make do.

These people use Ensemble Kalman Filters (and their variants). I’m not familiar with either of the methods you mentioned, but a glance at their papers has the “statistical homeopathy” feeling (big claims, nothing examples).

This is different from ensemble MCMC, which tries to use an ensemble of states to construct efficient Markov transitions. In weather/climate/related fields the ensemble methods can be interpreted as crude implementations of Bayesian inference over an ensemble of possible _models_ not just states with respect to one target model. Here even a crude sense of center can be useful in trying to understand the underlying physical system. Not a great understanding, but better than a single point estimate from one model.

Great post Bob! You guys should do more blog-posts in this style.

We’re working up to writing the Stan book. Blog posts or knitr case studies are much easier.

Dumb question – is the goal to sample the high probability set (something something…smallest set of probability 1-eps) but you use the typical set as the nearest approximation? Or you are intrinsically interested in the typical set?

The usage of the term “typical set” is a little confusing. Wiki has this defined as: “In information theory, the typical set is a set of sequences whose probability is close to two raised to the negative power of the entropy of their source distribution. That this set has total probability close to one is a consequence of the asymptotic equipartition property (AEP) which is a kind of law of large numbers. The notion of typicality is only concerned with the probability of a sequence and not the actual sequence itself.”

So in that context, it’s thought of as a set of sequences of repeated draws. You might say that the entire MCMC run should produce ONE point chosen from the sequence of repeated draws of high dim vectors and as long as this ONE sequence of draws is in the typical set of the set of sequences of independent draws… you “might as well” consider the MCMC draw a set of independent draws.

The typical set is pretty loosely defined and is usually defined in terms of a finite or countable alphabet of symbols, so for continuous distributions it should be thought of as applying to a discretized grid of possibilities.

My impression is that the use of the term “typical set” in Bob’s posts is more or less synonymous with your “something…something smallest set of probability 1-eps”. The thing about probability is that there is usually no hard boundaries to the edges of sets, so if you draw a volume around some region of space and say “this is the high probability region” and you get *one* point that is just a tiny bit outside that region have you failed? Obviously not. Because of that, I don’t think you can do better than a slightly fuzzy concept of membership in the “good” set.

Thinking further about the concept of typicality of a sequence. Suppose that an MCMC technique produces a sequence of parameter vectors x_i (each x is itself a high dimensional vector of parameter values) now suppose that this sequence is “atypical” (ie. not in the typical set of sequences) precisely because it has serial correlations due to the exploration mechanism.

Now suppose that x_q(i) is a randomly chosen permutation from among all n! possible permutations of the x_i sequence. Suppose that x_q(i) is in the typical set of sequences for “almost all” permutations. This in essence says that each of the individual x_i values was “sufficiently typical” that none of them are too surprising, and so the only thing “wrong” with the MCMC sequence as a member of the set of typical sequences was its serial correlations. That would seem to be a good property for an MCMC method to have…. it rarely draws a vector that is “weird”

so I think that property that each draw from the MCMC sequence is itself not so weird as to put the sequence x_i outside the typical set of sequences is what they’re after.

Exactly what Daniel Lakeland said—it’s the set you need to integrate over to compute expectations with respect to the density. The point is that if you grab points around the mode, they don’t have the same properties (like sufficient statistics — sample means, variances, etc.) as general draws from the distribution.

I don’t see anything wrong in principle with including the mode point etc, the problem is including it too many times during a sampling algorithm.

The idea that the typical set is easier/faster to compute and is sufficient for expectations is fair enough to me, though.

Right, but the frequency with which the modal point should be included goes to zero very rapidly in high dimensions. So when you’re in something like 100 or 2000 dimensions, if you see the modal point once before the end of the universe you’re doing it wrong.

Michael explains this nicely in his paper on HMC. From a nonstandard analysis perspective, suppose we break down the parameter space into cubes of side dx. Then we compute the probability to be in a region a certain distance from the mode. We suppose we’ve scaled the dimensions of x in such a way that locally for small differences around the mode p(x) is isotropic (for ease of discussion)

p(x) is the density, and N(r)dx is the number of cubes at distance r multiplied by the volume of the small cubes.

N(r) grows like r^D where D is the dimension of the space, whereas p(x) because we’re at the mode, is more or less constant, or at least something like (1-r^2) for r near 0.

so the product goes like (1-r^2) r^D = O(r^D) so initially the probability to be away from the mode is growing like r^D which for large dimensions is super-fast. eventually as we get far from the mode, the density drops very fast so that the whole thing integrates to 1, but there’s some distance r* where p(x) N(r)dx has a peak and this is where all the probability is…

as D increases, the function r^D stays near zero for farther and farther distances, and then shoots upward extremely rapidly, so the probability to be farther out from the mode increases exponentially with D the dimension.

In problems like where you’re simultaneously measuring 5000 variables with measurement error (like say a bioinformatics problem), you wind up with 5000 dimensions and probability to be near the mode could be 10^(- 5000) times as small is probability to be in the middle of the “typical set”. “The average” never happens in the history of the universe.

What does “the middle of the typical set” mean? Maybe being in some region around the mode is not going to happen in the history of the universe, but it is even more unlikely that you’ll ever be in any given region of the same volume in what you call the typical set.

I’m with ojm, I don’t see a reason to exclude the mode. It’s unlikely… but less so than any other point. As long as your sampling includes this region with the correct probability it’s as good as (actually better than) the rest of the high probability region.

Personally, rather than

“The typical set is a thin shell a fixed radius from the mode in a multivariate normal”

I find it more natural to say that

“The typical set is a ball around the mode in a multivariate normal” keeping in mind that, as another commenter said, “most of the volume of a multidimensional orange is in the skin”.

Ah, didn’t see this one before I replied to Daniel.

RE: “The typical set is a ball around the mode in a multivariate normal”

I think it most consistent to say something like

“The [high probability set] is a ball around the mode in a multivariate normal”

while

“most of the volume of a multidimensional orange is in the skin [and this is what we call the typical set]”

We can hence approximate the high probability set by the typical set for most intents and purposes (especially computing expectations).

ojm, I agree but I’m not sure that sampling from this typical set will be easier than sampling from the full (and esentially identical) high probability set.

Good point.

I assumed it was easier because that seems to be what people do (well, according to this post).

But this might not be the case, depending on the algorithm (I have no idea!).

@ojm: This is a common misunderstanding and I’m about to roll out a case study that will help people understand what’s going on. There’s no philosophy here, just mathematics!

The key thing to understand is that while the mode may have a very high

density, the volume arond the mode is very small, so the probabilitymassaround the mode is zero to any degree of precision we care to compute it. The volume grows as the power of the dimension, so when you look at density times volume (really integrating density over volume), you’ll see that all theprobability massis in the thin shell to which I referred.Posterior expectations, which is what we compute in Bayes for parameter (point) estimates, predictions, event probabilities, etc., can be computed to within epsilon by averaging over the typical set, where epsilon is smaller than floating-point precision (let’s say 10

^{-300}with double precision).The key simulation to convince yourself of this is to take completely random draws from a multivariate normal. You can take billions of draws and none of them are going to be anywhere near the mode.

@Bob

My comments are strictly about the mathematics, ignore my side comment on the philosophy.

And I am referring to probability not density.

You referenced Mackay. Compare his smallest sufficient subset which his typical set. Different objects with different definitions. The former is ideal that latter is effectively identical to it but not exactly. Same thing we’re discussing.

Sorry meant ‘compare…with…’ obviously.

Ugh, many typos (excuse – was just woken up early on a Saturday by a fire alarm). But you get the gist.

ojm: replying at the bottom of the page because this is once again a little too nested

Sure, but you agree that *in principle* a Bayesian should average over the high-probability set (as ‘defined’ above) right?

It’s just that when you are restricted by finite sampling in high dimensions it is better to average over the typical set as this is easier to get a characterisation of and is *approximately* equivalent to the high-probability set, despite excluding the most probable individual point.

Otherwise what is the Bayesian justification for considering the typical set rather than the high-probability set?

Imagine you have the standard unit normal. Where is the “high probability set?” -10^30 to 4.95 or -5 to 5 ?

saying that you need to talk about the mode because it’s inside the high probability volume is like saying you need to consider the region (-inf,-3) because it’s “inside” (ie. to the left of) the “shell” whose outer boundary is near 5

You never need a point near the mode in high dimensions, precisely because you’re trying to produce a sequence of draws that is as good as a sequence of independent draws, and in an oracular perfect independent RNG for your posterior distribution, that modal point will never occur for a sequence of length less than 10^3000 or whatever.

Let’s put it another way, again using my favorite nonstandard analysis. Suppose that you decompose the whole space into cubes of side dx an infinitesimal. Now, you create N “balls” for N a nonstandard integer, and on each one you write an x vector and an integer from 1 to N.

Now you place each of the N balls in its appropriate box of size dx centered at the x you wrote on the balls, and you’ve arranged it so that (n(x)/N)/dx ~ p(x)

Now you draw a random sequence of K integers between 1 and N for K a standard integer, and pull out those balls and take the standardization of the x vector on each ball. This is now an “independent p(x) RNG” (asserted without proof).

Ideally, your MCMC system should produce a sequence of x values which is “as good as” the above independent nonstandard sampling algorithm at least for the purposes of calculating expectations. Since the expectation is insensitive to the order in which the sum is performed, the serial correlations won’t matter provided the expectation you get is “the same as” the expectation you would have gotten from one of the independent draws.

If the frequency with which you would get something near the “modal point” from the independent nonstandard RNG is very very small, then you should never get the modal point in basically any finite K-length sequence of draws coming out of your RNG (aka your MCMC procedure).

Perhaps then, we can simply define the “typical set” as any subset of the possible sequences of length K which gives you an “accurate” expectation for “almost every” function f(x).

hand waving, but I think that’s the basic intuition.

re:

“Imagine you have the standard unit normal. Where is the “high probability set?” -10^30 to 4.95 or -5 to 5 ?”

I believe high-probability subsets are relatively well-characterised (but I assume not necessarily unique for continuous distributions).

For a given eps ‘it’ is defined by the smallest subset containing at least 1-eps probability.

For a discrete distribution this most definitely includes the mode. Construction: order the points by probability, use a greedy algorithm.

The point here is not that you need to include the mode to get a good approximation to posterior expectations. The point is the principle vs the approximation.

What are you approximating with the typical set? Is it *necessarily* bad to include the mode? The seeming implication of the post is that it is definitely bad to include points outside of the typical set, e.g. the mode.

I was making a very simple, very pedantic point that this is true of particular sampling algorithms, not of characterising the posterior in general. In fact, it would be very worrisome for philosophical Bayesians if it was true in general.

Which gives me the very naive idea that one issue with the walker algorithm could be that it (presumably) keeps the number of walkers fixed, rather than the approximation probability (ie eps in the 1-eps above).

What if you allowed the number of walkers to vary? Eg aimed for the smallest number of walkers giving a fixed probability level. Or something.

> Perhaps then, we can simply define the “typical set” as any subset of the possible sequences of length K which gives you an “accurate” expectation for “almost every” function f(x).

I agree. For example, a typical point from a high dimensional normal will have a good mix of positive and negative coordinates. We could define the typical set as excluding the “all-positive” hyper-quadrant. Anyway, I will “never” sample one of those points before the end of the universe. But why would I want to do that? I see no reason. Why would I want to exclude the mode? I don’t see the reason either.

ojm, Carlos:

Take the standard independent unit normal in 100 dimensions.

Now, calculate the density of the statistic sqrt(sum(x_i^2)) this is the radius from 0 of a draw from the 100 dim normal.

This will have density equal to the chi distribution for 100 degrees of freedom (or maybe 99? I’m too lazy to look it up and it doesn’t matter for my point).

Now the density of chi(100) has mode at sqrt(99) and its density is proportional to x^99 so it has zero density at x=0

Now, this is a 1 dimensional distribution where our intuitions are much better. Where’s the high probability region of the chi? if you agree that it doesn’t include 0 where the density is actually zero, then it seems that the 100 dimensional normal shouldn’t *ever* draw a vector whose radius from 0 is 0, and also that almost all the vectors from the unit normal in 100 dims should have radius near sqrt(99).

You can recurse on this, for example take any of the choose(100,10) subsets of 10 dimensions and they should all have chi(9) distributions…

In every case, every chi has density zero at r=0

so the “modal point” in the unit normal is actually REALLY REALLY WEIRD it is NOT part of the high probability region of the 100 dim normal under pretty much any definition you can imagine.

As I note above, the typical set

isthe high probabilitymassset by definition. There is no probability mass to speak of around the mode because the volume is so low.You don’t need a Bayesian justification for considering the typical set. It’s a purely measure theoretic argument based only on the mathematics. Note that there’s nothing Bayesian in what I wrote above — this is just about where the probability mass is in a multivariate normal distribution.

@Bob, I think the issue is that the typical set is … not that well defined whereas ojm’s “smallest volume that contains 1-epsilon probability” is in principle pretty well defined if not unique, but you could say “any of the smallest volumes that contain 1-epsilon probability” and you’d probably be doing pretty well for an in-principle verifiable definition.

I think the part that ojm isn’t seeing is that the mode is pretty far away from where the rest of the probability mass is and so including it “stretches out” the set “too much” for it to be included in one of the “smallest” ones.

From an analytical standpoint, the transformation to the radial distance measure which gets you the chi-distribution certainly helped me to understand this issue. The chi density is zero at x=0 for all degrees of freedom greater than 1. Furthermore, for something like N=100 the density is essentially zero between 0 and around 7, so even a ball of radius 5 has zero probability mass inside it in 100 dimensions. So if you start with a sphere of radius 13 and subtract a sphere of radius 7 you’ll wind up with a shell of radius between 7 and 13, and 1-epsilon of the probability will be inside this shell… and so it really is the case that you should exclude the mode, because it’s not in the high probability region.

In R, if you do install.packages(“chi”) you can then in R do things like

curve(dchi(x,100),from=0,to=20)

and see what is going on.

@Bob

“the typical set is the high probability mass set by definition”

That’s not strictly true, which is my point.

@daniel “I think the part that ojm isn’t seeing is that the mode is pretty far away from where the rest of the probability mass”

No I see that. I tried to be somewhat clear in what I was and wasn’t saying above. Carlos is clearer

From Mackay:

> The best choice of subset for block compression is (by definition) [the smallest delta sufficient subset/high probability set], not a typical set. So why did we bother introducing the typical set? The answer is, we can count the typical set.

> As I note above, the typical set is the high probability mass set by definition.

What definition is that?

Great post, Bob!

Unfortunately many models in my field — computational neuroscience — do not (currently) work with Stan, since they involve numerical integration. See my post and great answers (by Bob and Michael) on the Stan User mailing list, exactly one year ago (https://groups.google.com/forum/#!category-topic/stan-users/general/-gaLDaepCwQ). Any news on that side?

At the time I tried Goodman and Weare’s walkers (Aslak Grinsted’s implementation in MATLAB) and was quite unimpressed with the performance, as the G&W method failed miserably already on a mildly correlated multivariate normal (rho = 0.8) in D=15 dimensions.

Since then, I hacked together a few tricks that yielded a black-box, derivative-free ensemble MCMC method that seems to work *much* better on my problems (still, I wouldn’t go above D~20, and care must be taken in the parameterization). I was already happy with that, but I was curious to benchmark the method on some standard problems (see my post on CrossValidated here: https://stats.stackexchange.com/questions/204559/performance-benchmarks-for-mcmc). At the time I couldn’t find a standard benchmark set of target functions for MCMC (as there are for, e.g., optimization), so I kind of left that on the side, with the plan to do some more thorough testing later.

If you have any suggestion on how I should test my MCMC method to see if/how it performs on other problems (I am definitely not making big claims), or any advice about benchmark target functions, please let me know! My plan was to build a set of benchmark functions, e.g., from multivariate normal to Neal’s funnel, etc., for D=2..20; and then measure various observables such as accuracy/precision of quantile estimates; mean/SD/covariance; effective sample size, etc., as a function of number of function evaluations (what I care about), for a bunch of derivative-free (ensemble) MCMC methods. BTW even though people use it I don’t really trust ESS alone, that’s why I want also to directly compute other known quantities.

It looks like a lot of work — but I am surprised how such comparisons between methods are not standard in the field (as they are, e.g., in optimization). Most papers that introduce a new MCMC algorithm delineate the theory, and then (maybe) test the algorithm on a couple of functions, sometimes not even against competitors.

We’re building out such a set ourselves, but it’s a much more subtle problem than in optimization, where you can measure (relative) distance to the right answer and there are many systems that give you the right answer on a wide range of problems. It turns out none of the “reference” methods in MCMC give you the right answer (Gibbs, Metropolis, etc.)

in a finite amount of time. Sure, they give you the right answer asymptotically, but we don’t have that long in real applications.Definitely we want to compare effective sample size (ESS) per unit time, but (1) that will vary by parameter and we’re only using estimated ESS, so there’s estimator issues to worry about, and (2) it only makes sense to measure this if the the sampler has converged. For example, Stan uses a conservative cross-chain and split-chain ESS calculation that will downweight ESS if samples corresponding to multiple chains don’t have the same sample means and variances. But samplers not actually converging (while nevertheless passing the non-convergence diagnostics) has been a problem historically with samplers like Gibbs and Metropolis.

One shouldn’t trust the street wisdom as conveyed by responses to that statsexchange thread—all of the samplers (including Gibbs and random-walk Metropolis) are bad if posteriors are multimodal. Some specialized samplers can deal with

very limited multimodality, which is not the kind of multimodality people usually care about, such as selecting the number of components in a mixture model or other variable selection problems, or combinatorially multimodal problems like clustering.For validation, you can look at things like the Cook, Gelman and Rubin approach (there’s also a paywalled published version, but I’d rather link to free versions of papers).

Thanks Bob for the detailed answer.

Looking forward to seeing your benchmark problem set for MCMC then.

And I hope that you’ll consider adding integration and derivative-free MCMC algorithms to Stan.

In the meanwhile, I’ll have to make do with other methods.

I appreciate that benchmarking MCMC is *much* more complex than optimization; although in my opinion it still does not fully explain the relative lack of benchmarking in the literature so far (I guess people still did it, just less formally).

For arbitrary functions, one wouldn’t know the ground truth. However, for a subset of carefully built functions (e.g., Neal’s funnel), it’s possible to compute summary statistics analytically (or numerically, e.g. by exploiting symmetries and reducing the problem to some complicated but tractable numerical integration). It’s not perfect, but seeing how the RMSE (across different runs) of various summary statistics changes as function of fevals (or time) is still somewhat informative.

As far as I am concerned, true multi-modality is way beyond what can be done at the moment without exploiting knowledge about the problem (a priori knowledge, or gained by fiddling a lot with the model).

It’s great that you have a (relatively) reliable way to compute ESS. All estimators I have tried so far did not look very reliable.

Thanks for the link to the paper.

This is the old warning that most of the volume of a multidimensional orange is in the skin, right? Or as one ML guy I follow on twitter put it:

https://twitter.com/dribnet/status/703870731810025473

“curse of dimensionality, lemma 143: all random vectors are about the same length and the mean of two random vectors is extremely unlikely”

So in doing interpolation in feature spaces, it’s now popular to use spherical linear interpolation instead of straight interpolation.

Yes, same old same old. But it bears repeating as more people start adopting these methods in high dimensions without a lot of geometric intuition about higher dimensions.

Even better than spherical linear interpolation or elliptical interpolation? Going with the Hamiltonian flow. The conservation of the kinetic plus potential energy means you’ll be going very fast around the mode (high kinetic, low potential) and thus will zip right by it rather than being likely to stop.

Hey, we use differential evolution in cognitive modeling (race models). Observations tend to have at least 6 parameters, but with more complex designs (i.e. difficulty of the stimulus, bias, more than 2 choices, rewards etc.) you can end up with many more in the model.

It seems to work well for us, and I remember the main argument for why we use DE-MCMC is that it (allegedly) does very well in highly correlated spaces and is super fast with over multiple cores (especially multiple subjects as fixed effects).

Just looking at a pairs plot I have up now and there are some parameters with R^2=.99, a bunch around .9 and a few more at about -.7. The majority of parameters have absolute R^2> .7. I’m pretty sure those are R^2, not R values- the code says corr.

One thing we do with DE, which I don’t think was covered in the original paper is that we have a migration stage during warm-up.

I’m not sure exactly what the algorithm is but basically during warm-up if there are stuck chains it will migrate a stuck chain to be near other chains with some low probability (we use .05 by default). If the migration probability is too high then everything immediately just migrates to be near each other, which is worse than stuck chains as you never explore the likelihood surface. Although in the doughnut example I don’t think migration would help you at all.

I’ve been thinking for a while that I should compare our code to STAN as a benchmark. Does anyone have any comments on how STAN performs with highly correlated posteriors?

Also, is it possible that an ensemble method would actually introduce correlations into the posterior?

Cheers

Oh I should make it clear that the migration turns off for the majority of the sampling.

“One thing we do with DE, which I don’t think was covered in the original paper is that we have a migration stage during warm-up.

I’m not sure exactly what the algorithm is but basically during warm-up if there are stuck chains it will migrate a stuck chain to be near other chains with some low probability (we use .05 by default). If the migration probability is too high then everything immediately just migrates to be near each other, which is worse than stuck chains as you never explore the likelihood surface. Although in the doughnut example I don’t think migration would help you at all.”

NO NO NO NO NO! Chains getting stuck is a clear indication that

you are not exploring the entire posterior and hence getting biased

inferences. For some further discussion with an example, see

http://mc-stan.org/documentation/case-studies/divergences_and_bias.html.

This migration just masks this bias, it doesn’t fix it!

When talking about performance you have to consider not just the apparent

speed but also whether or not you’re getting the right answer. Only when

everything is getting the right answer can you compare speeds.

“I’ve been thinking for a while that I should compare our code to STAN as a benchmark. Does anyone have any comments on how STAN performs with highly correlated posteriors?”

[Stan, not STAN]

Yes, Stan does very well with highly-correlated posteriors. Even in

thousands of dimensions. And it will tell you when it encounters

pathologies to be worried about.

“Also, is it possible that an ensemble method would actually introduce correlations into the posterior?”

It’s not that it might induce correlations, it’s just that it might not

adequately explore and with not obvious indications. See above.

Seems fair. I have to admit that the strategy of going- I don’t like those chains so I’ll move them did always seem a little worrying.

I think it’s a case that we’re so far down the DE route with so much code and automatic checks and the ability to easily design models and have it all run so easily that even psychologists can use it (I did my undergrad in maths), that it’s getting harder and harder to switch algorithms. I could probably turn my whole PHD into- which sampling method should we be using?

I keep meaning to do more work in Stan on the side (as a sidenote- when you guys google Stan, do you get results for the Australian streaming service also called Stan? I wonder if there is a way to tell google I always mean Stan the software).

There was a tutorial using Stan with cognitive models but I think the syntax has changed completely since.

https://www.researchgate.net/publication/303905613_Bayesian_inference_with_Stan_A_tutorial_on_adding_custom_distributions

Hey Michael, I watched your video from Stan Con. So is the basic problem with DE is that it assumes you are trying to explore the region inside the shell, when actually you should explore the external of the shell?

When I was first showed DE someone drew a shell looking thing and then went, “How do we explore whats in there?”

@Bob

Off topic, but why isn’t there a “simple” way to install R (on windows). Why cannot I just do a install.packages(“foo”) and be done with it?

Cannot all the separate steps on this page be bundled into some sort of Windows installer? i.e. check R version; find appropriate Rtools version; download; install; set paths manually; verify rtools can be called from R; verify g++ can be called from R; the strongly recommended config section; makefiles etc.

https://github.com/stan-dev/rstan/wiki/Installing-RStan-on-Windows

Is this answered in the FAQ somewhere? Is there some fundamental reason why all this nuts and bolts details cannot be automated? Or is there just not enough interest to do it.

As it stands Stan seems like the most difficult to install of all the packages I have on my R installation. I’m sure you’d get a lot more adoption (not that you are any short as it is!) if only the Stan + Rstan installation was a one liner.

No offence meant. Just a constructive suggestion.

Rahul, I think you’re discounting how much of a PITA this sort of thing is and how few Stan developers are actually paid to do Stan maintenance programming. Doing something like this reliably and consistently is time-consuming. It’s pretty fair to say the current contributors sink their time into the pieces they can see some benefit from. That said I have no idea how much of a priority it would be for the project—I personally don’t care :)

Oh, I totally agree *if* it is that reason.

I am just curious as to what the reason is. Sure, I see your point it’s not like I can ask for my money back! :)

I’m just curious as to why it has to be so difficult in Stan’s case.

To put it another way, I’m not an irate customer. Just a curious observer.

I think relatively few people who do scientific computing do it on Windows. So probably installing on Windows isn’t a thing that has people clamoring for an easier install.

When it comes to an out of the box Debian Linux system, I can pretty much do install.packages(“rstan”) and be done, and this is because all the g++ compilers and libraries and boost and whatnot are either already there or I can do “apt-get install xxxx” for a list of things suggested on the stan website and never think about it again.

I love Linux too. I had the best of both worlds for a long time with a Win laptop n multiple ssh sessions signed on to a Linux server at the Uni. Very convenient.

I’m now contemplating running a Virtual Box Fedora instance on my Win laptop (dual boot is useless for workflows needing switching between Win n Linux).

But last time I tried the virtualization just killed the speed and responsiveness to the point of un-usability. Perhaps I need a beefier laptop.

Anyone with experience on this? Would love to hear.

@Daniel Lakeland: Surprisingly, I suspect most of our users are on Windows because they’re applied scientists or in business, not statistical computing experts. Of course, zero of our core C++/R/Python developers use Windows (I don’t know about the interface devs).

@Rahul: I completely agree that installation bottlenecks are seriously hurting our users.

We haven’t had much luck in automating scripts for installation on R—there are just too many moving pieces for us to accomodate everyone’s install as it is. It’s easy enough to bundle everything into a standalone Docker image, but that’s not what our users want. None of us know how to build the kind of installer you’re asking for.

The bottleneck is almost always getting Rcpp installed and talking to a compatible C++ toolchain. RStan actually doesn’t need to install Stan independently—it brings it in through the R Stan headers package. Once Rcpp is installed, installing RStan is usually pretty straightforward as it’s just a pile of R code. Usually users don’t need to install Rcpp for packages that depend on Rcpp, because it only uses the binary bits. We need to use Rcpp and inline at runtime in order to compile the C++ derived from translating Stan programs.

What can be problematic is not having the right versions of other software. It’s generally considered rude to have your installer steamroll your existing application. Definitely not something that can be distributed through CRAN. Also, I’m told that people only trust CRAN installs and aren’t going to run random scripts from projects like Stan (for all the usual safety reasons).

If someone wants to build us a Windows installer that works, that’d be fantastic. We take pull requests! We’d be happy to help with the Stan-related bits of the process.

@Bob, I don’t know for sure of course, but it’s become way way more common for people who do data analysis etc to have Linux even in a business environment with “All Windows” on the desktop. Lots of times I think it’s just having your windows machine connect to some “linux in the cloud” but I could be wrong. Certainly from what I know of say biologists in academia (including the research parts of say medical schools and hospitals) the typical situation is a laptop running Mac OS connecting to a Linux machine in a closet or under some grad students desk or in “the cloud”. I’d be surprised if in places like pharma companies it was fully dominated by running Stan directly on their desktop Windows machine instead of some Linux cloud computer. I certainly don’t think it’s ALL Linux in those environments, but the cloud computing stuff is making this more common than it was in say the late 90’s through mid 2000’s

Thanks Bob!

I wonder if any other R packages face this same problem that you do with Stan? It may be interesting to see how they handle it.

Personally, most of what I use works with just a install.packages(“foo”)

ojm: “You referenced Mackay. Compare his smallest sufficient subset [with] his typical set. Different objects with different definitions. The former is ideal [the] latter is effectively identical to it but not exactly. Same thing we’re discussing.”

Right, so Mackay defines the smallest sufficient subset as a subset of the base space X. In the context of the unit normal he’d be saying something like start at x=0 and move outwards symmetrically until you cover 1-epsilon of the probability. if you set epsilon = .05 you wind up with more or less [-2 to 2], whereas the “typical set” is actually defined over the set of sequences in X^N where this means the product space of N draws from X.

So when Bob and Michael talk about “moving around the typical set” what they are saying is …. an abuse of terminology, because the typical set is a set in X^N not a set in the space X.

But, their goal is to construct a *single* point in X^N such that it is a good sequence of draws, and doing so requires them to spend their time moving through a particular subset of the space X. And the subset of the space X that is good for the purpose of generating a sequence from X^N which is good… necessarily excludes the origin in a 100 dim normal for example.

Imagine you have an RNG like rnorm(N,1,0) except that it doesn’t use independent draws, it uses HMC. You can input any N you like. So you put in 30. Now the condition you want to satisfy isn’t that each of the 30 is itself a good independent unit normal draw, because you pretend that no-one knows how to generate unit normals (in the real application we’d like independent draws from the posterior, but it really is true that no one knows how to generate those independent draws for most real world posteriors). Instead the condition you want to satisfy is that for virtually any function f(x) sum(f(x_i))/N ~ integrate(f(x)dnorm(x,1,0)dx,-inf,inf) and that this is true for every random seed you use and N sufficiently large.

This condition is satisfied by sequences drawn from the typical set of X^N and gets to be a better approximation for larger N. So the typical set is exactly the set of “good” RNG *sequences*. The fact is that the typical set in the sequence space also contains essentially all of the probability of the sequence space.

Now, what is the name for the subset of the base space that you have to move around in in order for your sequence to be in the typical set? Well, as far as I know, there isn’t one. In fact, what you need to do in HMC is move through every subset of the space in such a way that the total time you spend in that subset is proportional to the total probability mass of that subset. So the time spent inside the sphere of radius 5 for a 100 dim unit normal…. is negligible precisely because the probability to be inside a sphere of radius 5 is negligible.

It’s not that they are “excluding” some region, it’s just that naturally they spend negligible time in regions where there is negligible total probability.

I’m fairly happy with the summary by Carlos above (and my minor modification of it). I think it reconciles the various intuitions sufficiently well, while sticking to clear(er) definitions.

Even when dealing with discrete distributions one should distinguish between the probability of any individual point and the probabilities of various collections or regions of points.

Again, sticking to discrete distributions, the definition of the high probability set is entirely coordinate free and does include the mode. The typical set typically doesn’t. So what? They are different objects with different definitions.

When sampling from a discrete distinction I see no reason not to include the mode, it’s just very unlikely – because all *points* are very unlikely in high dimensions.

That doesn’t mean it isn’t still a good idea to avoid being stuck in *regions* near the mode in practice.

Right, it’s not so much that you exclude anything in HMC it’s just that naturally you spend no time near the mode in high dimensions because a region with large diameter around the mode has no important probability mass in it. If your sampling scheme does stay in the vicinity of the mode, it will produce a sequence of draws that isn’t in the typical set, and that sequence will fail to give you good approximations to the expectation.

The use of the term “typical set” on this blog is I think unfortunately confusing compared to any kind of mathematically precise definition, which is actually in terms of sequences not locations in the base space.

+1. HMC doesn’t exclude the mode by design. It happens because when you are accumulating probability mass (say to estimate an expectation), there is very very little there. This has implications for any method that deals in probability mass, which is what you want. Seriously, Betancourt’s tutorial is the best on this: https://arxiv.org/pdf/1701.02434.pdf

@chris

This is still missing the very simple, very pedantic point Carlos and I were trying to make.

Read the actual definition the typical set.

In short – the typical set provides an arbitrarily good approximation to the high probability set, the latter of which is presumably what a Bayesian wants to compute the ‘best’ approximate expectations with.

One set includes the most probable *point* (lets stick to cases where we deal directly with probability mass), one doesn’t. Excluding one point in high dimensions doesn’t make any difference because excluding *any* point in high dimensions doesn’t make any difference.

When you talk about regions around the mode you are not talking about the mode point but a set of points. This set may or may not be any good. See high dimensional oranges, or whatever.

I am completely fine with the idea of computing expectations with the typical set. I don’t think I ever said it was a bad idea.

OK, yes I see your ‘point’, where it makes sense to talk about a point having a probability (i.e. discrete distributions). However, I think Bob Carpenter and Michael Betancourt are completely right to emphasize the typical set as much as they do (even if they use it slightly colloquially) to counteract the false (or absent) intuition about probability mass and expectations a lot of folks pick up from learning max likelihood first…

Points don’t have probability mass in continuous spaces, only volumes do. And the volume around the mode is so small relative to further out that the volume around the mode has very little probability mass. The probability of generating a point at random from a distribution in a given volume is proportional to its probabilty mass. So the volume around the mode isn’t any good if any good means having any probabilty that a point generated at random will fall in that volume.

Discrete spaces work the same way. If you have a chance of success of 0.7 in 100 i.i.d. binary trials, the probabily of seeing the mode, which is 100 successes, is very low. Here the notion of “volume” is permutations, and for a count of n, the “volume” is (N choose n). This is a bit tricky, because in 100 Bernoulli trials, 100 successes is very rare, despite 100 successes being the mode. In a binomial, the “volume” figures in explicitly because the outcome is just a count, and so 70 is the mode, not 100.

This has nothing at all to do with Bayes. Maybe defining a simple simple statistic will help. Suppose $latex y \sim \mathrm{Normal}(0, \mathrm{I})$ where $latex y \in \mathbb{R}^N$. Then define a statistic $latex f(y) = y^{\top} \, y$, the square distance from zero. It turns out this statistic is distributed as a standard chi-square distribution with N degrees of freedom if we are working with an N-dimensional distribution. The mode is a ridiculously distant outlier for this statistic. Drawing a point anywhere near the mode is so unlikely in high dimensions, that we can reject a null hypothesis that $latex y$ was drawn from a standard normal with a riduclously low p-value.

If we try to comptue expectations using MCMC,

$latex \mathbb{E}[f(\theta)] = \int f(\theta) \, p(\theta) \, \mathrm{d}\theta \approx \frac{1}{M} \sum_{m=1}^M \theta^{(m)}$

and we include a lot of draws from near the mode, we will get the wrong answer. It’s not that we’re excluing the mode explicitly in any way for Monte Carlo (or Markov chain Monte Carlo) calculations for integrals, it’s just that draws near the mode don’t arise in random sampling.

ojm: I did read the definition of the typical set and it’s a definition of a set on X^N the product space of N length sequences output by a sampler, so it really has nothing directly to do with *where in the X space* an individual sample found by the sampler should lie. This suggests that Michael and Bob’s usage of the term is colloquial rather than technical.

@daniel

Not really disagreeing with you in particular anymore, I think we’re roughly on the same page :-)

I didn’t want to go into a ton of math in the blog post, but I should point out that this isn’t a vague definition that we’re making up. Here’s a very nice intro (see definition 6, in particular):

And that’s from a senior level undergrad zoology class (though it is U. Washington)! Maybe we’re looking in the wrong departments for postdocs?

Bob, I think ojm was confused because when you say that you’re sampling from the typical set, technically the typical set isn’t a subset of the N dimensional parameter space (ie. a place in the parameter space where the HMC scheme should be flying around), it’s a subset of the N x K dimensional set of K-length sequences of N dimensional vectors that you get when you ask Stan to draw you K samples from the N dimensional posterior distribution. In your definition 6 it’s a sequence of N draws that is being used, not a single draw.

So when you say that “The mode isn’t in the typical set” the only thing you could really mean is something more like “sequences of length K that include points near the mode even once are not in the typical set of sequences of length K”. And, Stan’s job is to draw you a sequence of K parameter vectors that is in the typical set of K length sequences… so it’s a very relevant issue for a sampler.

Also, I guess for sequences of length 1 you could maybe define a typical set, but the typical set is most valuable when considered for long length sequences, because its properties are asymptotic for longish sequences. A single draw from a high dimensional space is not going to get you very good expectations for anything.

So, anyway, I think that’s where the confusion comes from.

@Bob

That’s exactly the definition of typical set I assume s you were referring to, and it is in fact distinct from the high probability set.

I’m not confused on your general point and the general correctness of what your getting at, I’m just pointing out a technical, pedantic point that I think you’re missing.

Carlos has made the same point above. Daniel has too, essentially.

Ugh…you’re getting at…etc etc

ojm, is another way of phrasing what you’re getting at, that for an arbitrarily small volume in continuous parameter space, the integral over the density is technically highest at the mode? But then, as Daniel Lakeland says, you want to length K sequences (where K is sufficiently large) to actually compute expectations, and then we go back to talking about accumulating probability masses over relevant volumes (which of course grow with dimensionality) and how any good sampler will need to stay in the typical set?

I guess I still don’t see how this relates to Bob Carpenters criticism of ensemble samplers, but I’m not sure it’s very important that I understand this, so no worries if we just want to move on ;)

edit for clarity: by “this” I mean the subtle distinction between ‘high probability set’ and ‘typical set’ that you are discussing.

Carlos gives a good summary below.

> the typical set doesn’t define the smallest volume that contains “practically all” the distribution. You can get a slightly smaller set with the same probability mass by adding the core of the hypersphere and scrapping a thin layer from the outer shell. That’s the high probabilty set

Does this have implications for the sampler discussion? I don’t know. I do know that one of Bob’s premises was that we want to sample from the typical set. He further seem to emphasise that sampling the mode was inherently bad (eg ‘by definition’ of the typical set).

These premises seem technically false to me. Whether that’s enough to save ensemble samplers is a different question.

OK this plus Bob Carpenters note below clears this up for me. Thanks!

For those who still (or ever did) care, the notes referenced above state that the definitions are taken directly from Cover and Thomas. Funnily enough, Cover and Thomas have a good, explicitl discussion of high probability sets vs typical sets. The connection is, of course, the AEP which explains why typical sets ar a good approximation to high probability sets.

Thanks ojm! I thought everyone was confusing density and probability mass because of the focus on the mode. This will really help clarify the case study on high dimensional mass vs. volumes and the typical set I’m writing up.

I went back and read the relevant bits of Cover and Thomas and see that the typical set and highest probability set aren’t the same.

The high probability set (for a given epsilon) is the set with the smallest volume with probablity one minus epsilon.

The typical set is defined as the set with one minus epsilon probability centered (in the sense of containing log density level sets plus or minus some value) around the (differential) entropy (expected log density of a draw).

I think we all agree that when we take a random draw from a standard (unit covariance) multivariate normal high dimensions, we have a 1 – epsilon chance of getting a draw in either set—that’s just the definition! I think we also agree that the draw is astronomically unlikely to be anywhere near the mode, despite the mode being in the high probability set and not the typical set. Of course, the draws are not likely to be anywhere near each other, either.

Of course, in some densities, like uniforms, the typical set covers the whole interval (every point has same density, so they all have log density exactly equal to the expected log density, and the high probability set isn’t well defined.

Thanks again—this was really helpful.

Well, this discussion ended far more positively than typical internet discussions! ;-)

Thanks for the case study and helpful comments – I wasn’t intending to nitpick, just genuinely interested in understanding the issues at stake.

Now to actually try Stan properly…

….or alternatively, the zoology dept. is teaching its undergrad stuff of questionable utility. What use is any of this to a typical zoology undergrad? I’m straining to see any application. And that too, of the rigor and theorem-proving which seems in those lecture notes.

Evolution at the micro (genetic) and macro (population) levels. At the cellular level, the coding is apparent—cells are just little thermodynamic computers that run discrete processes with proteins (and micro-RNA and whatnot). You even see a lot of information theory in language evolution models, such as how the sounds of a language code discrete symbols and how changes are likely to happen (for example, irregularities go away over time unless they’re very common words, which is why English still has case in its pronouns and a grab bag of odd auxiliary verb structure that’s person sensitive in a way no other verbs are).

The author of those class notes wrote a paper on the topic (just saw the intro—didn’t read it):

http://octavia.zoology.washington.edu/publications/BergstromAndRosvall09.pdf

>>>The author of those class notes wrote a paper on the topic (just saw the intro—didn’t read it):<<<

This last bit has been a pet peeve of mine. At least in Engineering I sensed a tension between teaching students (esp. undegrads) what serves *students'* interest the best. Versus teaching topics that the Prof. is most comfortable / familiar with (e.g. has published a paper on recently or has a grant for).

I knew of at least two Profs. who taught undergrad courses significant chunks of material closely aligned with their work but of very marginal utility to the typical undergrad. The opportunity cost is the crowding out of other (more useful) material.

But I guess utility is always subjective. But I wish more Profs. kept this discord in mind when choosing teaching material / courses.

Ok, if we restrict ourselves to N-dimensional probability distributions generated from a sequence of N i.i.d. random variables we can define a typical set. But I’m not sure if the concept is supposed to be valid for arbitrary probability distributions in N dimensions. And after reading Daniel’s comment I’m not sure if you’re talking purely about the N-dimensional distribution (i.e. the sequence of coordinates) or if there is also a sequence of N-dimensional points involved somehow.

And, as ojm mentioned, the typical set doesn’t define the smallest volume that contains “practically all” the distribution. You can get a slightly smaller set with the same probability mass by adding the core of the hypersphere and scrapping a thin layer from the outer shell. That’s the high probabilty set.

In information theory the typical set (sequences with sample entropy “around” the entropy of the random variable) can be useful to obtain interesting results. Are you defining the typical set here just for illustration or does it have any practical implication?

Yes to all three questions.

Yes, there’s a general definition (assuming continuity and discreteness and I’m guessing a measure theorist could generalize).

Yes, you’re right that the typical set isn’t the smallest volume set containing one minus epsilon total probabilty mass.

Yes, the typical set is a useful concept because it illustrates where random samples actually fall. By that, I mean that if you take a million draws from a 100-dimensional normal and plot their log densities, all of the draws will have log densities like those in the typical set, not like those of the mode. We regularly get the question on the Stan mailing list of why draws from the posterior don’t have log densities anywhere near as good as the max a posteriori estimate and doesn’t that imply Stan is broken because it’s not producing the “best” answer. That’s why I’m writing the case study.

It’s so nice to write “yes” rather than “no”!

I hope you can describe this general definition some time.

The usefulness of the typical set that you describe is also present in the high probability set. If you take a million draws, etc, the will have log densities like those in the high probability set. Does the typical set have any additional advantage apart from making the “skin of the multidimensional orange” point more obvious?

This is really the crux of the issue for me too – and I wonder if this has any implications for algorithm design or not.

‘This’ meaning whether the subtle distinction could lead to different algorithm design or not. Or have implications for improving rather than discarding ensemble samplers etc. I’m not really sure either way.

As Bob said, the typical set is defined in terms of “those points where the negative log density is within epsilon of the entropy” since Stan moves around a hypersurface defined by the negative log density, the idea is that all the Stan samples should have negative log density right about at some constant +- epsilon and that it should be very different from the log density near the mode, which is also the log density found by optimization… so I think in the context of explaining Stan, emphasizing the definition of the typical set in terms of the size of the negative log density could be helpful.

If you take independent draws at random from a high-dimensional multivariate normal, they will all fall into the high probabilty set and all fall into the typical set. But if you look at the max log density you get from 1000 draws (as I plotted above for a 100-dimensional normal), or even 1,000,000 draws, you won’t get a draw anywhere near the log density of the mode. You can use the chi-square inverse CDF to calculate the tail probabilities of (squared) distances to the mode.

This has nothing to do with Stan! It’s just how sampling works in high dimensions because there’s essentially no probabilty mass around the mode, so you never get draws from near the mode.

There are deeper implications for sampling. In general, if you’re computing an expectation from a Markov chain (or even from independent draws), you can replace the draws with draws that have the same expectation and ideally lower variance. For example, suppose you have a Markov chain and instead of thinning (keeping every 1000th draw or whatever you have to do to get Metropolis output into memory), you average (replace every 1000th draw with its average). The second sequence is not distributed according to the original distribution. It has the same mean, but lower variance. So you can use averages of draws to compute expectations. That’s what Goodman and Weare go over in their paper. You just can’t save these averaged draws and compute expectations for new nonlinear functions of the parameters after the fact—you have to save the averages for functions you’re computing. And you can’t use the averaged draws to compute quantiles in the usual way.

I should add that there are also implications for Euclidean Hamiltonian Monte Carlo, which in each iteration explores a level set with a fixed Hamiltonian (potential plus kinetic energy, where the potential is the negative log density and the kinetic is random standard normal). With random standard normal kinetic energy in N dimensions, the log density can change at most by a chi-square(N) variate. Michael goes over this in some of his early papers on the principles of HMC. Riemannian Hamiltonian Monte Carlo mitigates this problem, which is why it can explore the funnel density efficiently.

Interesting discussion. The original post tars differential evolution (DE) and the Goodman & Weare affine method with the same brush. But does DE have the problem outlined by the R simulation? I don’t think it does. The asymptotic acceptance rate in DE is supposedly 0.23 for large dimensionality, whereas the problem here implies an acceptance rate near 1 (because of the vastly higher posterior values).

Something I’m not clear about still-

In Michael’s Stan Con lecture, he talks about typical sets and expectation. The posterior is in the integral for the expectation. But I thought we were using MCMC to just get the posterior distribution. Going back through Gelmans BDA3 he says that we want the stationary distribution of the markov process to be p(theta|y).

Michael argues that points away from the typical set don’t give much to the expectation, so we don’t need sample them to calculate it. But does that means there is some extra step at the end to take that expectation and get back the posterior distribution? Is this a more optimal way of estimating the posterior?

Angus: we’re using MCMC to get a set of points which are “as if” sampled independently from the posterior distribution (at least for the purpose of computing expectations, which are of course insensitive to the order in which you sample things).

The typical set gives you ONE way to describe what a sequence of samples “should look like”, namely that the probability of the sequence should be very close to exp(-N*H) where N is the number of samples and H is the entropy of the posterior.

Thanks to the way that HMC is designed, the only step you need to take at the end of your run is to go ahead and compute a sample expectation from the sample you get.

It just so happens that in high dimensions, independent random samples from the posterior have the property that they never in a gazillion years include the points right near the highest density location. This is just the math of high dimensional spaces, and HMC, even though it isn’t a process for *independent* random sampling does give you a sample which has this same property. It does so without explicitly “excluding” this region, it just is the case that HMC stays away from this region just like independent random sampling does.

Hope that clarifies things.

Angus, I don’t think Daniel’s reply addresses your lack of clarity on MCMC (which doesn’t seem to have anything to do with regions of high density).

If I understand you correctly, you see a tension or contradiction between BDA3 focussing on how we want MCMC to give us (an estimate of)

the posterior distributionand Michael Betancourt saying that we’re using MCMC to give usposterior expectations. That’s why you’re asking if there’s a extra step at the end to transform the expectation back into the distribution.There are two points of confusion here. They are: (i) the reason why we want (samples that look like samples from) the posterior distribution per BDA3, and (ii) exactly what Stan does to get all those nice expectations and quantiles it computes from its HMC sampler. (i) The reason why we want those (samples that look like) posterior samples is because actually our end goal is to compute posterior expectations and the samples give us a good way to do that. (ii) Stan uses the HMC sampler to produce samples and then it uses those samples to compute (well, estimate) the expectations and quantiles. So there’s no extra step to get

backthe posterior distribution — you can just make Stan give you its samples and then you can use them to estimate whatever posterior expectation is of interest to you. Stan automatically reports some canned expectations that are almost always the thing we as statisticians are interested in.None of this has anything much to do with typical sets; that’s a red herring relative to the things you weren’t clear about.

@angus

As far as I am aware, in all/most of these large-scale Bayesian inference problems it is assumed that all/most questions are formulated in terms of computing the posterior expectation of some function. Essentially, one is looking to extract information in the form of functionals of the posterior – these could be the mean, median, variance etc of the posterior, or the expectation of some other function.

Thus one doesn’t look at the (high-dimensional) posterior itself, rather a collection of ‘properties’ of the posterior defined in terms of expectations.

+ 1. I will add that for some smaller-scale problems, I still get a lot out of plotting prior vs posterior densities using draws from Stan. Reading Betancourt’s stuff has made me appreciate though that the only *mathematically well-defined* operation to do with these densities is integration (i.e. compute expectations).

We can’t actually cover the posterior in high dimensions. Think of it this way, there are 4 sign possibilities in 2 dimensions, 8 in 3 dimensions, and in general $latex 2^N$ in $latex N$ dimesnions. No way to even get a draw in each quadrant in 100 dimensions. At most, we can hope to compute expectations. And the marginal distributions in fewer dimensions can have coverage, just not jointly.

So yes, we do want to take posterior draws from $latex p(\theta \, | \, y)$. But no, we can’t cover the posterior by so doing.

And when we take draws from the posterior, they pretty much all fall in the typical set by definition.

Hi Bob,

I really like this blog. Thank you for contributing to scientific discourse in this way – it’s important, and much appreciated.

I have assigned the above blog entry as a reading for the journal club that I run, on the suggestion of one of the PhD students who attends the club. We’ll discuss this next week, but I had a couple of issues I’d like to be clearer on before then.

Starting with the “executive summary” of five points in your blog, I wonder if you would mind elaborating on the following:

1. Your point #1 is that “we want to draw a sample from the typical set”. The difficulty I have understanding this is that any Markov chain which has reached is stationary state is obliged to draw samples from the typical set with probability proportional to the mass in that set (by definition of the chain). Obviously, the chain will also draw samples from outside the typical set (e.g. from the mode) with probability proportional to the mass in that subset. So, both HMC and DE (and the rest) are bound to draw samples from the typical set with exactly the same frequency. I think I must have missed something here. Perhaps your point #1 is intended to relate to shorter subsections of the chain, where the large-number limits don’t apply?

2. Your point #4 is that “the only steps that get accepted will be near one of the starting points”. This seems to restrict attention to proposals in the typical set. However, proposals near the mode will be accepted with high probability. These proposals are also generated with high probability by particle-based methods, which means that many proposals *other* than those near the starting point will be accepted.

3. If we take my point #2, that lots of proposals are generated and accepted outside of the typical set (e.g. in the region of the mode), then I can’t quite see how the sampler will devolve to a random walk (your point #5). Could you perhaps elaborate?

Given these issues, I investigated the properties of proposals generated by Ter Braak’s differential evolution, using simulation. I adopted your blog’s set-up, with a K=100 dimensional multivariate normal, with identity covariance matrix and zero mean. For this, the distance of samples from the mode follows (the square root of) a chi-squared distribution, with df=K. Suppose we define the typical set as the inter-quartile range. For K=100, that typical set is the range between about 9.5 and 10.5. Just as Betancourt points out, this typical set is quite narrow, and also is well away from zero.

Now, by definition of the inter-quartile range, 50% of samples from the target distribution fall in this typical set. So I wondered was the corresponding probability was, for proposals generated by Ter Braak’s differential evolution. I used the setting of lambda which has become standard in many applications: 2.38/sqrt(K).

In simulations, about 38% of these proposals fall in the target density’s typical set. That seems pretty good, right? Not that much smaller than for samples directly from the target distribution. I’ve put code for the simulations below, in case someone can find a mistake there.

I think I must have missed something crucial about the points in your blog post, and those that Betancourt made in his talk. Perhaps the problems being discussed, with particle methods, were about the probability of a particle method finding the typical set from bad start points? My simulations don’t address that, and maybe that’s what you were meaning.

Any pointers would be much appreciated.

Scott.

### SOME R CODE FOR THE ABOVE ###

K=100

n=1e4

# Generate some samples from the target distribution.

s1=array(rnorm(K*n),dim=c(K,n))

# The distance of each sample from the mode is:

d1=apply(s1^2,2,sum)^0.5

# Define the IQR as the “typical set”.

IQR=qchisq(p=c(.25,.75),df=K)^0.5

# For K=100, that is about (9.5,10.5).

# Marginal probability of a sample falling in the typical set is 0.5, by

# definition of the IQR. Now lest see what it is for DE proposals.

lambda=2.38/sqrt(K) # Standard setting.

s2=array(rnorm(K*n),dim=c(K,n)) # More samples.

s3=array(rnorm(K*n),dim=c(K,n)) # Again more.

p=s1+lambda*(s2-s3) # Differential evolution proposals.

dp=apply(p^2,2,sum)^0.5 # Distance of proposals from mode.

# What is the proportion of DE proposals that fall inside the

# typical set of the target distribution?

in.typical.set=(dp>IQR[1])&(dp<IQR[2])

print(mean(in.typical.set))

# Visualise how the distribution of distance-from-mode differs between

# the target distribution and the DE proposals.

plot(density(d1),main="Distance of samples from mode.")

lines(density(dp),col="blue")

abline(v=IQR,col="red")

legend(x="topleft",legend=c("Target Dist","DE Proposals"),lty=1,col=c("black","blue"))

# It looks like the proposals from DE cover the typical set pretty well.

I think the problem is that the points generated are relatively close. In your example, the distance between the base points and the proposals is around 3.5 and very rarely above 4.5. This is almost half the radius of the hypersphere, we could say it’s of the same order of magnitude. But it seems minuscule if we compare the area “covered” in a single step with the total space, so it will take forever to explore it. The ratio of the hypersurface of the hypersphere (taking conservatively R=9) to the area of the 99-dimensional “hyper circle” on the hyper surface (taking R=4.5, in the high end of the sampled values and ignoring that the tangential distance will be lower) is around 10e30 (and 10e40 may be a better estimate).

By the way, the same argument could be done directly in terms of the volume of the high probability region and the volume accessible at each step, instead of looking at the typical hyper surface.

Side point – has anyone looked at giving a notion of volume to the individual walkers in ensemble methods and then incorporating volume exclusion effects in the updates? Presumably this could help avoid crowding around the mode (but this is really just a naive thought).

This is what the Metropolis adjustment does. It’s actually dealing with volume in a subtle way to do the required Jacobian adjustment. The basic problem, as I keep stressing, is that interpolating or extrapolating among points drawn from the posterior is unlikely to be another point that looks like a point drawn from the posterior at random (and hence not in the typical set).

In fact the time to generate independent samples doesn’t depend on the dimension, so now I’m not sure there is a problem at all… At least in this example, if we ignore the burn in (and assume the ensemble is large enough) N=100 doesn’t seem to be problematic. What are we missing?

The idea here is that due to DE’s random-walk-like behaviour the effective sample size for a given amount of computation will be much smaller than that of HMC.

The idea here is that in high dimensions a random walk, even one that always accepts moves up the density, will have a hard time finding the region near the mode — volumetrically it’s a needle in a haystack. If you start near the mode, the random walk will wander away from it and eventually hit the typical set and then stay in that general vicinity.

I understood the original post as making a completely opposite point: that ensemble methods, or specifically the Goodman and Weare affine method, are too good at finding the mode and therefore inefficient at sampling from a distribution where most of the density lives reasonably far away from the mode.

This may in fact be the difference between “asymptotically for long time” and “in actual terms before I get tired of running the algorithm”

My intuition is leading me astray somehow — I was reasoning from the theorem that if you start with independent samples from the target distribution and step one of them through a DE-MCMC step, the resulting state has the target distribution as its marginal distribution.

That’s only true after the Metropolis adjustment (if I’m recalling DE correctly). What’s going to happen is that you’ll make a proposal, and unless it’s close to one of the original points, it will be rejected. Again, interpolations and extrapolations among points in the typical set are unlikely to fall in the typical set, hence they’ll get rejected by Metropolis because keeping them would lead to biased draws.

I was getting confused because I was thinking the top plot shows the log-density of proposals (and since they’re higher than in the typical set, they’d automatically be accepted). But that’s not how plain jane DE-MCMC works, which I ought to have remembered since I coded it up in Matlab one time. I was right the first time: as dimensionality increases DE-MCMC is going to have a hard time finding the direction that points up the log-density.

The proposition ‘”DE-MCMC proposal was accepted” implies “proposal was close to one of the original points”‘ isn’t true. (I’m not asserting that this was your claim.) DE-MCMC picks two points at random from the ensemble, takes their difference, scales it, and then the proposal is current state + scaled difference vector. So if the two points are far apart and one of those point is close to the current state then the proposal could be deep into the region of high density interior to the typical set. It’s just that in high dimensions this situation will be very rare (which I believe is equivalent to what you’re saying).

The point is that interpolating (or extrapolating) among random points drawn from the typical set is unlikely to produce a point that’s also in the typical set. That is, you won’t produce a point drawn from the posterior, which means that the proposals will tend to get rejected unless they stay close to one of the points in the interpolation set.

Didn’t see this earlier, but I can answer now.

1. Yes, you can choose any set with 1 – epsilon probability. If you look at the volume around the mode, it’s miniscule in high dimensions, so even if you choose your probability 1 – epsilon set to include the mode, you won’t see any draws from it. It’s not like anyone’s excluding the mode or pushing things toward the typical set. It’s just how the points get drawn. I’m about to roll out a case study on all this with more examples, which is why I’m revisiting the blog post.

2. No, proposals near the mode will

notbe accepted with high probability. If you follow the Goodman and Weare paper, you’ll see they do a Jacobian adjustment (for change in volume). The problem everyone’s having is reasoning in terms of densities rather than in terms of measure. Densities only matter under integral signs to compute measures. Only the mass matters, that is, the density integrated over volume. The point is that the volume around the mode is so low that having a high density can’t compensate and the overall mass remains small.3. I’m afraid this is based on a false presupposition. See (2). And seriously, if you don’t believe the math, try it computationally!

Pretty much *all* the draws should be in the typical set. If you’re accepting at a 38% rate in 100 dimensions, my guess is that you’re not moving very far from the points you started with. Can you measure how far the new draws are away from the closest particle in the set from which you’re interpolating/extrapolating? My point wasn’t that these methods wouldn’t draw from the typical set, but rather that they’ll be slow to explore the typical set.

Personally, all my comments were based on probability (measure), not density. And on the basic definitions of typical vs high probability (not high density!) sets.

> If you’re accepting at a 38% rate in 100 dimensions, my guess is that you’re not moving very far from the points you started with.

I think the 38% he quoted is not about the proposals being accepted, but about them falling within the typical set. A proposed transition between two points in the typical set is not automatically accepted, it depends on the relative probabilities (and this ratio can be a very small number).

I was also confused by this issue, due in part to this quote from the “conceptual introduction” paper (fig 10) “Smaller proposal variances stay within the typical set and hence are accepted” which seems to suggest that if the proposal is within the typical set it will be accepted.

I feel that the figure showing “atypicality of interpolated draws from a standard Normal” (above) is somewhat misleading. In the Goodman and Weare algorithm, a proposal for position x(t) is generated like this:

y = xref + Z * (x(t) – xref)

where xref is a reference walker and Z has PDF g(z) = c/sqrt(z) for z in [1/a,a] and a > 1. In the simulation above, Z is replaced by 0.5 every time which is not how the ensemble sampler does it. It would be interesting instead to repeat the simulation with Z drawn from g and then compare if the proposal values cover the true values or not.

Typically a is taken to be 2 but in principle it is a tuning parameter. So we could repeat the same experiment for a = 2, 1.8,1.5,1.2 etc. Just like stepsize needs to be adapted for HMC, my guess is that adapting a to the particular problem at hand will give better results. Similarly, just like stepsize is jittered, parameter a could be randomly jittered as well.

There’s a Jacobian adjustment in the Goodman and Weare Metropolis step—otherwise it wouldn’t work. My point was just that the only proposals that are going to be accepted are ones that are near one of the starting points [x(t) or xref in your notation]. Anything too far in between or beyond the two points will just get rejected. So you can tune the proposal all you want, but it better propose points near where they started, or they’re going to get rejected.

Similarly, in differential evolution (the other paper cited), which uses two other points and moves a third along the direction between the other two, you won’t be able to move very far without leaving the typical set.

Both of these approaches will let you take small steps that adjust for global covariance, but most of the interesting problems, like hierarchical models, have varying curvature (Hessian and hence covariance changes with different parameter values). This only helps after some degree of convergence when the points are arrayed with their posterior covariance.

P.S. We’ve never found a way to make jittering useful for HMC. If the steps get too small, we wind up taking lots of steps in order not to fall back to a random walk.

I’ve just noticed that the case study has been published at http://mc-stan.org/documentation/case-studies/curse-dims.html A few comments follow.

3 Typical Sets / A Continuous Example of Typicality

“The fact that typical sets exist is why we are able to compute posterior expectations using Monte Carlo methods—the draws from the posterior distribution we make with Monte Carlo methods will almost all fall in the typical set and we know that will be sufficient.”

How could the typical set not exist? Let’s take for example the N-dimensional probability distribution uniform in the hypercube [0 1]^N. The typical set will be, I think, trivially equal to the whole distribution. Maybe with your definition the typical set doesn’t exist in this case, but Monte Carlo methods will work just the same.

“We saw in the first plot in section 4”

Was the order of the sections modified?

3 Typical Sets / Definition of the Typical Set

“The formal definition of typicality is for sequences x1,…,xN

of i.i.d. draws with probability function p(x) [……] assuming the Xn are independent and each has probability function p(x).”

This may be applicable for the standard multivariate normal, but not for general probability distributions.

3 The Normal Distribution

There is something wrong with section numbers.

4 Vectors of Random Unit Normals / Concentration of measure

How is this relevant in general? In this example, there is isotropy and we calculate the expectation of the distance to the center. All the points with the same likelihood are equivalent and there is a concentration of this value around sqrt(n) as n goes to infinity. And if we look at the value of the likelihood, the points with the same likelihood are obviously equivalent so we see again the concentration at the corresponding value.

For more general functions of the parameter vector, points in the parameter space with similar likelihood do not correspond to similar values. For example, that is the case for the Bayesian parameter estimate mentioned in a previous section.

Maybe the point can be made that for an N-dimensional distribution the typical set (defined as a band of constant likelihood) will be concentrated as N goes to infinity on a (N-1)-dimensional subspace. But I don’t really see any practical consequence.

4 Vectors of Random Unit Normals / No individual is average

“the average member of a population is an outlier.”

True when the metric we look at is ‘distance to the average’.

“How could an item with every feature being average be unusual? Precisely because it is unusual to have so many features that close to average.”

You don’t even need many features. Let’s say the height in a population is distributed as Normal(170,15). The average absolute difference between an individual’s height and 170 is 15. For someone measuring 170 the deviation from the mean is indeed unusually low at 0 (two-sided p-value 0.05). [I’ve not checked the numbers, the details could be wrong]

“As we saw above, the average member of a population might be located at the mode, whereas the average distance of a population member to the mode is much greater.”

What is interesting about the average distance to the mode in general?

Search result landed me here and just wanted to ask to clarify on “ensemble methods” here being a bit different from what is being called “ensemble methods” in ML. I just skimmed the above but it sounds like non-interacting particle system stuff and if I turn the ML ensembles into something more sensible from a Bayesian sampling perspective they (for example gradient boosting) are more like an interacting particle system growing to achieve some minimal loss against a data generation process (and no explicit prior usually).

Anyway, I’m probably way off on my guess but was hoping this might help the next person who likely lands here from a search on “ensemble methods” and “curse of dimensionality”.