So either IID sampling or HMC sampling will give you points that are in the typical set and *not* particularly near the mode in Euclidean distance.

If you want something that can’t be calculated effectively by IID sampling, such as the location of the mode, then you also won’t be able to calculate it effectively with HMC type sampling.

The point of HMC sampling is that it gives you a sample which is “close to IID” in the sense that the effective sample size is nearly the same as the actual sample size. Anything you can calculate with IID samples you can calculate pretty well with HMC samples…

Bayesians don’t do optimization when it can be avoided. What we care about for Bayesian inference is posterior expectations. For example, point estimates are typically calculated as posterior means, because that minimizes expected squared error (relative to the model, of course); the posterior median minimizes expected absolute error. The posterior mode doesn’t have any such property. So we don’t pay much attention to the mode.
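That decision-theoretic claim is easy to check numerically. Here is a sketch in Python (numpy), using a skewed lognormal as a stand-in for a posterior; the grid search and the example distribution are mine, not anything from the thread:

```python
import numpy as np

# Draws standing in for posterior samples; a lognormal is skewed, so its
# mean, median, and mode all differ, which makes the comparison visible.
rng = np.random.default_rng(0)
draws = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Candidate point estimates on a grid; pick the one minimizing each loss.
grid = np.linspace(0.0, 6.0, 601)
sq_loss = ((grid[:, None] - draws) ** 2).mean(axis=1)  # expected squared error
abs_loss = np.abs(grid[:, None] - draws).mean(axis=1)  # expected absolute error

best_sq = grid[sq_loss.argmin()]    # lands at the sample mean (~e^0.5 ~ 1.65)
best_abs = grid[abs_loss.argmin()]  # lands at the sample median (~e^0 = 1.0)
print(best_sq, np.mean(draws))
print(best_abs, np.median(draws))
```

The squared-loss minimizer tracks the mean and the absolute-loss minimizer tracks the median, exactly as the comment says.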

Having said that, Andrew likes to think of a Laplace approximation on the unconstrained scale as a kind of approximate Bayesian posterior. That makes it very much like variational inference, which centers an approximate distribution at an approximate posterior mean rather than at the posterior mode.

*constant* as in “ranging over ~35 orders of magnitude” (for the 95% set in the 1000-dimensional standard Gaussian).
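A rough check of that figure (a Python/numpy sketch of mine: the log density of a d-dimensional standard normal is −r²/2 plus a constant, so the density range across the central-95% shell is exp of half the spread in r²):

```python
import numpy as np

d = 1000
rng = np.random.default_rng(1)
# r^2 for a standard normal in d dimensions is chi-squared with d dof
r2 = rng.chisquare(d, size=400_000)
lo, hi = np.quantile(r2, [0.025, 0.975])

# log10 of the density ratio between inner and outer edge of the shell
orders = (hi - lo) / 2 / np.log(10)
print(orders)  # roughly 35-40 orders of magnitude
```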

It will also be naturally attracted to the HPD set, for the same reason. Actually it will be slightly more attracted to the HPD set, given that it is slightly more compact than the typical set (for the same probability mass).

> (which is typically near the boundary of the 95% HPD set)

In high dimensions, the whole HPD set is also near its boundary.

The funny thing is that if you have a multi-dimensional standard normal distribution, the midpoint of two random points is distributed as a multi-dimensional normal distribution with variance 1/2. Therefore, it’s enough to modify the algorithm slightly, from (a+b)/2 to (a+b)/sqrt(2), to fix the problem (whatever the number of dimensions).
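A quick simulation of that fix (a Python/numpy sketch; since coordinates are independent, the per-coordinate variance tells the whole story, and in 1000 dimensions the radius makes the collapse visible):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = rng.standard_normal((2, 200_000))

mid = (a + b) / 2             # midpoint: variance collapses to 1/2
scaled = (a + b) / np.sqrt(2)  # rescaled: variance 1, matches the target
print(mid.var(), scaled.var())

# In 1000 dimensions the midpoint falls far inside the typical shell,
# while the rescaled point stays on it.
x, y = rng.standard_normal((2, 1000))
r_mid = np.linalg.norm((x + y) / 2)           # ~ sqrt(1000/2) ~ 22.4
r_fix = np.linalg.norm((x + y) / np.sqrt(2))  # ~ sqrt(1000)   ~ 31.6
print(r_mid, r_fix)
```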

Maybe ensemble algorithms have issues, but I don’t think that blog post provided any indication of it.

Of course, the sense in which the mode is excluded from the typical set is also relevant—its log density is atypically high. If you take lots of draws from the posterior, their log density will be bounded pretty far away from the mode’s log density. Again, nothing needs to exclude the mode—draws near it in log density just don’t come up.

I do fully understand that when you say the 95% HDR region you are talking about what ojm and I were calling the high probability region. I guess the point is that whether you use the “typical set” or the 95% HDR region, essentially all the points are *in both*, so the distinction mainly occurs in regions that don’t actually matter, like, in my example, the ball of radius less than 3.7…

In the more general high-dimensional posterior geometry, defining the 95% HDR region as “the smallest volume set containing 95% of the probability” is fine, but constructing it is basically impossible; in other words, you can’t compute with it…. But you *can* compute with the typical set, because it’s “the set where the density is essentially constant and equal to exp(lp*) ± epsilon, for some particular lp* and epsilon sufficient to get you 95% of the mass”.

HMC is naturally attracted to this set because it does correct unbiased sampling, and essentially all the samples are in this set (which is typically near the boundary of the 95% HPD set).

But this holds only for the second example, not for the first.

That’s the HD.

> *and* high total probability

That’s the 95%.

> not just “the region where the probability density function is above some threshold”.

More precisely, the region where the probability density function is above the threshold that makes the probability mass contained in the region equal to 95%.

I’m not sure what you find so difficult about the concept of the HDR. It is the natural extension of the one-dimensional HDI. I’m sure you know and understand HPD intervals (the shortest credible intervals). How is this different?

In practice, fitting models in Stan, my complicated models often need very small timesteps, and this is because they have to follow a complicated curved “surface” in N dimensions, a directed trajectory “along the eggshell”.

If we took the region where the probability density in the 1000-dimensional space is above, say, 0.001 times the peak density… which is a pure density-based definition of a set… then this set is the ball of radius less than around 3.7

and this ball is inside your 95% High Probability ball around 0 but *does not intersect the typical set*, and this ball has essentially zero points in it precisely because it doesn’t intersect the typical set.

my calculation:

dnorm(3.7)/dnorm(0)

[1] 0.001064766

pchisq(3.7^2,1000,log.p=TRUE)

[1] -1656.402

so the total probability inside the set where the density is greater than 0.001 times the peak density is exp(-1656)
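The same arithmetic in Python, for anyone without R at hand (a sketch of mine; exp(-1656) underflows double precision, so the chi-squared left tail has to be computed in log space via the lower-incomplete-gamma series, which is essentially what pchisq(…, log.p=TRUE) gives you):

```python
import math

def log_chi2_cdf(x, df, terms=200):
    # log P(X <= x) for X ~ chi^2_df, via the series for the regularized
    # lower incomplete gamma: P(a,z) = z^a e^-z / Gamma(a+1) * sum_n z^n / prod(a+k)
    a, z = df / 2.0, x / 2.0
    s, term = 1.0, 1.0
    for n in range(1, terms):
        term *= z / (a + n)
        s += term
        if term < 1e-18:
            break
    return a * math.log(z) - z - math.lgamma(a + 1) + math.log(s)

# density at r=3.7 relative to the mode: dnorm(3.7)/dnorm(0)
ratio = math.exp(-3.7**2 / 2)      # ~ 0.001065
# log probability mass of the ball of radius 3.7 in 1000 dimensions
logp = log_chi2_cdf(3.7**2, 1000)  # ~ -1656.4, i.e. exp(-1656)
print(ratio, logp)
```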

The 95% HDR *is* the ball of radius 32.8 (assuming your calculation is correct).

The 95% HDR *is not* the shell between 30.8 and 33.0 (obviously, because the probability density is higher for the points in the inner “hole” than for the points in the shell).

sqrt(qchisq(.125,1000)), sqrt(qchisq(.975,1000))

which gives the shell with radius between 30.8 and 33.008

But ojm’s set is the ball of radius less than

sqrt(qchisq(.95,1000)) = 32.7823

so it includes the whole region from radius 0 to 32.7823, and this volume is slightly smaller than the volume between 30.8 and 33.008

but in both cases, almost all the points are in the range r = 31 to 32 and essentially none of the points are inside radius 28

sqrt(qchisq(1e-6,1000))

[1] 28.31297

vec = rnorm(1000)

sum(sapply(vec, function(x) dnorm(x, log=TRUE)))

and get

-1411.434

and sqrt(sum(vec*vec)) = 31.38

now I investigate the density near the mode:

I do sum(sapply(rep(0,1000),function(x) dnorm(x,log=TRUE)))

and get

-918.9385

so the density near my randomly chosen point, which is in the typical set, is a factor of exp(-492.5) smaller than the density at the mode. So I don’t see how we can call that a high *density*.

On the other hand, the probability, which involves integrating density × volume, is very large for shells of radius 30–35 because the volume of those shells is enormous, not because the density is anything other than 0 to 200 decimal places.
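The same comparison in pure Python (a sketch of mine: the mode’s log density is −(d/2)·log 2π, and a random draw sits about d/2 nats below it, with fluctuations of order sqrt(d/2)):

```python
import math
import random

d = 1000
log2pi = math.log(2 * math.pi)

# log density at the mode of a d-dimensional standard normal
lp_mode = -d / 2 * log2pi  # ~ -918.94

# log density of one random draw: -(d/2) log(2pi) - (1/2) sum x_i^2
xs = [random.gauss(0.0, 1.0) for _ in range(d)]
lp_draw = lp_mode - 0.5 * sum(x * x for x in xs)

# expected gap is d/2 = 500 nats, i.e. a density factor of exp(-500)
print(lp_mode, lp_draw, lp_mode - lp_draw)
```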

Are you sure this bias exists? As the number of dimensions increases, the probability that the segment connecting two random points from a hypersphere passes near the origin vanishes.

Yes, but the point of the original post that brought ojm’s concern to the forefront was that people were proposing algorithms, such as sampling along a line between two points, which did bias things towards moving outside the typical set (and, in the unit normal high-dimensional case, towards the mode). The Metropolis–Hastings correction required to get unbiased samples caused the algorithms to devolve to tiny-step-size diffusions along the “surfaces” defined by the typical set, and hence they were unhelpful.

No, that’s not true. In the spherically symmetric normal case, the highest density region is around the mode at x=0, say within radius 1 or so. But in a 1000-dimensional normal, the radius of a randomly selected point has a chi distribution with 1000 degrees of freedom, and in that distribution 99% of the points are outside radius 30 or so.

so points are not near the “high density” region, but they are in the “high probability” region, because the “high probability” region includes everything from radius 0 out to, say, radius 36. However, almost all the points are at radius 30–36, not anywhere “near” radius 0.
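A quick Monte Carlo check of those numbers (a Python/numpy sketch of mine):

```python
import numpy as np

d = 1000
rng = np.random.default_rng(3)
# radii of draws from a d-dimensional standard normal (chi distribution, d dof)
radii = np.linalg.norm(rng.standard_normal((5_000, d)), axis=1)

frac_outside_30 = (radii > 30).mean()
print(radii.mean(), frac_outside_30)  # mean radius ~ 31.6, fraction ~ 0.99
```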

Every one of them will also be in the High Density Region portion, virtually guaranteed. And if you calculate how many points you can expect to miss (which we agree will be a very small quantity), the typical set will get worse coverage than the HDR, by construction.

> So if we have an algorithm that tends to bias us towards moving in the direction of the mode, to get a correct sample we need to actually *undo this bias* or we will over-sample the mode.

If we had an algorithm that tended to bias us towards moving far from the mode we would also need to undo that bias to get a correct sample. Using unbiased algorithms is probably better :-)

The fraction of points which are in the high density region is also essentially all of them, so if there is a good reason to prefer the typical set it may be somewhere else :-)

For the case considered in your PS, the problem is also easy to see in terms of high density regions. The HDR of the proposal covers only a small part of the HDR of the target, so it is not surprising that importance sampling behaves badly.

I wasn’t intending that. Suppose we start with the typical set and its volume is V.

Now, we want to compare it to the high probability set of *equal volume*. Let’s also just for ease of discussion stick to the independent symmetric normal…

So, the typical set is basically a shell at radius r with half-width w.

The high probability set is the entire ball of radius r+w-epsilon, where we choose epsilon so that the volume we add inside the ball equals the volume removed from the exterior “skin”, retaining the constant volume V. Epsilon is a *very small* number, because there is almost no volume inside the ball.

Now the two sets have the same volume by construction, and the total probability is basically 1 for both of them to zeroth order; to first order, the HP set has an epsilon of higher probability. Still, if we randomly sprinkle points inside the HP set according to the probability mass, *essentially all of those points are in the portion of the HP set that intersects the original typical set*.

If we take the sprinkled points and select from them uniformly at random, with a reasonable finite sample size, say N=100000 then *every one of them will be in the Typical Set portion* virtually guaranteed.

So if we have an algorithm that tends to bias us towards moving in the direction of the mode, to get a correct sample we need to actually *undo this bias* or we will over-sample the mode.
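That “virtually guaranteed” claim can be checked directly by simulation (a Python/numpy sketch of mine; even the minimum radius over thousands of draws stays far outside the added inner ball):

```python
import numpy as np

d = 1000
rng = np.random.default_rng(4)
# "sprinkle points according to the probability mass" = just draw samples
radii = np.linalg.norm(rng.standard_normal((5_000, d)), axis=1)

# Every draw lands in the typical-set shell; none fall in the inner ball
# (radius < 28.3 has probability ~1e-6 per draw, per qchisq(1e-6,1000) above).
print(radii.min(), radii.max())
```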

But, due to the probability mass in the vicinity of the mode being nearly zero, the appropriate number of times to sample near the mode in any reasonable sample size… like thousands to millions of samples… is *zero*.

whereas the appropriate number of times to have samples whose log(p(x)) = lp* is basically N, the number of samples you’re drawing. Sure, any *given* point in the typical set shouldn’t appear, but the fraction of points which are in the typical set should be essentially all of them.

Where it is stated

> Why ensemble methods fail: Executive summary

1. We want to draw a sample from the typical set

2. The typical set is a thin shell at a fixed radius from the mode in a multivariate normal

3. Interpolating or extrapolating two points in this shell is unlikely to fall in this shell

4. The only steps that get accepted will be near one of the starting points

5. The samplers devolve to a random walk with poorly biased choice of direction

Without disagreeing with 2-5, I question the premise 1.

I think one reason I push back is a sense (there and elsewhere) of an emphasis on ‘wow counterintuitive result – the mode is bad’. But this is not so counterintuitive in the sense that the mode is no worse than any other single point. If the mode is included the appropriate number of times, along with other points being included an appropriate number of times, it’s fine. So it seems disingenuous to use that as premise one in criticisms of other methods – even if there are completely legitimate other criticisms to make!

As soon as you go to the second part of what you are saying, you are dropping the constraint of comparing sets of fixed volume. Whenever you fix either total probability or volume, the high probability set is natural. Every argument for the typical set after this drops the fixed constraint and hence makes comparisons somewhat arbitrary. I would like to see a single example where the volume or probability is held fixed that makes the typical set more natural.

Of course the volume of a shell of given radius is larger in high dimensions than the small ball around the mode. But that’s not the point. (This is where eg Bob jumps in and says ‘that’s the point’ – but no, it is not my point.

Then Dan jumps in and says ‘well what is your point?’. Then I say something snarky and we unfollow each other on Twitter, and all is well with the world).

In high dimensions, the HP set is basically equal to the typical set, plus a tiny volume near the mode, minus a minuscule wafer-thin shell away from the mode. The Euclidean distance to the mode from any point in the typical set can be quite large, and yet it doesn’t matter: because the volume near the mode is so small, we can go “deep” into this “modal tail” without adding much probability, hence we only need to shave a razor-thin shell off the “outside”.

“most of the draws from the proposal are away from the typical set of the target!”

to

“most of the draws from the proposal are away from the high probability region of the target!”

ojm wrote:

> The mode is in the high probability set and so isn’t inherently ‘bad’ as seems to be the implication of some of Bob’s posts etc.

An important part of my blog post was that getting into the asymptotic regime where the bounded-ratio benefits kick in requires draws from the proposal distribution which are close to the mode, say inside a sphere with radius 5. I mentioned that the probability mass close to the mode, inside a sphere with radius approximately 21.5, is less than one over the number of atoms in the observable universe. It seems unlikely to me that with current computers we would get draws inside this sphere (r ≈ 21.5) in my lifetime, so I would call that an untypical set. If the high probability set includes (in probability, epsilon, etc.) the inside of that sphere (r ≈ 21.5) and the typical set does not, then I think the typical set is more useful for describing what happens in the finite case. For the theoretical (but completely impractical) asymptotic case I might be fine with the high probability set, too.

But from a theoretical perspective, I agree with you that the mode is not “to be avoided”; it’s just not very helpful for calculating expectations, because there’s a LOT of mass elsewhere you need to take into account to get good expectation calculations.

In the Stan case, it’s very easy to see, practically speaking, that once the HMC process has settled into stationarity, all the samples are within epsilon of having the same value of log(p(x)). Basically, all the probability mass is in this constant-lp region. It’s also all in the HPD region as well; it’s just that these sets overlap on a set of nearly measure 1, and the regions where they don’t overlap have very, very small total measure, either because the volume is small (near the mode) or because the density falls off dramatically quickly (towards the “tail”).

It seems to me that the reason the Stan crew focuses on the typical set (as defined by near constant log(p)) is that log(p(x)) is something they’re calculating all the time, and the samples all have nearly the same value, and this just happens because of the geometry of high dimensions, not because they have to “avoid” the mode intentionally. When you’re calculating a thing all the time, and using it to determine whether your algorithm is working properly, it’s more obvious to focus on that aspect.

Basically, if you see a bunch of samples that come from a region with dramatically different log(p(x)) it’s an indicator that either you haven’t converged to stationarity of the sampler, or there’s a bug in the code. You shouldn’t ever see a sample near the mode, simply because the volume of the region near the mode is so small you effectively can’t hit it, because you’re starting each iteration with a random velocity vector. Similarly, you shouldn’t see samples with lp much less (more negative) than the typical set value because even though the volume of space with values like that is effectively infinite, the “energy” required to get there is prohibitively large.

So, I think the fact that nearly all the probability mass is in the typical set, combined with the fact that Stan explicitly calculates log(p(x)) at every iteration and spits it out in the output… and the fact that naturally through correct sampling, in high dimensions, essentially every sample has a constant value for log(p(x))… leads to a focus on that set. Rather than any other fundamental reason.
