This:

https://projecteuclid.org/euclid.aoap/1034625254

?

I’m having trouble tracking down “Andy and Gareth and Gilks’ incredibly important paper”. Mind sharing a link / more complete reference?

Anyway, I’m probably way off on my guess but was hoping this might help the next person who likely lands here from a search on “ensemble methods” and “curse of dimensionality”.

3 Typical Sets / A Continuous Example of Typicality

“The fact that typical sets exist is why we are able to compute posterior expectations using Monte Carlo methods—the draws from the posterior distribution we make with Monte Carlo methods will almost all fall in the typical set and we know that will be sufficient.”

How could the typical set not exist? Take, for example, the N-dimensional distribution that is uniform on the hypercube [0, 1]^N. The typical set will be, I think, trivially equal to the whole support. Maybe with your definition the typical set doesn’t exist in this case, but Monte Carlo methods will work just the same.

“We saw in the first plot in section 4”

Was the order of the sections modified?

3 Typical Sets / Definition of the Typical Set

“The formal definition of typicality is for sequences x1,…,xN

of i.i.d. draws with probability function p(x) [……] assuming the Xn are independent and each has probability function p(x).”

This may be applicable for the standard multivariate normal, but not for general probability distributions.

3 The Normal Distribution

There is something wrong with section numbers.

4 Vectors of Random Unit Normals / Concentration of measure

How is this relevant in general? In this example, there is isotropy and we calculate the expectation of the distance to the center. All the points with the same likelihood are equivalent and there is a concentration of this value around sqrt(n) as n goes to infinity. And if we look at the value of the likelihood, the points with the same likelihood are obviously equivalent so we see again the concentration at the corresponding value.

For more general functions of the parameter vector, points in the parameter space with similar likelihood do not correspond to similar values. For example, that is the case for the Bayesian parameter estimate mentioned in a previous section.

Maybe the point can be made that for an N-dimensional distribution the typical set (defined as a band of constant likelihood) will be concentrated as N goes to infinity on a (N-1)-dimensional subspace. But I don’t really see any practical consequence.

4 Vectors of Random Unit Normals / No individual is average

“the average member of a population is an outlier.”

True when the metric we look at is ‘distance to the average’.

“How could an item with every feature being average be unusual? Precisely because it is unusual to have so many features that close to average.”

You don’t even need many features. Let’s say the height in a population is distributed as Normal(170,15). The average absolute difference between an individual’s height and 170 is 15. For someone measuring 170 the deviation from the mean is indeed unusually low at 0 (two-sided p-value 0.05). [I’ve not checked the numbers, the details could be wrong]
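Since the commenter flags that the numbers are unchecked, here is a quick check using only the standard library (assuming the 15 in Normal(170, 15) is a standard deviation): the mean absolute deviation comes out near 12 rather than 15, and being within roughly 1 cm of the mean is about a 5% event, which is the spirit of the “two-sided p-value 0.05” remark.

```python
import math

mu, sigma = 170.0, 15.0

def Phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Mean absolute deviation of a normal is sigma * sqrt(2/pi), ~11.97 here.
mad = sigma * math.sqrt(2.0 / math.pi)

def p_within(d):
    # Probability of landing within d cm of the mean.
    return Phi(d / sigma) - Phi(-d / sigma)

print(round(mad, 2))             # ~11.97
print(round(p_within(0.94), 3))  # ~0.05: only ~5% are within ~1 cm of average
```

So the qualitative point stands even with the corrected constants: being very close to average is itself unusual.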

“As we saw above, the average member of a population might be located at the mode, whereas the average distance of a population member to the mode is much greater.”

What is interesting about the average distance to the mode in general?

You wrote: ” Goodman even implicitly acknowledged that in a paper, https://arxiv.org/abs/1509.02230, although he seemed reticent to acknowledge that in person.” It seems like you confused Jonathan Goodman here (co-author of the original Ensemble Samplers with Affine Invariance) with Jesse Goodman (co-author of Properties of the Affine Invariant Ensemble Sampler in high dimensions).

There’s a Jacobian adjustment in the Goodman and Weare Metropolis step—otherwise it wouldn’t work. My point was just that the only proposals that are going to be accepted are ones that are near one of the starting points [x(t) or xref in your notation]. Anything too far in between or beyond the two points will just get rejected. So you can tune the proposal all you want, but it better propose points near where they started, or they’re going to get rejected.

Similarly, in differential evolution (the other paper cited), which uses two other points and moves a third along the direction between the other two, you won’t be able to move very far without leaving the typical set.

Both of these approaches will let you take small steps that adjust for global covariance, but most of the interesting problems, like hierarchical models, have varying curvature (Hessian and hence covariance changes with different parameter values). This only helps after some degree of convergence when the points are arrayed with their posterior covariance.

P.S. We’ve never found a way to make jittering useful for HMC. If the steps get too small, we wind up taking lots of steps in order not to fall back to a random walk.

y = xref + Z * (x(t) - xref)

where xref is a reference walker and Z has PDF g(z) = c/sqrt(z) for z in [1/a, a] and a > 1. In the simulation above, Z is replaced by 0.5 every time, which is not how the ensemble sampler does it. It would be interesting instead to repeat the simulation with Z drawn from g and then check whether the proposal values cover the true values or not.

Typically a is taken to be 2, but in principle it is a tuning parameter. So we could repeat the same experiment for a = 2, 1.8, 1.5, 1.2, etc. Just like stepsize needs to be adapted for HMC, my guess is that adapting a to the particular problem at hand will give better results. Similarly, just like stepsize is jittered, parameter a could be randomly jittered as well.
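A sketch of that experiment (the walker positions, seed, and the ~[9.5, 10.5] typical-set band for N(0, I_100) are my assumptions, following the blog’s example): draw Z from g by inverse-CDF sampling and see how often stretch-move proposals land in the band.

```python
import math, random

random.seed(0)
N, a, n_prop = 100, 2.0, 10_000

def sample_z(a):
    # Inverse-CDF draw from g(z) proportional to 1/sqrt(z) on [1/a, a]:
    # z = ((a - 1) * u + 1)^2 / a with u ~ Uniform(0, 1).
    u = random.random()
    return ((a - 1.0) * u + 1.0) ** 2 / a

in_band = 0
for _ in range(n_prop):
    x = [random.gauss(0.0, 1.0) for _ in range(N)]     # current walker
    xref = [random.gauss(0.0, 1.0) for _ in range(N)]  # reference walker
    z = sample_z(a)
    # Stretch-move proposal y = xref + Z * (x - xref).
    y = [xr + z * (xi - xr) for xi, xr in zip(x, xref)]
    r = math.sqrt(sum(v * v for v in y))
    # Count proposals whose radius lands in the typical-set band.
    if 9.5 < r < 10.5:
        in_band += 1

print(in_band / n_prop)
```

With a = 2 only a small fraction of proposals land in the band (the ones with Z near 1), which is why a smaller, well-adapted a plausibly helps; repeating this for a = 1.8, 1.5, 1.2 is a one-line change.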

I think the 38% he quoted is not about the proposals being accepted, but about them falling within the typical set. A proposed transition between two points in the typical set is not automatically accepted, it depends on the relative probabilities (and this ratio can be a very small number).

I was also confused by this issue, due in part to this quote from the “conceptual introduction” paper (fig 10): “Smaller proposal variances stay within the typical set and hence are accepted”, which seems to suggest that if the proposal is within the typical set it will be accepted.

Personally, all my comments were based on probability (measure), not density. And on the basic definitions of typical vs high probability (not high density!) sets.

I was getting confused because I was thinking the top plot shows the log-density of proposals (and since they’re higher than in the typical set, they’d automatically be accepted). But that’s not how plain jane DE-MCMC works, which I ought to have remembered since I coded it up in Matlab one time. I was right the first time: as dimensionality increases DE-MCMC is going to have a hard time finding the direction that points up the log-density.

The proposition ‘”DE-MCMC proposal was accepted” implies “proposal was close to one of the original points”‘ isn’t true. (I’m not asserting that this was your claim.) DE-MCMC picks two points at random from the ensemble, takes their difference, scales it, and then the proposal is current state + scaled difference vector. So if the two points are far apart and one of those points is close to the current state then the proposal could be deep into the region of high density interior to the typical set. It’s just that in high dimensions this situation will be very rare (which I believe is equivalent to what you’re saying).

We can’t actually cover the posterior in high dimensions. Think of it this way: there are 4 sign possibilities in 2 dimensions, 8 in 3 dimensions, and in general $latex 2^N$ in $latex N$ dimensions. No way to even get a draw in each quadrant in 100 dimensions. At most, we can hope to compute expectations. And the marginal distributions in fewer dimensions can have coverage, just not jointly.

So yes, we do want to take posterior draws from $latex p(\theta \, | \, y)$. But no, we can’t cover the posterior by so doing.

And when we take draws from the posterior, they pretty much all fall in the typical set by definition.

Didn’t see this earlier, but I can answer now.

1. Yes, you can choose any set with 1 – epsilon probability. If you look at the volume around the mode, it’s minuscule in high dimensions, so even if you choose your probability 1 – epsilon set to include the mode, you won’t see any draws from it. It’s not like anyone’s excluding the mode or pushing things toward the typical set. It’s just how the points get drawn. I’m about to roll out a case study on all this with more examples, which is why I’m revisiting the blog post.

2. No, proposals near the mode will *not* be accepted with high probability. If you follow the Goodman and Weare paper, you’ll see they do a Jacobian adjustment (for change in volume). The problem everyone’s having is reasoning in terms of densities rather than in terms of measure. Densities only matter under integral signs to compute measures. Only the mass matters, that is, the density integrated over volume. The point is that the volume around the mode is so low that having a high density can’t compensate and the overall mass remains small.

3. I’m afraid this is based on a false presupposition. See (2). And seriously, if you don’t believe the math, try it computationally!

Pretty much *all* the draws should be in the typical set. If you’re accepting at a 38% rate in 100 dimensions, my guess is that you’re not moving very far from the points you started with. Can you measure how far the new draws are away from the closest particle in the set from which you’re interpolating/extrapolating? My point wasn’t that these methods wouldn’t draw from the typical set, but rather that they’ll be slow to explore the typical set.
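The point in (2) can be checked numerically with nothing but the standard library. The regularized incomplete-gamma series below is my own rough implementation (an assumption of this sketch, not anything from the thread); it gives the chi-square(100) CDF, i.e. the probability mass of a standard 100-dimensional normal near the mode versus in a thin shell at the typical radius.

```python
import math

def reg_lower_gamma(s, x, iters=1000):
    # Regularized lower incomplete gamma P(s, x) via its power series;
    # adequate for this illustration, not a production implementation.
    if x <= 0:
        return 0.0
    log_pref = s * math.log(x) - x - math.lgamma(s + 1.0)
    term, total = 1.0, 1.0
    for k in range(1, iters):
        term *= x / (s + k)
        total += term
        if term < 1e-17 * total:
            break
    return math.exp(log_pref) * total

def chi2_cdf(x, df):
    # P(||X||^2 <= x) for X ~ N(0, I_df).
    return reg_lower_gamma(df / 2.0, x / 2.0)

# Mass of N(0, I_100) within radius 3 of the mode (where the density is
# still at least exp(-4.5), about 1% of its maximum) vs. the 9.5-10.5 shell.
p_ball = chi2_cdf(3.0 ** 2, 100)
p_shell = chi2_cdf(10.5 ** 2, 100) - chi2_cdf(9.5 ** 2, 100)
print(p_ball)    # ~1e-34: essentially no draws ever land near the mode
print(p_shell)   # ~0.5: about half of all draws land in this thin shell
```

High density times vanishing volume gives vanishing mass, which is the whole argument in one number.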

This is what the Metropolis adjustment does. It’s actually dealing with volume in a subtle way to do the required Jacobian adjustment. The basic problem, as I keep stressing, is that interpolating or extrapolating among points drawn from the posterior is unlikely to produce another point that looks like a point drawn from the posterior at random (and hence not in the typical set).

That’s only true after the Metropolis adjustment (if I’m recalling DE correctly). What’s going to happen is that you’ll make a proposal, and unless it’s close to one of the original points, it will be rejected. Again, interpolations and extrapolations among points in the typical set are unlikely to fall in the typical set, hence they’ll get rejected by Metropolis because keeping them would lead to biased draws.

The point is that interpolating (or extrapolating) among random points drawn from the typical set is unlikely to produce a point that’s also in the typical set. That is, you won’t produce a point drawn from the posterior, which means that the proposals will tend to get rejected unless they stay close to one of the points in the interpolation set.

I just ran across this preprint, “Probabilistic Path Hamiltonian Monte Carlo” (http://arxiv.org/abs/1702.07814). May be relevant?

My intuition is leading me astray somehow — I was reasoning from the theorem that if you start with independent samples from the target distribution and step one of them through a DE-MCMC step, the resulting state has the target distribution as its marginal distribution.

Side point – has anyone looked at giving a notion of volume to the individual walkers in ensemble methods and then incorporating volume exclusion effects in the updates? Presumably this could help avoid crowding around the mode (but this is really just a naive thought).

This may in fact be the difference between “asymptotically for long time” and “in actual terms before I get tired of running the algorithm”

In fact the time to generate independent samples doesn’t depend on the dimension, so now I’m not sure there is a problem at all… At least in this example, if we ignore the burn in (and assume the ensemble is large enough) N=100 doesn’t seem to be problematic. What are we missing?

I understood the original post as making a completely opposite point: that ensemble methods, or specifically the Goodman and Weare affine method, are too good at finding the mode and therefore inefficient at sampling from a distribution where most of the density lives reasonably far away from the mode.

…both HMC and DE (and the rest) are bound to draw samples from the typical set with exactly the same frequency. I think I must have missed something here. Perhaps your point #1 is intended to relate to shorter subsections of the chain, where the large-number limits don’t apply?

The idea here is that due to DE’s random-walk-like behaviour the effective sample size for a given amount of computation will be much smaller than that of HMC.

However, proposals near the mode will be accepted with high probability.

The idea here is that in high dimensions a random walk, even one that always accepts moves up the density, will have a hard time finding the region near the mode — volumetrically it’s a needle in a haystack. If you start near the mode, the random walk will wander away from it and eventually hit the typical set and then stay in that general vicinity.

By the way, the same argument could be done directly in terms of the volume of the high probability region and the volume accessible at each step, instead of looking at the typical hypersurface.

I think the problem is that the points generated are relatively close. In your example, the distance between the base points and the proposals is around 3.5 and very rarely above 4.5. This is almost half the radius of the hypersphere, so we could say it’s of the same order of magnitude. But it seems minuscule if we compare the area “covered” in a single step with the total space, so it will take forever to explore it. The ratio of the hypersurface of the hypersphere (taking conservatively R=9) to the area of the 99-dimensional “hyper circle” on the hypersurface (taking R=4.5, in the high end of the sampled values and ignoring that the tangential distance will be lower) is around 10^30 (and 10^40 may be a better estimate).
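The order-of-magnitude claim at the end can be checked with log-gamma formulas. The flat-disc approximation to the reachable patch is my assumption, matching the comment’s set-up (100 dimensions, sphere radius R = 9, patch radius 4.5):

```python
import math

def log_sphere_area(n, R):
    # log surface area of the (n-1)-sphere of radius R in R^n:
    # S_n(R) = 2 * pi^(n/2) * R^(n-1) / Gamma(n/2)
    return (math.log(2.0) + (n / 2.0) * math.log(math.pi)
            + (n - 1) * math.log(R) - math.lgamma(n / 2.0))

def log_ball_volume(n, r):
    # log volume of the n-ball of radius r:
    # V_n(r) = pi^(n/2) * r^n / Gamma(n/2 + 1)
    return ((n / 2.0) * math.log(math.pi) + n * math.log(r)
            - math.lgamma(n / 2.0 + 1.0))

# Whole hypersurface in 100 dimensions (R = 9) vs. a flat 99-dimensional
# disc of radius 4.5 standing in for the patch reachable in one step.
log_ratio = log_sphere_area(100, 9.0) - log_ball_volume(99, 4.5)
print(log_ratio / math.log(10.0))  # ~31, i.e. a ratio of about 10^31
```

So the comment’s “around 10^30” is about right (the disc approximation ignores curvature, so treat this as an order of magnitude, not an exact count).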

>>>The author of those class notes wrote a paper on the topic (just saw the intro—didn’t read it):<<<

This last bit has been a pet peeve of mine. At least in Engineering, I sensed a tension between teaching students (esp. undergrads) what serves *students’* interests best, versus teaching topics that the Prof. is most comfortable / familiar with (e.g. has published a paper on recently or has a grant for).

I knew of at least two Profs. who filled undergrad courses with significant chunks of material closely aligned with their own work but of very marginal utility to the typical undergrad. The opportunity cost is the crowding out of other (more useful) material.

I guess utility is always subjective, but I wish more Profs. kept this tension in mind when choosing teaching material / courses.

+ 1. I will add that for some smaller-scale problems, I still get a lot out of plotting prior vs posterior densities using draws from Stan. Reading Betancourt’s stuff has made me appreciate though that the only *mathematically well-defined* operation to do with these densities is integration (i.e. compute expectations).

I really like this blog. Thank you for contributing to scientific discourse in this way – it’s important, and much appreciated.

I have assigned the above blog entry as a reading for the journal club that I run, on the suggestion of one of the PhD students who attends the club. We’ll discuss this next week, but I had a couple of issues I’d like to be clearer on before then.

Starting with the “executive summary” of five points in your blog, I wonder if you would mind elaborating on the following:

1. Your point #1 is that “we want to draw a sample from the typical set”. The difficulty I have understanding this is that any Markov chain which has reached its stationary state is obliged to draw samples from the typical set with probability proportional to the mass in that set (by definition of the chain). Obviously, the chain will also draw samples from outside the typical set (e.g. from the mode) with probability proportional to the mass in that subset. So, both HMC and DE (and the rest) are bound to draw samples from the typical set with exactly the same frequency. I think I must have missed something here. Perhaps your point #1 is intended to relate to shorter subsections of the chain, where the large-number limits don’t apply?

2. Your point #4 is that “the only steps that get accepted will be near one of the starting points”. This seems to restrict attention to proposals in the typical set. However, proposals near the mode will be accepted with high probability. These proposals are also generated with high probability by particle-based methods, which means that many proposals *other* than those near the starting point will be accepted.

3. If we take my point #2, that lots of proposals are generated and accepted outside of the typical set (e.g. in the region of the mode), then I can’t quite see how the sampler will devolve to a random walk (your point #5). Could you perhaps elaborate?

Given these issues, I investigated the properties of proposals generated by Ter Braak’s differential evolution, using simulation. I adopted your blog’s set-up, with a K=100 dimensional multivariate normal, with identity covariance matrix and zero mean. For this, the distance of samples from the mode follows (the square root of) a chi-squared distribution, with df=K. Suppose we define the typical set as the inter-quartile range. For K=100, that typical set is the range between about 9.5 and 10.5. Just as Betancourt points out, this typical set is quite narrow, and also is well away from zero.

Now, by definition of the inter-quartile range, 50% of samples from the target distribution fall in this typical set. So I wondered what the corresponding probability was, for proposals generated by Ter Braak’s differential evolution. I used the setting of lambda which has become standard in many applications: 2.38/sqrt(K).

In simulations, about 38% of these proposals fall in the target density’s typical set. That seems pretty good, right? Not that much smaller than for samples directly from the target distribution. I’ve put code for the simulations below, in case someone can find a mistake there.

I think I must have missed something crucial about the points in your blog post, and those that Betancourt made in his talk. Perhaps the problems being discussed, with particle methods, were about the probability of a particle method finding the typical set from bad start points? My simulations don’t address that, and maybe that’s what you were meaning.

Any pointers would be much appreciated.

Scott.

### SOME R CODE FOR THE ABOVE ###

K = 100
n = 1e4

# Generate some samples from the target distribution.
s1 = array(rnorm(K * n), dim = c(K, n))

# The distance of each sample from the mode is:
d1 = apply(s1^2, 2, sum)^0.5

# Define the IQR as the "typical set".
IQR = qchisq(p = c(.25, .75), df = K)^0.5
# For K=100, that is about (9.5, 10.5).

# Marginal probability of a sample falling in the typical set is 0.5, by
# definition of the IQR. Now let's see what it is for DE proposals.
lambda = 2.38 / sqrt(K)                  # Standard setting.
s2 = array(rnorm(K * n), dim = c(K, n))  # More samples.
s3 = array(rnorm(K * n), dim = c(K, n))  # Again more.
p = s1 + lambda * (s2 - s3)              # Differential evolution proposals.
dp = apply(p^2, 2, sum)^0.5              # Distance of proposals from mode.

# What proportion of DE proposals falls inside the typical set of the
# target distribution?
in.typical.set = (dp > IQR[1]) & (dp < IQR[2])
print(mean(in.typical.set))

# Visualise how the distribution of distance-from-mode differs between
# the target distribution and the DE proposals.
plot(density(d1), main = "Distance of samples from mode.")
lines(density(dp), col = "blue")
abline(v = IQR, col = "red")
legend(x = "topleft", legend = c("Target Dist", "DE Proposals"),
       lty = 1, col = c("black", "blue"))

# It looks like the proposals from DE cover the typical set pretty well.

@angus

As far as I am aware, in all/most of these large-scale Bayesian inference problems it is assumed that all/most questions are formulated in terms of computing the posterior expectation of some function. Essentially, one is looking to extract information in the form of functionals of the posterior – these could be the mean, median, variance etc of the posterior, or the expectation of some other function.

Thus one doesn’t look at the (high-dimensional) posterior itself, rather a collection of ‘properties’ of the posterior defined in terms of expectations.

Angus, I don’t think Daniel’s reply addresses your lack of clarity on MCMC (which doesn’t seem to have anything to do with regions of high density).

If I understand you correctly, you see a tension or contradiction between BDA3 focussing on how we want MCMC to give us (an estimate of) *the posterior distribution* and Michael Betancourt saying that we’re using MCMC to give us *posterior expectations*. That’s why you’re asking if there’s an extra step at the end to transform the expectation back into the distribution.

There are two points of confusion here. They are: (i) the reason why we want (samples that look like samples from) the posterior distribution per BDA3, and (ii) exactly what Stan does to get all those nice expectations and quantiles it computes from its HMC sampler. (i) The reason why we want those (samples that look like) posterior samples is that our end goal is to compute posterior expectations, and the samples give us a good way to do that. (ii) Stan uses the HMC sampler to produce samples and then it uses those samples to compute (well, estimate) the expectations and quantiles. So there’s no extra step to get *back* the posterior distribution — you can just make Stan give you its samples and then you can use them to estimate whatever posterior expectation is of interest to you. Stan automatically reports some canned expectations that are almost always the thing we as statisticians are interested in.
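A toy sketch of point (ii), with a made-up one-dimensional “posterior” standing in for Stan’s draws (nothing here is Stan-specific): once you have samples, every posterior summary is just a plain average or quantile over them.

```python
import random, statistics

random.seed(7)
# Pretend these are posterior draws; here they come from a known N(2, 1)
# "posterior" so we can see the estimates land where they should.
theta = [random.gauss(2.0, 1.0) for _ in range(50_000)]

# Any posterior summary is just an average or quantile over the draws:
mean_est = statistics.mean(theta)                      # E[theta], ~2
second_moment = statistics.mean(t * t for t in theta)  # E[theta^2], ~5
q05 = sorted(theta)[int(0.05 * len(theta))]            # 5% quantile, ~0.36

print(mean_est, second_moment, q05)
```

The same draws serve every expectation you think of later, which is exactly why there is no “extra step” to recover the distribution.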

None of this has anything much to do with typical sets; that’s a red herring relative to the things you weren’t clear about.

Angus: we’re using MCMC to get a set of points which are “as if” sampled independently from the posterior distribution (at least for the purpose of computing expectations, which are of course insensitive to the order in which you sample things).

The typical set gives you ONE way to describe what a sequence of samples “should look like”, namely that the probability of the sequence should be very close to exp(-N*H) where N is the number of samples and H is the entropy of the posterior.
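This can be illustrated with a one-dimensional standard normal (my example, not part of the comment): by the asymptotic equipartition property, the average negative log density of N i.i.d. draws converges to the entropy H, which is the same as saying the sequence probability behaves like exp(-N*H).

```python
import math, random

random.seed(1)
N = 100_000
# Differential entropy of a standard normal: H = 0.5 * log(2 * pi * e).
H = 0.5 * math.log(2.0 * math.pi * math.e)

# AEP: -(1/N) * sum(log p(x_i)) -> H for i.i.d. draws x_i ~ p.
log_p_sum = 0.0
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    log_p_sum += -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

print(-log_p_sum / N, H)  # the two numbers agree to a couple of decimals
```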

Thanks to the way that HMC is designed, the only step you need to take at the end of your run is to go ahead and compute a sample expectation from the sample you get.

It just so happens that in high dimensions, independent random samples from the posterior have the property that they never in a gazillion years include the points right near the highest density location. This is just the math of high dimensional spaces, and HMC, even though it isn’t a process for *independent* random sampling does give you a sample which has this same property. It does so without explicitly “excluding” this region, it just is the case that HMC stays away from this region just like independent random sampling does.

Hope that clarifies things.

In Michael’s Stan Con lecture, he talks about typical sets and expectations. The posterior is in the integral for the expectation. But I thought we were using MCMC to just get the posterior distribution. Going back through Gelman’s BDA3, he says that we want the stationary distribution of the Markov process to be p(theta|y).

Michael argues that points away from the typical set don’t contribute much to the expectation, so we don’t need to sample them to calculate it. But does that mean there is some extra step at the end to take that expectation and get back the posterior distribution? Is this a more optimal way of estimating the posterior?

I should add that there are also implications for Euclidean Hamiltonian Monte Carlo, which in each iteration explores a level set with a fixed Hamiltonian (potential plus kinetic energy, where the potential is the negative log density and the kinetic is random standard normal). With random standard normal kinetic energy in N dimensions, the log density can change at most by a chi-square(N) variate. Michael goes over this in some of his early papers on the principles of HMC. Riemannian Hamiltonian Monte Carlo mitigates this problem, which is why it can explore the funnel density efficiently.
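A quick simulation of the kinetic-energy claim (the N = 100 and draw count are arbitrary choices of mine): with momentum p ~ N(0, I_N), the kinetic energy is chi-square(N)/2, so it concentrates at N/2 with a spread of only sqrt(N/2). Each momentum refresh can therefore move the Hamiltonian only by O(sqrt(N)), not O(N).

```python
import random

random.seed(0)
N, draws = 100, 2000

# Kinetic energy 0.5 * ||p||^2 with p ~ N(0, I_N) is chi-square(N)/2:
# mean N/2, standard deviation sqrt(N/2).
energies = []
for _ in range(draws):
    ke = 0.5 * sum(random.gauss(0.0, 1.0) ** 2 for _ in range(N))
    energies.append(ke)

mean_ke = sum(energies) / draws
print(mean_ke)  # ~N/2 = 50, with relative spread of only ~sqrt(2/N)
```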

If you take independent draws at random from a high-dimensional multivariate normal, they will all fall into the high probability set and all fall into the typical set. But if you look at the max log density you get from 1000 draws (as I plotted above for a 100-dimensional normal), or even 1,000,000 draws, you won’t get a draw anywhere near the log density of the mode. You can use the chi-square inverse CDF to calculate the tail probabilities of (squared) distances to the mode.

This has nothing to do with Stan! It’s just how sampling works in high dimensions because there’s essentially no probability mass around the mode, so you never get draws from near the mode.

There are deeper implications for sampling. In general, if you’re computing an expectation from a Markov chain (or even from independent draws), you can replace the draws with draws that have the same expectation and ideally lower variance. For example, suppose you have a Markov chain and instead of thinning (keeping every 1000th draw or whatever you have to do to get Metropolis output into memory), you average (replace every 1000th draw with its average). The second sequence is not distributed according to the original distribution. It has the same mean, but lower variance. So you can use averages of draws to compute expectations. That’s what Goodman and Weare go over in their paper. You just can’t save these averaged draws and compute expectations for new nonlinear functions of the parameters after the fact—you have to save the averages for functions you’re computing. And you can’t use the averaged draws to compute quantiles in the usual way.
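A minimal sketch of the thinning-versus-averaging point, using independent draws for simplicity (real MCMC draws would be correlated, but the comparison is the same): both sequences estimate the mean without bias, yet the averaged one has far lower variance and is no longer distributed like the target.

```python
import random, statistics

random.seed(42)
# A long chain of (here, independent) draws from the target.
draws = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Thinning: keep every 100th draw.
thinned = draws[::100]
# Averaging: replace each block of 100 draws by its mean.
averaged = [sum(draws[i:i + 100]) / 100 for i in range(0, len(draws), 100)]

# Same mean, very different variance: the averaged sequence is fine for
# estimating E[theta] but useless for quantiles of the target.
print(statistics.mean(thinned), statistics.mean(averaged))
print(statistics.variance(thinned), statistics.variance(averaged))
```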

OK, this plus Bob Carpenter’s note below clears this up for me. Thanks!

As Bob said, the typical set is defined in terms of “those points where the negative log density is within epsilon of the entropy”. Since Stan moves around a hypersurface defined by the negative log density, the idea is that all the Stan samples should have negative log density right about at some constant +- epsilon, and that it should be very different from the log density near the mode, which is also the log density found by optimization. So I think that, in the context of explaining Stan, emphasizing the definition of the typical set in terms of the size of the negative log density could be helpful.

‘This’ meaning whether the subtle distinction could lead to different algorithm design or not. Or have implications for improving rather than discarding ensemble samplers etc. I’m not really sure either way.

This is really the crux of the issue for me too – and I wonder if this has any implications for algorithm design or not.

Well, this discussion ended far more positively than typical internet discussions! ;-)

Thanks for the case study and helpful comments – I wasn’t intending to nitpick, just genuinely interested in understanding the issues at stake.

Now to actually try Stan properly…

I hope you can describe this general definition some time.

The usefulness of the typical set that you describe is also present in the high probability set. If you take a million draws, etc., they will have log densities like those in the high probability set. Does the typical set have any additional advantage apart from making the “skin of the multidimensional orange” point more obvious?

Yes to all three questions.

Yes, there’s a general definition (covering both the continuous and discrete cases, and I’m guessing a measure theorist could generalize).

Yes, you’re right that the typical set isn’t the smallest volume set containing one minus epsilon total probability mass.

Yes, the typical set is a useful concept because it illustrates where random samples actually fall. By that, I mean that if you take a million draws from a 100-dimensional normal and plot their log densities, all of the draws will have log densities like those in the typical set, not like those of the mode. We regularly get the question on the Stan mailing list of why draws from the posterior don’t have log densities anywhere near as good as the max a posteriori estimate and doesn’t that imply Stan is broken because it’s not producing the “best” answer. That’s why I’m writing the case study.

It’s so nice to write “yes” rather than “no”!

Evolution at the micro (genetic) and macro (population) levels. At the cellular level, the coding is apparent—cells are just little thermodynamic computers that run discrete processes with proteins (and micro-RNA and whatnot). You even see a lot of information theory in language evolution models, such as how the sounds of a language code discrete symbols and how changes are likely to happen (for example, irregularities go away over time unless they’re very common words, which is why English still has case in its pronouns and a grab bag of odd auxiliary verb structure that’s person sensitive in a way no other verbs are).

The author of those class notes wrote a paper on the topic (just saw the intro—didn’t read it):

http://octavia.zoology.washington.edu/publications/BergstromAndRosvall09.pdf

Thanks ojm! I thought everyone was confusing density and probability mass because of the focus on the mode. This will really help clarify the case study on high dimensional mass vs. volumes and the typical set I’m writing up.

I went back and read the relevant bits of Cover and Thomas and see that the typical set and highest probability set aren’t the same.

The high probability set (for a given epsilon) is the set with the smallest volume with probability one minus epsilon.

The typical set is defined as the set with one minus epsilon probability centered (in the sense of containing log density level sets plus or minus some value) around the (differential) entropy (the negative expected log density of a draw).

I think we all agree that when we take a random draw from a standard (unit covariance) multivariate normal in high dimensions, we have a 1 – epsilon chance of getting a draw in either set—that’s just the definition! I think we also agree that the draw is astronomically unlikely to be anywhere near the mode, despite the mode being in the high probability set and not the typical set. Of course, the draws are not likely to be anywhere near each other, either.

Of course, in some densities, like uniforms, the typical set covers the whole interval (every point has the same density, so they all have log density exactly equal to the expected log density), and the high probability set isn’t well defined.

Thanks again—this was really helpful.

Carlos gives a good summary below.

> the typical set doesn’t define the smallest volume that contains “practically all” the distribution. You can get a slightly smaller set with the same probability mass by adding the core of the hypersphere and scraping a thin layer from the outer shell. That’s the high probability set

Does this have implications for the sampler discussion? I don’t know. I do know that one of Bob’s premises was that we want to sample from the typical set. He further seemed to emphasise that sampling the mode was inherently bad (e.g. ‘by definition’ of the typical set).

These premises seem technically false to me. Whether that’s enough to save ensemble samplers is a different question.

Ok, if we restrict ourselves to N-dimensional probability distributions generated from a sequence of N i.i.d. random variables we can define a typical set. But I’m not sure if the concept is supposed to be valid for arbitrary probability distributions in N dimensions. And after reading Daniel’s comment I’m not sure if you’re talking purely about the N-dimensional distribution (i.e. the sequence of coordinates) or if there is also a sequence of N-dimensional points involved somehow.

And, as ojm mentioned, the typical set doesn’t define the smallest volume that contains “practically all” the distribution. You can get a slightly smaller set with the same probability mass by adding the core of the hypersphere and scraping a thin layer from the outer shell. That’s the high probability set.

In information theory the typical set (sequences with sample entropy “around” the entropy of the random variable) can be useful to obtain interesting results. Are you defining the typical set here just for illustration or does it have any practical implication?

edit for clarity: by “this” I mean the subtle distinction between ‘high probability set’ and ‘typical set’ that you are discussing.

ojm, is another way of phrasing what you’re getting at, that for an arbitrarily small volume in continuous parameter space, the integral over the density is technically highest at the mode? But then, as Daniel Lakeland says, you want length K sequences (where K is sufficiently large) to actually compute expectations, and then we go back to talking about accumulating probability masses over relevant volumes (which of course grow with dimensionality) and how any good sampler will need to stay in the typical set?

I guess I still don’t see how this relates to Bob Carpenter’s criticism of ensemble samplers, but I’m not sure it’s very important that I understand this, so no worries if we just want to move on ;)

…or alternatively, the zoology dept. is teaching its undergrads stuff of questionable utility. What use is any of this to a typical zoology undergrad? I’m straining to see any application, especially at the level of rigor and theorem-proving in those lecture notes.

For those who still (or ever did) care, the notes referenced above state that the definitions are taken directly from Cover and Thomas. Funnily enough, Cover and Thomas have a good, explicit discussion of high probability sets vs typical sets. The connection is, of course, the AEP, which explains why typical sets are a good approximation to high probability sets.

Ugh…you’re getting at…etc etc
