David MacKay sez . . . 12??

I’ve recently been reading David MacKay’s 2003 book, Information Theory, Inference, and Learning Algorithms. It’s great background for my Bayesian computation class because he has lots of pictures and detailed discussions of the algorithms. (Regular readers of this blog will not be surprised to hear that I hate all the Occam-factor stuff that MacKay talks about, but overall it’s a great book.)

Anyway, I happened to notice the following bit, under the heading, “How many samples are needed?”:

In many problems, we really only need about twelve independent samples from P(x). Imagine that x is an unknown vector such as the amount of corrosion present in each of 10 000 underground pipelines around Cambridge, and φ(x) is the total cost of repairing those pipelines. The distribution P(x) describes the probability of a state x given the tests that have been carried out on some pipelines and the assumptions about the physics of corrosion. The quantity Φ is the expected cost of the repairs. The quantity σ^2 is the variance of the cost — σ measures by how much we should expect the actual cost to differ from the expectation Φ.

Now, how accurately would a manager like to know Φ? I would suggest there is little point in knowing Φ to a precision finer than about σ/3. After all, the true cost is likely to differ by ±σ from Φ. If we obtain R = 12 independent samples from P(x), we can estimate Φ to a precision of σ/√12 – which is smaller than σ/3. So twelve samples suffice.

By “P(x)”, I think he means p(theta|y), by “10 000”, I think he means 10,000, and by “R” I think he means n_{sims} or something like that.

Setting aside the “two cultures separated by a common language” thing, and taking MacKay at his word regarding the priorities of pipeline repair managers, I’m in complete agreement with this reasoning. I’m always telling people they don’t need as many independent draws as they think they need, and I squirm in frustration when I read a paper where the authors proudly announce that they ran 100,000 simulation draws.

My only question is . . . where did the “12” come from in MacKay’s passage? When I first saw the number there, I assumed it would have something to do with the variance of the uniform distribution. But that doesn’t seem to come up at all.

Accepting MacKay’s stipulation that σ/3 would be enough precision in this sort of example, shouldn’t he have said that 9 random draws would suffice? I could see him rounding that up to 10. Or even 16, if he wanted to get all base-2 on us. But why 12? I don’t get this at all.
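
To put numbers on it: the Monte Carlo error of a mean estimated from R independent draws scales as σ/√R, so σ/3 precision needs R = 9, and R = 12 gives σ/3.46. Here is a minimal simulation sketch of that scaling; the normal distribution standing in for p(theta|y) and all the constants are made up, so nothing here is specific to MacKay’s pipeline example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for draws of phi(x) under P(x): any distribution with known sigma will do.
sigma, true_Phi = 1.0, 10.0

def spread_of_estimator(R, reps=100_000):
    """Estimate Phi from R independent draws, repeated many times,
    and report the spread of the resulting estimates."""
    draws = rng.normal(true_Phi, sigma, size=(reps, R))
    return draws.mean(axis=1).std()

for R in (9, 12, 100):
    print(R, spread_of_estimator(R), sigma / np.sqrt(R))
# The empirical spread matches sigma/sqrt(R): about sigma/3 at R = 9 and
# about sigma/3.46 at R = 12, already a bit finer than the sigma/3 target.
```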

Free pdf copy of MacKay’s book to whoever gives the best explanation.

[Update: here’s the answer.]

P.S. The close reader will have noticed that, unlike you-know-who, I actually read before copying and pasting. That’s why I inserted the ^ in σ^2 above—the superscript got lost in the copying from pdf.

P.P.S. On that same page MacKay makes the odd-seeming (to me) remark that the tuning parameters of the Metropolis algorithm “are usually set by trial and error with the rule of thumb being to aim for a rejection frequency of about 0.5. It is not valid to have the width parameters be dynamically updated during the simulation in a way that depends on the history of the simulation. Such a modification of the proposal density would violate the detailed balance condition that guarantees that the Markov chain has the correct invariant distribution.”

This passage surprises me for two reasons. First, didn’t MacKay know about 0.234? Maybe that result hadn’t made its way over to the physics literature yet. (The first “0.234” paper came out in 1996, and MacKay’s book is dated 2003. I searched MacKay’s index for Gareth Roberts, but all that turned up was an obscure paper from 1994 on adaptive direction sampling, nothing on acceptance rates of optimal Metropolis and Langevin algorithms. It’s only fair, though, given all the work in physics that we statisticians have ignored over the years, that there be some reciprocity.)

The other part that surprised me was MacKay’s dismissal of adapting the jumping kernel. In practice (and in Bugs!) you can adapt all you want during the burn-in period. Or, to put it another way, if you run your chains for 1000 iterations and fail to achieve acceptable mixing, you can feel free to adapt and continue from there, considering those past 1000 steps as burn-in (which is appropriate if you have not mixed well). At some level MacKay recognizes this—he talks about tuning “by trial and error”—but he could be misleading people by suggesting that tuning can’t be done more systematically.
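
To make the “adapt during burn-in, then freeze” point concrete, here is a minimal random-walk Metropolis sketch. The target, the tuning rule, and all constants are made up for illustration (this is not MacKay’s code or the Bugs adaptation scheme); the one structural point is that the proposal scale changes only during warmup, so the kept draws come from an ordinary, non-adaptive Markov chain:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(theta):
    # Toy target: standard normal log density, up to a constant.
    return -0.5 * theta**2

def metropolis(n_warmup=1000, n_keep=5000, step=1.0):
    theta, lp = 0.0, log_target(0.0)
    draws, accepts = [], 0
    for t in range(n_warmup + n_keep):
        prop = theta + rng.normal(0.0, step)
        lp_prop = log_target(prop)
        accepted = np.log(rng.uniform()) < lp_prop - lp
        if accepted:
            theta, lp = prop, lp_prop
        if t < n_warmup:
            # Crude tuning rule (an assumption, not a recommendation): grow the
            # step after an acceptance, shrink it after a rejection. This settles
            # near an acceptance rate of roughly one half, in the spirit of the
            # rule of thumb MacKay quotes.
            step *= 1.1 if accepted else 0.9
        else:
            # After warmup the step is frozen and only these draws are kept.
            draws.append(theta)
            accepts += accepted
    return np.array(draws), accepts / n_keep, step

draws, accept_rate, tuned_step = metropolis()
print(accept_rate, tuned_step, draws.mean(), draws.std())
```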

32 thoughts on “David MacKay sez . . . 12??”

  1. 12 is a great number. It is the result of multiplying the first even prime number (that we know of) by the first perfect number.

  2. The “It is not valid to have the width parameters be dynamically updated during the simulation in a way that depends on the history of the simulation” claim seems not to take into account the important 2001 Bernoulli paper by Haario et al., “An adaptive Metropolis algorithm,” where the jump covariance depends on the entire chain history and the chain is nonetheless shown to converge to the correct distribution, even though detailed balance in the usual sense no longer applies. (A simplified sketch of the idea appears just after this thread.)

    • I was also going to mention this. But Andrew, you said it was OK to do the adapting during burn-in, but you didn’t say anything about after burn-in. Is there some reason not to do the adapting of the jump covariance matrix after burn-in?

      • Chris:

        Yes, you can adapt after burn-in—in fact, NUTS (which we use in Stan) is adaptive. But adapting after burn-in can be difficult. My point is that adapting during burn-in is a freebie.

    • I vaguely remember an article linked from this blog that showed that detailed balance is a much stronger condition than what is actually needed to guarantee validity of inferences from an MCMC chain, but I can’t for the life of me remember which article.
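
For concreteness, here is a rough sketch of the history-dependent adaptation the Haario et al. comment above refers to: the proposal covariance is rebuilt from the whole chain so far. The two-dimensional toy target, the initial fixed-proposal phase, and the constants (the 2.4^2/d scaling and the small jitter) roughly follow the usual presentation of their algorithm, but this is a simplified illustration rather than a faithful implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(theta):
    # Toy target: a correlated bivariate normal.
    prec = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))
    return -0.5 * theta @ prec @ theta

d, n_iter, eps = 2, 5000, 1e-6
theta = np.zeros(d)
lp = log_target(theta)
history = [theta.copy()]

for t in range(n_iter):
    if t < 100:
        cov = 0.1 * np.eye(d)  # fixed proposal during a short initial phase
    else:
        # Proposal covariance rebuilt from the entire history so far.
        # (A real implementation would update this recursively.)
        cov = (2.4**2 / d) * np.cov(np.array(history).T) + eps * np.eye(d)
    prop = rng.multivariate_normal(theta, cov)
    lp_prop = log_target(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    history.append(theta.copy())

samples = np.array(history[1000:])  # discard an initial chunk
print(samples.mean(axis=0))
print(np.cov(samples.T))  # should roughly recover the target covariance
```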

  3. I have read the “12 samples” thing in a “statistics rules of thumb” book, the name of which I’ve unfortunately forgotten already. Maybe MacKay had read the same book?

  4. I find that passage totally confusing. Assuming that expected cost = cost function × probability of having a certain level of corrosion (i.e., P(x)), then the variance (“precision”) of the expected cost is already not equal to the variance of P(x). If the cost function is a constant k (say), it would be k^2 times the variance of P(x). Also, the variance of the expected cost of repairs seems to be difficult to measure in practice; what are these 12 samples? 12 tubes? 12 tubes needing repairs? 12 tubes over some level of corrosion?

    • In practice the manager doesn’t care about the variance in cost, he cares about the standard deviation in cost. The k^2 factor goes away as soon as we take the square root in your example. After all, no one pays “square dollars” for anything. (A quick numerical check of the constant-factor point appears after this thread.)

      On the other hand, if cost is strongly nonlinear or even has a very high slope near the average, the uncertainty in cost can be much larger than the uncertainty from P(x).

      • Yes, variance = (SD)^2, but that isn’t my point. If you use SD, it is still the case that SD(cost) = k*SD(P(x)), if I’m reading this example correctly. It sounds like he is suggesting estimating SD(P(x)) by looking at a sample of 12 tubes, but he explicitly states that sigma is the SD(cost), not the SD(amount of corrosion), so I think he needs to straighten out this example.

        • I see what you’re saying here, I agree he should spell things out a little better, but I suspect he’s intentionally ignoring the constant factor. The cost function is something assumed by the manager, and if you assume it’s locally linear and the SD(x) is not too big, then the SD(f(x)) is a (known) constant times SD(x) so characterizing SD(x) also characterizes SD(f(x)) at least for the purposes of determining the number of samples. It’s not like more samples will improve your estimate of the constant k.

          If your cost function is highly nonlinear then this won’t necessarily work well, but most cost functions are at least monotone and often convex.

        • And my other point is you can’t decide what samples to collect when he isn’t clear (or he’s confused) about which quantity he’s estimating. If we’re estimating the SD of X, then you’d want a random sample of tubes, but if it’s the SD of cost, then you need to sample tubes that have been repaired. I guess I get frustrated because as a practitioner, these details are really important. Estimating the wrong SD can change the results a lot.

          (He should also discuss something about the confidence level of these estimates.)
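
On the constant-factor point in the sub-thread above: if the cost is (locally) linear in the corrosion, say cost ≈ k·x, then SD(cost) = k·SD(x), and the relative Monte Carlo error of the estimated mean is the same on either scale, so the “how many draws” question does not depend on k. A minimal numerical check, with a made-up k and a made-up distribution for x:

```python
import numpy as np

rng = np.random.default_rng(3)

k = 250.0                               # hypothetical repair cost per unit of corrosion
x = rng.gamma(2.0, 1.0, size=100_000)   # stand-in draws of corrosion under P(x)
cost = k * x

print(cost.std(), k * x.std())          # equal up to floating point: SD(k*x) = k*SD(x)

# The relative precision of the estimated mean from R draws is the same on the
# corrosion scale and the cost scale, so the choice of R does not depend on k.
for R in (9, 12):
    print(R, x.std() / np.sqrt(R) / x.mean(), cost.std() / np.sqrt(R) / cost.mean())
```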

  5. I used to similarly assume that you could not adapt your Metropolis proposals post-burn-in. However, this paper by Levine et al. (2005) seems to suggest that even though the Markov property is violated by adaptation, the chain still converges (see Theorem 3.1):

    http://bit.ly/uBfrcf

  6. I’ve never found the “0.234” arguments to be all that appealing: the heavy use of asymptotics and computational machinery offers little intuitive sense (and is there a multivariate version yet that includes arbitrary correlations?) and has the air of “proof by intimidation”. On the other hand, MacKay’s cute information theory argument for r = 0.5 (the idea being that the Metropolis-Hastings decision is a 1-bit communication channel, and the information is maximized for a 50% rejection rate) is much more apparent to the average student. Not to say it’s exact, but does it really matter in practice so long as the accept rate is within a few factors of 2 of the optimal rate? (A toy comparison along these lines appears after this thread.)

    • Mike:

      Take a look at our papers. 0.234 is a reasonable acceptance rate for a lot of examples, not just asymptotically, and it works fine under correlation. The real point is that in many cases we can do much better than Metropolis by using HMC. But within the world of Metropolis, 0.234 is great, and you can go beyond it by looking at expected squared jumped distance.

      • I totally agree about HMC (in fact, thanks to MacKay I implemented HMC before I ever implemented Metropolis; my thesis work in physics is dense with applications). My point wasn’t that 0.234 doesn’t work, but rather that the difference between 0.234 and 0.5 is pretty small in practice, and those kinds of pedantic arguments only help to drive apprehensive users away.
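
As a rough way to see how much the target acceptance rate matters in a simple case, here is a toy scan over proposal scales for a random-walk Metropolis on a one-dimensional normal target, reporting the acceptance rate next to the expected squared jumped distance mentioned above. The target and the grid of step sizes are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def run_chain(step, n_iter=20_000):
    """Random-walk Metropolis on a standard normal target; returns the acceptance
    rate and the expected squared jumped distance (rejected moves contribute 0)."""
    theta, accepts, sq_jumps = 0.0, 0, 0.0
    for _ in range(n_iter):
        prop = theta + rng.normal(0.0, step)
        if np.log(rng.uniform()) < -0.5 * (prop**2 - theta**2):
            sq_jumps += (prop - theta) ** 2
            theta = prop
            accepts += 1
    return accepts / n_iter, sq_jumps / n_iter

for step in (0.5, 1.0, 2.4, 5.0, 10.0):
    acc, esjd = run_chain(step)
    print(f"step {step:4.1f}  accept rate {acc:.2f}  ESJD {esjd:.2f}")
# In this toy example the efficiency proxy changes fairly slowly near its peak,
# which is in the spirit of the "within a few factors of 2" remark above.
```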

  7. Pingback: MacKay update: where 12 comes from « Statistical Modeling, Causal Inference, and Social Science

  8. I’m interested in the comment about Occam’s razor. I have read MacKay’s book – of which I am an enormous fan – and I find his position pretty convincing; it is backed up with formal reasoning. But disagreements are the best places to learn things, so I followed the link you give, and I find a counterargument at the level of ‘well, I think lots of parameters are great, so there’.

    I think you’ll grant that ‘well, I think lots of parameters are great’ isn’t much of an argument, so I would be interested in a more carefully worked out version of the same (I assume such exists) – ideally one that engages directly with MacKay’s formal argument in favour. Any pointers?

    • Sean:

      I think Radford Neal gets it exactly right in the quote that I have from him at the linked blog post. In Bayesian terms, if a larger model performs worse than a simple model, then the larger model needs a better prior distribution. Sure, you can’t estimate thousands of parameters from a little data set using least squares (or the equivalent with a noninformative prior), but who says you have to use least squares or a noninformative prior? To put it another way, I prefer an informative prior that says “theta is probably near zero,” rather than a lower-dimensional model that says “theta is exactly zero.”

      • Maybe I’m not catching what you’re saying. MacKay’s point (which is the same one that Jaynes makes in his book) isn’t about the estimate being bad, and it isn’t about the dimensionality per se. And it isn’t about what you *should* do in setting up your model. It’s about what *will* happen if you include in your inference two composite hypotheses that intersect somewhere, given that the prior is a finite measure.

        To the extent that the high-likelihood parameter values are in the intersection (i.e., “all things being equal”), if one of the composite hypotheses M2 allows for a wider range of parameter values (i.e., is “more complex”), then M2 will have lower posterior probability. “Allows for a wider range of parameter values” could mean that M1 has comparatively low prior variance on theta, or it could mean that it has zero prior variance on theta (i.e. lower dimensional). It doesn’t matter, as long as M2 has higher prior variance on theta. If M2 doesn’t have enough high-likelihood regions outside the region of theta-space where M1 assigns high prior probability, then the posterior probability of M2 (or any particular point hypothesis within it) will be low. It’s the same reason a value of theta that doesn’t assign too much probability mass to unseen points will give higher likelihood. The answer, “you shouldn’t make this model comparison” misses the point. The question is what happens if you *do* make this comparison – and if you look inside a hierarchical model there will generally be some such situation.
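
The mechanics being argued about here can be seen in a tiny made-up calculation (mine, not MacKay’s or Radford Neal’s): compare a model that fixes theta = 0 against one with a normal prior on theta, using the marginal likelihood of an observed sample mean. A wide prior on theta lowers the bigger model’s marginal likelihood when the data sit near zero, and a tighter, more informative prior removes most of the penalty, which is the “larger model needs a better prior” point:

```python
import numpy as np
from scipy.stats import norm

# Data model: y_i ~ N(theta, sigma^2), so the sample mean ybar ~ N(theta, sigma^2/n).
# M1 fixes theta = 0.  M2 puts theta ~ N(0, tau^2), so marginally under M2
# ybar ~ N(0, tau^2 + sigma^2/n).
sigma, n, ybar = 1.0, 20, 0.1
se = sigma / np.sqrt(n)

ml_M1 = norm.pdf(ybar, loc=0.0, scale=se)
ml_M2_wide = norm.pdf(ybar, loc=0.0, scale=np.sqrt(10.0**2 + se**2))   # tau = 10
ml_M2_tight = norm.pdf(ybar, loc=0.0, scale=np.sqrt(0.5**2 + se**2))   # tau = 0.5

print(ml_M1 / ml_M2_wide)    # large Bayes factor favoring the point-null model
print(ml_M1 / ml_M2_tight)   # much closer to 1 once the prior on theta is informative
```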

  9. Professor Gelman,

    Implicit in this “amount of corrosion” example is that there is at least some corrosion in all of the samples. That is…the prevalence of corrosion is 100%. I have trouble with how these kinds of examples get extended in my domain (and some others) because of this inherent [semi-occult] assumption. That is…if the “amount of corrosion” present was zero in 98% of the population…would 12 pipes really be enough to estimate the “amount of corrosion” in the population of 10,000? I mean…thinking of this as a hypergeometric…the most likely outcome of a sample of that size from that population would be “no corrosion present…” (A quick calculation of this appears after this thread.)

    What am I missing?

    • If the amount of corrosion is zero in 98% of the population, then to the accuracy he assumes is necessary, the amount of corrosion *is* zero, or something close. The point is that precise estimates are often not very helpful to some real world goal, and are usually expensive. Looking at 12 randomly sampled pipes, seeing that none are corroded, and assuming that “essentially none are corroded” seems pretty valid in many cases.

      Where this goes wrong is when even one corroded pipe (or bolt) is enough to cause serious or catastrophic consequences. This occurs in things like old steel truss bridges, where the failure of one connection could bring the whole bridge down. Randomly sampling 12 connections and extrapolating that none of the connections are weakened is not valid in this type of case.

      • Daniel,
        Thanks…the engineering perspective clarified the problem in my head.
        Kinda like looking for cancer in a population that is basically cancer free…No?
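
The hypergeometric intuition in this thread is easy to quantify. If, say, 200 of the 10,000 pipes (the commenter’s hypothetical 2%) have any corrosion at all, the chance that a random sample of 12 contains none of them is:

```python
from math import comb

N, K, n = 10_000, 200, 12    # population size, corroded pipes (2%), sample size
p_no_corrosion = comb(N - K, n) / comb(N, n)
print(p_no_corrosion)        # about 0.78: most samples of 12 show no corrosion at all
```

So a dozen draws can pin down the expected total cost to within about σ/√12 while saying very little about a rare defect that matters only when it is catastrophic, which is the bridge-connection point above.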

  10. “σ measures by how much we should expect the actual cost to differ from the expectation Φ” is a wee bit confusing in that it seems to represent the mean absolute deviation, E(|φ(x)-Φ|), the original measure of variability introduced by Laplace…
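
For what it’s worth, under a normal distribution the two summaries differ only by a constant factor: the mean absolute deviation is σ·√(2/π) ≈ 0.8σ. A one-line check (the normality assumption here is mine, not MacKay’s):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=1_000_000)
print(np.mean(np.abs(x)), np.sqrt(2 / np.pi))   # both about 0.80
```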

  11. Pingback: David MacKay and Occam’s Razor « Statistical Modeling, Causal Inference, and Social Science
