Adaptive Metropolis algorithms

Christophe Andrieu gave a talk at IceBugs on adaptive MCMC for Bugs. I wasn’t able to attend the meeting, but the presentation looks reasonable to me. Right now, one of my problems with Bugs is that sometimes it crashes–I think because it’s using fancy stuff like slice sampling. I’d rather see it using the Metropolis algorithm, which never crashes.

The only thing about Andrieu’s talk that I question is the emphasis on “vanishing adaptation”–that is, adaptation that diminishes over time, so that the algorithm still converges to the correct target distribution. This is OK, I guess, but I think any adaptation is OK in the sense that it can eventually stop, and then you’ll get the correct stationary distribution. After all, we need to do some burn-in anyway, so it makes sense to me to adapt for the first (burn-in) half and then simulate from there. (Then, if convergence hasn’t been reached, one can use this last half of the simulations to do another adaptation, and so forth.) So I don’t know that any special theory is needed.
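A minimal sketch of this adapt-then-freeze scheme for a random-walk Metropolis sampler (the target, the tuning rule, and the 44% acceptance goal below are illustrative assumptions, not a recommendation):

```python
import numpy as np

def adapt_then_sample(log_post, x0, n_iter=2000, scale=1.0, rng=None):
    """Random-walk Metropolis: tune the jumping scale during the first
    (burn-in) half, then freeze it and simulate from there."""
    rng = np.random.default_rng() if rng is None else rng
    x, lp = x0, log_post(x0)
    draws, accepts = [], 0
    for t in range(n_iter):
        y = x + scale * rng.normal()           # random-walk proposal
        lp_y = log_post(y)
        if np.log(rng.uniform()) < lp_y - lp:  # Metropolis accept/reject
            x, lp, accepts = y, lp_y, accepts + 1
        # Adapt only during the burn-in half; after that the kernel is
        # fixed, so ordinary Metropolis theory applies unchanged.
        if t < n_iter // 2 and (t + 1) % 50 == 0:
            scale *= np.exp(accepts / (t + 1) - 0.44)  # aim near 44% acceptance
        draws.append(x)
    return np.array(draws[n_iter // 2:]), scale  # keep the post-adaptation half

# Example with a standard normal target:
second_half, tuned_scale = adapt_then_sample(lambda x: -0.5 * x**2, x0=0.0)
```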

(Here’s my own paper (with Cristian Pasarica) on the topic, where we have success adapting Metropolis jumps by optimizing the expected squared jumped distance. Our algorithm is great, but I have to admit it’s pretty complicated, and Jouni and I have not gotten around to implementing it in Umacs.)
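To make “expected squared jumped distance” concrete: the empirical ESJD of a chain is just the average of its squared one-step moves. Here is a toy sketch, with a crude grid search standing in for the paper’s actual (more sophisticated) optimization:

```python
import numpy as np

def esjd(draws):
    """Empirical expected squared jumped distance of a 1-D chain."""
    return np.mean(np.diff(draws) ** 2)

def metropolis(log_post, x0, scale, n_iter, rng):
    x, lp, out = x0, log_post(x0), []
    for _ in range(n_iter):
        y = x + scale * rng.normal()
        lp_y = log_post(y)
        if np.log(rng.uniform()) < lp_y - lp:
            x, lp = y, lp_y
        out.append(x)
    return np.array(out)

# Crude stand-in for the paper's optimization: short pilot runs over a
# grid of scales, keeping the scale with the largest empirical ESJD.
rng = np.random.default_rng(1)
log_post = lambda x: -0.5 * x**2   # standard normal target
scales = [0.1, 0.5, 1.0, 2.4, 5.0, 10.0]
best = max(scales, key=lambda s: esjd(metropolis(log_post, 0.0, s, 2000, rng)))
print("scale with largest empirical ESJD:", best)
```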

Following up on this, and in answer to the question in Section 1.5 of Andrieu’s presentation, yes, adaptation can be done automatically. In fact, we do it in Umacs, adapting both the jumping scale and (for multivariate jumping rules) the shape of the jumping kernel.
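A rough sketch of this kind of scale-and-shape adaptation (a generic empirical-covariance scheme, not the actual Umacs code; the 2.38²/d scaling is the usual rule of thumb):

```python
import numpy as np

def shape_adapted_metropolis(log_post, x0, n_iter=4000, rng=None):
    """Multivariate random-walk Metropolis whose proposal covariance is
    re-estimated from the draws so far during burn-in (a generic sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    d = x.size
    cov = np.eye(d)                  # initial jumping-kernel shape
    lp, draws = log_post(x), []
    for t in range(n_iter):
        y = rng.multivariate_normal(x, (2.38**2 / d) * cov)  # rule-of-thumb scale
        lp_y = log_post(y)
        if np.log(rng.uniform()) < lp_y - lp:
            x, lp = y, lp_y
        draws.append(x.copy())
        # Periodically re-estimate the kernel shape, during burn-in only.
        if 100 <= t < n_iter // 2 and t % 100 == 0:
            cov = np.cov(np.asarray(draws).T) + 1e-6 * np.eye(d)
    return np.asarray(draws[n_iter // 2:])
```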

P.S. Maybe it’s the adaptive rejection sampling that’s causing Bugs to crash–see Radford’s comment.

9 thoughts on “Adaptive Metropolis algorithms”

  1. … one of my problems with Bugs is that sometimes it crashes–I think because it's using fancy stuff like slice sampling. I'd rather see it using the Metropolis algorithm, which never crashes.

    Andrew: I'm unaware of any circumstance in which any of the methods in my slice sampling paper will "crash". If you have any basis for thinking this happens, please let me know. If I recall a conversation with one of the BUGS implementors correctly, they sometimes use slice sampling because their Adaptive Rejection Sampling implementation can crash or hang up.

    Metropolis may never crash, but you can easily get a completely wrong answer without realizing it, as illustrated in my slice sampling paper.

  2. I don't think BUGS uses ARS any more (at least not OpenBUGS). The slice sampler can misbehave sometimes, but I don't know if that's because of the implementation. I'll pass this on to Andrew T., and see what he says.

    Radford: do you fancy learning Component Pascal?

    Bob

  3. I disagree with both Radford's comment and the P.S. that Andrew added, or else I don't understand them. What does it mean for an _algorithm_ to "crash"? If the algorithm leads to an endless loop, or otherwise fails to terminate, then I guess one can say that the algorithm has "crashed." But normally, we speak of a computer program crashing, and that's a very different concept from an algorithm crashing. Even if I implement a perfectly well-behaved algorithm, I might create a program that crashes (e.g., if the program has a "memory leak" that eventually fills up the available computer memory).

    Yes, it's _possible_ that BUGS crashes occasionally because it is implementing an algorithm that never terminates, or an algorithm that requires it to store so much stuff in memory that it eventually runs out. But it's also possible, and perhaps more likely, that it crashes occasionally because of a programming flaw (or perhaps even an operating system flaw).

    It's true that it's easier to avoid such flaws — I am deliberately avoiding the word "bugs", given the context of the discussion — when implementing a simple algorithm than when implementing a complicated one, so I guess there is a sense in which BUGS might be crashing "because it's using fancy stuff". But it doesn't mean there's anything wrong with the underlying mathematical algorithms; it's more likely an implementation problem.

  4. Regarding Phil Price's comment… The sort of problem that can arise with these algorithms is that roundoff error results in a fatal situation that would never occur in the idealized algorithm with infinite precision arithmetic. For instance, an interval (L,R) in which L is supposed to be less than R might end up with L being greater than R after round-off occurs. In a sense, this is a programming bug, but the possibility of such bugs may be greater for some algorithms than others.
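    A concrete illustration of how an interval can collapse under double-precision roundoff (an assumed example, not BUGS code):

    ```python
    # At x = 1e16 the spacing between adjacent doubles is 2.0, so an interval
    # of width 0.5 around x rounds to a single point: the test L < R fails.
    x = 1e16
    L, R = x - 0.25, x + 0.25
    print(L < R)   # False: both endpoints round to exactly 1e16
    ```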

  5. There are many versions of slice sampling, so it's hard to make totally general statements about how likely an implementation of it is to hang.

    However, the basic method of expanding the interval by "stepping out" and then sampling with rejections to shrink the interval is pretty simple, and should be capable of robust implementation. I do notice, however, that a less-than in Figure 5 of my Annals paper should really be a less-than-or-equal, so that the procedure will terminate even if the only acceptable point is the point you started at. (This is actually what I do in my program.)

    If you're worried, you can always impose some large maximum on the number of rejections in the procedure, and just abandon it (leaving the state as is) if this maximum is exceeded. Detailed balance is still satisfied (since detailed balance is obviously satisfied by transitions that don't change the state).
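    A minimal sketch of one such update, following Figures 3 and 5 of the paper: stepping out with the limit of m steps split randomly between the two directions, then shrinkage with the less-than-or-equal fix and a cap on rejections that falls back to the current point (the default width and limits are illustrative):

    ```python
    import numpy as np

    def slice_update(logf, x0, w=1.0, m=100, max_shrink=1000, rng=None):
        """One univariate slice-sampling update, working on the log scale."""
        rng = np.random.default_rng() if rng is None else rng
        logy = logf(x0) + np.log(rng.uniform())   # log of the slice level

        # Stepping out (Fig. 3): the limit of m steps is split randomly
        # between the two directions, which is needed for correctness.
        L = x0 - w * rng.uniform()
        R = L + w
        J = int(np.floor(m * rng.uniform()))
        K = m - 1 - J
        while J > 0 and logy < logf(L):
            L -= w
            J -= 1
        while K > 0 and logy < logf(R):
            R += w
            K -= 1

        # Shrinkage (Fig. 5), with <= so x0 itself always remains acceptable,
        # and a cap on rejections: giving up and keeping the current state
        # still satisfies detailed balance.
        for _ in range(max_shrink):
            x1 = L + (R - L) * rng.uniform()
            if logy <= logf(x1):
                return x1
            if x1 < x0:
                L = x1
            else:
                R = x1
        return x0
    ```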

  6. BUGS uses two forms of the slice sampling algorithm, depending on whether the support of the conditional distribution is a finite range of the real line or not. If the support is finite, BUGS takes the initial slice to be the support of the distribution and draws a uniform from this. If the new point is outside the distribution, the slice is shrunk in. If the support of the conditional is open at one (or both) ends, then BUGS uses Radford's stepping-out algorithm followed by shrinking.

    The algorithm may fail in two places: the stepping-out procedure never manages to bracket the slice, or no sample from the slice is accepted. I limit the number of attempts at each stage of the algorithm so that BUGS will give an error message instead of hanging. The problem is: how generous should the number of attempts be? If it is too generous, it may appear that the algorithm has hung.

    I think the stepping-out procedure is the more difficult problem. What scale should the stepping out use? Too small and the algorithm gives an error. Too big and you attempt to evaluate the likelihood at an impossible value and get a floating point exception. So BUGS starts with a small step, monitors the average absolute distance moved over a window (say, 25 iterations), and then uses this distance to choose the step size. This adaptation is allowed for so many iterations and then stops.
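    In sketch form, that adaptation looks roughly like this (the window length, the lower bound, and the stopping rule here are illustrative, not the actual BUGS source):

    ```python
    import numpy as np

    def adapt_width(moves, w, window=25):
        """Choose the stepping-out width from the average absolute distance
        moved over the last `window` iterations (an illustrative rule)."""
        if len(moves) >= window:
            w = np.mean(np.abs(moves[-window:]))   # typical move length
        return max(w, 1e-8)   # guard against a width that rounds to zero

    # Inside the sampler loop, during the adaptation phase only:
    #   moves.append(x_new - x_old)
    #   if t < adapt_limit:     # adaptation is allowed for so many iterations
    #       w = adapt_width(moves, w)
    ```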

    I find the comment that the slice sampler can just stick at the current point interesting. If I understand correctly, I could just quit the slice sampling algorithm if the limit on the number of attempts in either part of the algorithm was exceeded and just stay at the current point.

    On another point: with the slice sampler, how do things work with a multimodal distribution?

  7. Regarding slice sampling…

    The algorithm may fail in two places: the stepping-out procedure never manages to bracket the slice, or no sample from the slice is accepted. I limit the number of attempts at each stage of the algorithm so that BUGS will give an error message instead of hanging. The problem is: how generous should the number of attempts be? If it is too generous, it may appear that the algorithm has hung.

    The algorithm in my Annals of Statistics paper allows for a limit on the number of steps out, which must be randomly allocated to the two directions to ensure validity. So there shouldn't be any need to give out error messages – just put in some fairly large limit and proceed onwards even if it's reached.

    What scale should the stepping out use? Too small and the algorithm gives an error.

    A really small interval will round off to zero, but with care, that just means you end up staying at the current point. Of course, this better not happen all the time, or you make no progress.

    Too big and you attempt to evaluate the likelihood at an impossible value and get a floating point exception.

    I don't see why that should be. You should of course be working with the log of the density, not the density itself, and if even that overflows, things should still work out OK when using IEEE floating point. This assumes the log density of the current point doesn't overflow, but if it does, no method is going to work.
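    A small illustration of the point (an assumed example): a log density that returns minus infinity outside the support keeps the slice comparison well defined under IEEE arithmetic, with no exception.

    ```python
    import numpy as np

    def log_density(x):
        # Log density of Exponential(1): -inf outside the support instead
        # of a floating point error from taking the log of zero.
        return -x if x >= 0 else -np.inf

    logy = -2.0    # some slice level, on the log scale
    x_bad = -3.5   # an "impossible" value reached while stepping out
    print(logy <= log_density(x_bad))   # False: rejected cleanly, no exception
    ```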

    I find the comment that the slice sampler can just stick at the current point interesting. If I understand correctly, I could just quit the slice sampling algorithm if the limit on the number of attempts in either part of the algorithm was exceeded and just stay at the current point.

    My comment refers only to the shrinkage part of the algorithm. For the expansion part, you have to divide up your limit randomly, as in Figure 3 of the paper.

  8. So it looks like the slice sampler can be made to never fail…

    I am trying to implement the Stable distribution in OpenBUGS via the auxiliary variable method of Buckle. I have decided to use slice sampling for the alpha, beta, gamma, and delta parameters of the Stable. It is possible to prove that some of these conditional distributions are multimodal (you can calculate the points at which the distribution has zeros). I am not sure that I am getting sensible results for all the parameters. How does the slice sampler behave if the conditional distribution has, say, 50 or 500 or 10000 zeros?

  9. I am trying to implement the Stable distribution in OpenBUGS via the auxiliary variable method of Buckle. I have decided to use slice sampling for the alpha, beta, gamma, and delta parameters of the Stable. It is possible to prove that some of these conditional distributions are multimodal (you can calculate the points at which the distribution has zeros). I am not sure that I am getting sensible results for all the parameters. How does the slice sampler behave if the conditional distribution has, say, 50 or 500 or 10000 zeros?

    I take it that by "zeros" you mean modes. The slice sampler doesn't provide any magic solution to multimodality. In low dimensions, you should choose a big interval width, even if that means you usually have to shrink a lot, since that gives you a better chance of jumping to another mode. Updating one parameter at a time may be bad, since maybe you can't jump from one mode to another that way; the multivariate version of slice sampling with hyperrectangles might work better. In high dimensions, solving difficult multimodality problems is going to require something much more elaborate (simulated tempering, tempered transitions, etc.)
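    For reference, a minimal sketch of that hyperrectangle variant: position a box of given widths randomly around the current point, then shrink it coordinatewise toward that point on each rejection (the widths are an assumption the user must supply):

    ```python
    import numpy as np

    def hyperrect_slice_update(logf, x0, w, rng=None):
        """One multivariate slice-sampling update with a hyperrectangle:
        no stepping out; the box is shrunk toward x0 until a point on the
        slice is found (x0 itself is always acceptable, so this terminates)."""
        rng = np.random.default_rng() if rng is None else rng
        x0 = np.asarray(x0, dtype=float)
        w = np.asarray(w, dtype=float)
        logy = logf(x0) + np.log(rng.uniform())    # log of the slice level
        L = x0 - w * rng.uniform(size=x0.shape)    # box positioned randomly
        R = L + w                                  # around the current point
        while True:
            x1 = L + (R - L) * rng.uniform(size=x0.shape)
            if logy <= logf(x1):
                return x1
            low = x1 < x0          # shrink each coordinate toward x0
            L = np.where(low, x1, L)
            R = np.where(low, R, x1)
    ```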
