(Markov chain) Monte Carlo doesn’t “explore the posterior”

[Edit: (1) Nothing here depends on the Markov chain part; the argument applies to any Monte Carlo method in high dimensions. (2) No, (MC)MC is not broken.]

First some background, then the bad news, and finally the good news.

Spoiler alert: The bad news is that exploring the posterior is intractable; the good news is that we don’t need to explore all of it to calculate expectations.

Sampling to characterize the posterior

There’s a misconception among Markov chain Monte Carlo (MCMC) practitioners that the purpose of sampling is to explore the posterior. For example, I’m writing up some reproducible notes on probability theory and statistics through sampling (in pseudocode with R implementations) and have just come to the point where I’ve introduced and implemented Metropolis and want to use it to exemplify convergence monitoring. So I did what any right-thinking student would do and borrowed one of my mentor’s diagrams (which is why this will look familiar if you’ve read the convergence monitoring section of Bayesian Data Analysis 3).

First M steps of isotropic random-walk Metropolis with proposal scale normal(0, 0.2) targeting a bivariate normal with unit variance and 0.9 correlation. After 50 iterations, we haven’t found the typical set, but after 500 iterations we have. Then after 5000 iterations, everything seems to have mixed nicely through this two-dimensional example.
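
Here is a minimal R sketch of the sampler behind this figure, for anyone who wants to reproduce something like it; the target (unit variances, 0.9 correlation) and the proposal scale of 0.2 come from the caption, while the function names, starting point, and seed are just illustrative choices.

    # Isotropic random-walk Metropolis targeting a bivariate normal with
    # unit variances and correlation 0.9 (log density up to a constant).
    log_target <- function(theta, rho = 0.9) {
      -(theta[1]^2 - 2 * rho * theta[1] * theta[2] + theta[2]^2) / (2 * (1 - rho^2))
    }

    metropolis <- function(M, init = c(-2.5, 2.5), scale = 0.2) {
      draws <- matrix(NA_real_, nrow = M, ncol = 2)
      theta <- init
      lp <- log_target(theta)
      for (m in 1:M) {
        prop <- theta + rnorm(2, 0, scale)   # isotropic normal(0, 0.2) proposal
        lp_prop <- log_target(prop)
        if (log(runif(1)) < lp_prop - lp) {  # Metropolis accept step
          theta <- prop
          lp <- lp_prop
        }
        draws[m, ] <- theta
      }
      draws
    }

    set.seed(1234)
    draws <- metropolis(5000)
    plot(draws[1:50, ], type = "o", xlim = c(-3, 3), ylim = c(-3, 3),
         xlab = expression(theta[1]), ylab = expression(theta[2]),
         main = "First 50 iterations")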

This two-dimensional traceplot gives the misleading impression that the goal is to make sure each chain has moved through the posterior. This low-dimensional thinking is nothing but a trap in higher dimensions. Don’t fall for it!

Bad news from higher dimensions

It’s simply intractable to “cover the posterior” in high dimensions. Consider a 20-dimensional standard normal distribution. There are 20 variables, each of which may be positive or negative, leading to a total of $latex 2^{20}$, or more than a million orthants (generalizations of quadrants). In 30 dimensions, that’s more than a billion. You get the picture: the number of orthants grows exponentially, so we’ll never cover them all explicitly through sampling.
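
A quick R check of how little of that orthant structure sampling ever touches (the seed and the number of draws are arbitrary choices):

    # How many of the 2^20 = 1,048,576 orthants do 10,000 independent
    # draws from a 20-dimensional standard normal actually visit?
    set.seed(1)
    D <- 20
    M <- 10000
    draws <- matrix(rnorm(M * D), nrow = M, ncol = D)
    orthants <- apply(draws > 0, 1, function(s) paste(as.integer(s), collapse = ""))
    length(unique(orthants))  # at most 10,000 orthants, i.e. under 1% of them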

Good news in expectation

Bayesian inference is based on probability, which means integrating over the posterior density. This boils down to computing expectations of functions of parameters conditioned on data. This we can do.

For example, we can construct point estimates that minimize expected square error by using posterior means, which are just expectations conditioned on data, which are in turn integrals, which can be estimated via MCMC,

$latex \begin{array}{rcl} \hat{\theta} & = & \mathbb{E}[\theta \mid y] \\[8pt] & = & \int_{\Theta} \theta \times p(\theta \mid y) \, \mbox{d}\theta \\[8pt] & \approx & \frac{1}{M} \sum_{m=1}^M \theta^{(m)},\end{array}$

where $latex \theta^{(1)}, \ldots, \theta^{(M)}$ are draws from the posterior $latex p(\theta \mid y).$

If we want to calculate predictions, we do so by using sampling to calculate the integral required for the expectation,

$latex p(\tilde{y} \mid y) \ = \ \mathbb{E}[p(\tilde{y} \mid \theta) \mid y] \ \approx \ \frac{1}{M} \sum_{m=1}^M p(\tilde{y} \mid \theta^{(m)}).$

If we want to calculate event probabilities, it’s just the expectation of an indicator function, which we can calculate through sampling, e.g.,

$latex \mbox{Pr}[\theta_1 > \theta_2] \ = \ \mathbb{E}\left[\mathrm{I}[\theta_1 > \theta_2] \mid y\right] \ \approx \ \frac{1}{M} \sum_{m=1}^M \mathrm{I}[\theta_1^{(m)} > \theta_2^{(m)}].$
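
For concreteness, here is a minimal R sketch of all three estimates applied to a matrix of posterior draws; the draws below are simulated independent normals standing in for real MCMC output, and the sampling model used for the predictive density is an assumption made purely for illustration.

    # Stand-in posterior draws for a two-parameter model; any MCMC output
    # arranged as one row per draw would be used the same way.
    set.seed(1)
    M <- 4000
    theta <- cbind(rnorm(M, 1.0, 0.5), rnorm(M, 0.8, 0.5))  # theta_1, theta_2

    # Posterior means: point estimates minimizing expected square error
    colMeans(theta)

    # Posterior predictive density at a new y_tilde, assuming (for
    # illustration only) the sampling model y_tilde ~ normal(theta_1, 1)
    y_tilde <- 1.5
    mean(dnorm(y_tilde, mean = theta[, 1], sd = 1))

    # Event probability Pr[theta_1 > theta_2 | y]
    mean(theta[, 1] > theta[, 2])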

The good news is that we don’t need to visit the entire posterior to compute these expectations to within a few decimal places of accuracy. Even so, MCMC isn’t magic—those two or three decimal places will be zeroes for tail probabilities.

69 thoughts on “(Markov chain) Monte Carlo doesn’t “explore the posterior””

  1. But we still want to “explore the typical posterior set”, don’t we? I think it is conceptually helpful for getting a better grasp of MCMC methods. If I didn’t misread, this is the main argument Michael Betancourt uses to advance Hamiltonian Monte Carlo, for example.

    • This is exactly what I think is misleading. What does it mean to explore the posterior’s typical set? The typical set in a multivariate normal runs through all the orthants, but we can’t visit them all. So there are going to be huge regions of the typical set we don’t explore.

      P.S. Michael’s pretty careful about focusing on expectation computations. But it’s easy to slip into this informal notion that you’re doing a good job of exploring the typical set, when in fact it’s impossible.

      • > So there are going to be huge regions of the typical set we don’t explore.

        How are those “unexplored regions” defined? In what sense are they “huge”?

        For a bivariate normal you can also partition the plane in billions of angular regions and say that most of them are left “unexplored”.

        • The unvisited regions are “huge” in the only sense that matters—they represent most of the probability mass. In the normal example, the mass is distributed equally among the orthants and Monte Carlo methods can only visit a small fraction of them.

          The difference in low dimensions is that you actually can explore all of the high volume regions in the sense of having draws close (in Euclidean distance) to every high probability mass region of the posterior.

        • The unvisited regions are equally huge in the low-dimensional case if I divide the space in a large number of regions. You’re dividing a 20-dimensional hypersphere in 1’048’576 hyperoctants, and you say that if you sample 1000 points from the hypersphere most of those regions are not going to be visited. But you could also divide a 3-dimensional sphere in a million regions and reach the same conclusion.

          What this example illustrates is that grid methods break in high dimensions, because even the coarsest hypergrid possible results in too many regions. The trap to be avoided is to think that exploring the volume is synonymous with visiting every compartment in the grid.

          It’s true that any random point in the high-dimensional hypersphere will be far from any of the sampled points, because every random point is equally far from every other random point in high dimensions. This is of course different from the low-dimensional case, where there is a non-degenerate distribution of distances and the closer point will be closer than the average distance.

        • > because every random point is equally far from every other random point in high dimensions

          To clarify, that “equally far” has to be understood in the usual asymptotic sense: for D going to infinity, the probability of the distance between two random points being outside the interval [typical distance – epsilon, typical distance + epsilon] goes to zero for any epsilon (or something like that).

        • Good point. The 999,000 regions you don’t visit in that 3D space make up most of the probability mass. Your last point’s the one I was trying to make about getting close in lower dimensions, and yes, it’s just a consequence of how space scales between random draws in high dimensions.

          What I’m trying to get at is that we don’t want to think of Monte Carlo methods as being like grid (quadrature) methods.

  2. > It’s simply intractable to “cover the posterior” in high dimensions.

    I think that you make it seem more mysterious than it is. In low dimensions it’s also impossible to “cover the posterior” in a literal sense because the volume of the N points doing the “coverage” is zero.

    “Covering the posterior” means that these points are “all over the posterior”, in the sense of “covering” as much as can be “covered” with N points, and this understanding carries well from low dimensions to high dimensions.

    • See the previous reply—the coverage is in terms of distance to all the high probability mass regions. For instance, in a 2 dimensional isotropic normal (for simplicity), most of the probability mass is in a disk of radius 2 around the origin and almost all of it is within a disk of radius 3. With a few thousand draws, you’ll get a point near every point in that region with Monte Carlo methods. That property goes away in high dimensions.

      The relevant sense in which they are all over is that they’re sufficient for computing expectations.

  3. One thing that I think can be confusing about this approach to exposition (which I think is extremely helpful overall) is the ambiguity about what is meant by “taking expectations with respect to the posterior”. Generally speaking, taking an expectation is about collapsing a distribution into a scalar (e.g. expected utility theory). However, our goal with MCMC is firstly to characterize a posterior *distribution*. So we generally want a distribution for P(y_rep | y), for instance, and not just a scalar quantity. This is related to the distinction between the prior predictive *distribution* and the marginal likelihood, something which is initially confusing when you just see notation written down.

    • By expectations with respect to the posterior, I mean conditional expectations of the form $latex \mathbb{E}[f(\theta) \mid y]$ where $latex y$ is data and $latex \theta$ is a vector of parameters. Almost everything we want to compute is in this form other than quantiles.

      > However, our goal with MCMC is firstly to characterize a posterior *distribution*.

      This is what I see people writing and what I was objecting to. But maybe I don’t understand what authors mean by characterizing the posterior distribution. We clearly can’t sample from it so that we get draws in all the orthants. But we can sample from it enough to compute marginal quantiles or expectations.

      Hey, what happened to MathJax?

      • What do you think of making histograms or density plots of parameters with our posterior (MCMC) draws? I guess that is what I have in mind when I say “characterize a posterior distribution”. Sure, the high-dimensional target is barely covered by the sampling, but our marginals are often pretty well represented by this sampling, no? Although I admit that you could equally think about this *distribution* as defined by moments (expectations) and/or quantiles, and in that sense we are taking expectations with respect to the posterior distribution. I am not disagreeing with the correctness of your approach, just trying to express why I think it can be confusing or counter-intuitive until you really think about it. In my mind, expectations convert distributions to scalars, and the distinction between distributions and scalars is critical.

        • In some sense a marginal histogram bin is an expectation with respect to a subset of N-1 of the variables right? What’s the probability that q[i] falls between start_of_bin_j and end_of_bin_j integrated over all the other i values?

          But I agree it takes a kind of sophistication to understand that that’s what a marginal distribution does.
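
          For instance (an R sketch with a made-up draw matrix q, one row per draw), a marginal histogram bin proportion is literally a Monte Carlo estimate of an indicator expectation:

            set.seed(1)
            q <- matrix(rnorm(4000 * 20), ncol = 20)   # stand-in posterior draws
            bin_lo <- 0
            bin_hi <- 0.5
            # Pr(bin_lo < q[,1] <= bin_hi | y), integrating over the other 19 dimensions
            mean(q[, 1] > bin_lo & q[, 1] <= bin_hi)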

        • Exactly. We barely cover the high-dimensional target in the sense that there are lots of regions of high probability mass we never get near, but we cover it sufficiently to compute expectations. Those can involve marginals (parameter estimates), but they can also involve computing non-linear functions of the entire posterior (predictive densities).

          And thanks. I’m running this idea up the flagpole of Andrew’s blog, which is super helpful in clarifying my own thinking and presentation.

  4. Tierney and Kadane (1986) have a paper (Accurate Approximations for Posterior Moments and Marginal Densities) where they make what I interpret to be a similar argument, but using Laplace approximations: basically, if you have a very large vector of latent variables but are interested in a low-dimensional function of them that involves an integral (like an expectation), then Laplace approximations still work nicely, like they do in low dimensions. I wonder if it’s the same idea, or if it works because of the same underlying principle?

    • One other interesting thing about using Laplace approximations is that because errors are relative, not absolute, you can approximate small things well, like tail probabilities.

        • My understanding is that you have a guarantee like est/true = 1 + O(1/n), so it doesn’t suffer from the true quantity being small; contrasted with methods which have additive error rates of est = true + O(something). The “n” there is any parameter which makes the function more peaked around its mode as it gets larger; so in stats, a sample size, if the function is a likelihood or a posterior. Based on this, I think that your comment is correct, but that the “approximately” part is pretty generous in finite samples, because of the O(1/n) rate. Not sure though? I am new to this material myself!

  5. Bob:

    1. Great post, and it’s great to see some comments on a technical topic, good to know that people will comment on something other than p-values (which might be considered the “Israel/Palestine topic” of statistics blogging).

    2. If, based on your post, “Markov chain Monte Carlo doesn’t explore the posterior,” would it be fair to say that no random sampling explores the posterior? Even simple random sampling doesn’t explore the posterior?

    If so, we could make the post title more general by removing “Markov chain” from the title. This is not just pedantry; I think it’s important to distinguish between difficulties with MCMC, and difficulties with Monte Carlo methods in general. We discuss some of this in section 10.5 of BDA3.

  6. The thing that’s amazing about finite measure spaces (probability) as you get into high dimensions is that basically every high dimensional function evaluated in the region of high probability is nearly constant. So for example if you have a prediction function

    y = f(x,a,b,c,d,e,f…z) and x is a given dependent variable you want a prediction for, then you’re going to predict “basically the same y” no matter what the values of the parameters a,b,c…z provided they’re sampled from the posterior (another way of thinking of this is that there will be some compact “sort of normal” distribution of y, not some terrible distribution with deep tails and multi-modal little separated islands of probability etc)

    Now obviously this isn’t true if most of those parameters don’t affect the prediction at all, and one single parameter does all the work, or if all the parameters are highly correlated, but for usual circumstances, that’s not the case.

    In other words, if we only have 25 samples of the {a,b,c..z} vector we’re going to see “most of the action” in the variability of the prediction y, because other ways of sampling would have given a similar range of y because y is “kind of a constant” in the sense that Normal(0,1) is “kind of like 0”

    This is also true for functions like “the average of {a,b,c,..z}” and “the utility associated to the prediction that {a,b,c..z} gives” and “the probability that {a,b,c,…z} is less than {1,2,7,11,…}” and many other functions of the {a,b,c…z} vector.

    In fact, “the sample average of the stan samples” is precisely one of those functions of many random variables… so it’s going to give basically the same answer regardless of which small slice of the posterior Stan gets into.

    If this all weren’t the case, we’d never get anywhere at all with sampling.
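
    Here’s a tiny R illustration of that, with a made-up 26-parameter “posterior” and only 25 draws: a function that pools all the parameters (their average) varies far less across draws than any single parameter does.

      set.seed(1)
      draws <- matrix(rnorm(25 * 26), nrow = 25)  # 25 draws of a 26-parameter vector
      sd(draws[, 1])       # a single parameter: spread on the order of 1
      sd(rowMeans(draws))  # the average of all 26: spread near 1 / sqrt(26)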

  7. I think this post highlights the difference between the Riemann approximation and Monte Carlo methods. For the Riemann approximation, we are required to “fully explore the typical set” and therefore cannot overcome the dimension curse. For Monte Carlo methods, the root n convergence rate is independent of the dimensions.

    That said, I think the description of that 20-dimensional normal example is not entirely fair: it is equivalent to a uniform distribution over that 2^20-cell discretized space. Sure, Riemann approximation will be dead in a high-dimensional uniform. But in practice it can still work if there are some sparsity conditions. For instance, if I change that 20-d N(0,1) to a 20-d N(3,1), then nearly all samples are expected to sit in just 1 cell of the 2^20 possible regions. It is that sparse.
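
    (A quick R check of that claim, with an arbitrary seed: draws from a 20-d N(3,1) overwhelmingly land in the single all-positive orthant.)

      set.seed(1)
      draws <- matrix(rnorm(10000 * 20, mean = 3), nrow = 10000)
      mean(apply(draws > 0, 1, all))  # roughly (1 - pnorm(-3))^20, about 0.97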

    • I agree, this illustrates how grid methods may be unusable in high dimensions while sampling methods can still be used to “explore” the space. I wouldn’t say that “exploration” requires that every point in a grid is visited, in geographical terms that would be more like “surveying”.

      Explore: travel through (an unfamiliar area) in order to learn about it

      Survey: examine and record the area and features of (an area of land) so as to construct a map, plan, or description

    • I was just using orthants for convenient conceptualization and computation. The same argument works if you translate, scale, and rotate the posterior. The axes just change position and orientation. And it still works if the posterior’s banana shaped, or whatever. You just can’t explore both ends of the range of each parameter in every dimension.

  8. Can you calculate theoretical indicators, for instance the real log canonical threshold?
    Are MCMC methods not really working?

    I think it is well known that there is a limit to the range (or set) that can be reached by finite sampling, since it is difficult for MCMC methods (except for replica-exchange MCMC) to calculate marginal likelihoods, the evidence, or the free energy.

    • Nh:

      MCMC works just fine. As Bob noted, there’s only a finite amount you can learn from a finite number of MCMC samples, but then again there’s only a finite amount you can learn from a finite number of straight Monte Carlo samples too.

      • Andrew:

        What do you mean by “MCMC works just fine”?
        I think that MCMC can approximate the posterior distribution only if it is working well.
        To my knowledge, methods that approximate or characterize a distribution by sampling always suffer from the finite sample size.
        (That is why mathematical/theoretical research and methods are needed.)
        Of course, both MC and MCMC face this problem.

        The accuracy of the posterior distribution that MCMC characterizes can be found by comparing it with the theoretical value of the (expected) generalization error (KL divergence between the source of data and the predicted distribution). A real log canonical threshold is the leading term of its theoretical index, when the posterior or likelihood cannot be approximated by any (multivar.) normal distribution.

        • It works just fine in that when there’s geometric ergodicity, you get an MCMC central limit theorem which gives us a way to estimate standard errors from (MC)MC estimates. Certainly we need more research to see when model + sampler is geometrically ergodic, which is why in practice, we need so many diagnostics.

  9. What are the implications for decision analysis ?

    An example : In my field (medicine), all the rage is on “efficiency studies”, where one tries to estimate the relative efficiency (a. k. a. cost/efficacy) of an “experimental” technique (drug, intervention, surgery, whatever…) over a “reference” (a. k. a. “control”) technique by computing the QALY (“Quality-Adjusted Life Years” : don’t get me started on this…) observed in each group as well as the costs (translated in monetary units by “valorisation” : again, a can of worms I won’t dive into…) and computing an average (difference of costs)/(difference in QALYs). The data should, ideally, come from a prospective, *randomized* study.

    I have several problems with this approach, one of them being that I’m not convinced that the ratio (which *has* a distribution) has a mean, nor that this mean, if it exists, can be estimated with any known degree of precision. I thought that a better solution would be to:

    * estimate the *joint* distributions of efficacies and costs in each group, conditional to the covariates ;

    * sample, for some given value of the covariates (sampled from some independent estimate of their distributions in the population of interest), (cost in experimental group – cost in control group)/(QALYs in experimental group – QALYs in control group), and

    * conduct a decision analysis based on the sampled distribution of this ratio, using some (ideally prespecified) loss function.

    What are the implications of the combinatorial explosion you underscore for the validity of the decision analysis ?

    • Emmanuel:

      Yes, ratios can be difficult to interpret. See discussion here from a few years ago, where we discuss problems with the incremental cost-effectiveness ratio. Also the LD50 example in chapter 3 of BDA, where again we talk about how the ratio is not directly interpretable if the denominator can change sign.

    • You need to look at the quantity of interest you’re trying to calculate. In most cases in Bayesian analysis, it’s a posterior expectation of some sort that you can calculate. But sometimes that expectation won’t exist. Your example of the ratio (more generally “quotient”) is just such a case. Even if the two quantities being divided are well-behaved and have normal distributions, their quotient will have a Cauchy distribution, which presents difficulties for stable estimation (like samples having unbounded variance!).

      • Dear Bob,

        AFAICT, the Cauchy case happens when the central parameters of both distributions are zero. In other cases, the distribution of the quotient is more involved; the literature (see below) gives some results, at least for the normal/normal case (correlated or not).

        I have been unable to convince myself that those monstrosities have a mean, or even a characteristic function (or Fourier transform), but my abilities in analysis are at least 30 years old…

        A bit of numerical experimentation leads me to think that the ratio of two normals /may/ have a mean if the denominator’s mean is “large enough” (but, interestingly, not to the point where sampling “negative” points is negligible). I do not (yet) understand why.

        A few relevant papers (that I *won’t* pretend to have mastered…) :

        1. Geary RC. The frequency distribution of the quotient of two normal variates. Journal of the Royal Statistical Society. 1930;93(3):442–446.
        2. Curtiss JH. On the distribution of the quotient of two chance variables. The Annals of Mathematical Statistics. 1941;12(4):409–421.
        3. Creasy MA. Limits for the Ratio of Means. Journal of the Royal Statistical Society Series B (Methodological). 1954;16(2):186‑94.
        4. Marsaglia G. Ratios of normal variables and ratios of sums of uniform variables. Journal of the American Statistical Association. 1965;60(309):193–204.
        5. Hinkley DV. On the ratio of two correlated normal random variables. Biometrika. 1969;56(3):635–639.
        6. Hayya J, Armstrong D, Gressis N. A note on the ratio of two normally distributed variables. Management Science. 1975;21(11):1338–1341.
        7. Katz D, Baptista J, Azen SP, Pike MC. Obtaining confidence intervals for the risk ratio in cohort studies. Biometrics. 1978;469–474.
        8. Pham-Gia T, Turkkan N, Marchand E. Density of the ratio of two normal random variables and applications. Communications in Statistics—Theory and Methods. 2006;35(9):1569–1591.

        • Suppose you have a ~ normal(0,1) and b ~ normal(1e6,1). There is a nonzero probability that b is arbitrarily close to 0. But the probability is so small that you could never in a million years of sampling get that value… More to the point, the deep tails of the normal(1e6,1) distribution are not in any sense part of the *actual model*. You would never in a million years of sampling distinguish whether I sampled normal(1e6,1) or just truncated it to the range “b > 10”. And the version with b > 10 is very obviously well behaved…

          So while the question is interesting in a pure math sense, I’d say in a mathematical modeling of the real world sense when 0 is deep in the tail, you should imagine that your distribution is truncated away from 0 anyway since you don’t “truly believe” in the deep tail behavior anyway.

        • Emmanuel:

          The ratio of two normals has no expectation. But, as I discuss in my post and book section mentioned above, it’s not at all clear when or why we’d ever be interested in the ratio of two random variables where the denominator is of uncertain sign. I think it’s a math problem untethered from applied relevance. The scholarly literature is full of this sort of thing!

        • Agreed that it’s not a hugely important applied problem, but because I haven’t done any hard math in a while I’m investigating the question of whether the ratio of two normals has an expectation or not when the denominator has a positive mean far from zero (it’s sufficient to fix the sd = 1 for my interests).

          Do you have an actual reference for the statement “The ratio of two normals has no expectation”, or is that meant only for the Cauchy normal(0,1)/normal(0,1) (where it’s definitely true)?

          Numerical investigations of an ugly symbolic expression arrived at using Maxima indicate that for normal(20,1)/normal(10,1) the distribution seems to be OK, and has a mean of about 2.0206, for example.
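
          A complementary brute-force check in R (numbers arbitrary): naive simulation gives a similarly stable-looking value, precisely because the draws never get anywhere near a zero denominator, even if the expectation doesn’t exist in the strict sense.

            # Naive simulation of normal(20,1) / normal(10,1): the sample mean
            # comes out near 2.02 because no draw comes close to a zero
            # denominator, whatever the formal expectation does.
            set.seed(1)
            M <- 1e6
            ratio <- rnorm(M, 20, 1) / rnorm(M, 10, 1)
            mean(ratio)
            sd(ratio) / sqrt(M)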

        • Example of how this could arise in actual practice. You have some positive values a and b which you can measure with noise, and you are interested in their ratio.

          you have a system that measures the two of them and calculates the ratio, and then transmits it to you. You want to do inference on the ratio.

          The underlying quantities may well be understood to be say both positive, but the *noise* is annoying.

        • Daniel:

          It’s just math. If there’s a nonzero probability that the denominator can be arbitrarily close to zero, then the expectation integral does not exist. From a purely mathematical point of view, it doesn’t matter if the denominator is normal(0, 1) or normal(1, 1) or normal(10, 1) or whatever.

          Look at it this way: suppose the denominator has a distribution that is far from zero so there’s only a 1e-10 chance that it can be in the range, uhhh, I dunno, (-1e-10, +1e-10). Then, for that little window, the denominator looks something like a uniform in that tiny range, i.e. it looks just like the unit normal, albeit scaled by 1e-20. But what’s a factor of 10^20 when infinity is concerned!

          In practice, in that example a simulation would never reach the zero zone so this would not come up; there’d be no computational instability. But, mathematically speaking, there’s no expectation.

          But who cares about the expectation in this case anyway? Ultimately there should be some applied goal, and E(X/Y) won’t answer it.

          I became aware of this marginal cost-effectiveness ratio about 20 years ago and at the time was thinking of writing a paper to explain the problem. Maybe I still should . . .

        • I’m not convinced yet. We both agree that the cauchy has a density. And that it has no expectation integral. The reason it has no expectation integral is that the density doesn’t fall off fast enough as r goes to infinity. That doesn’t necessarily have to be true for the asymmetrical case. I mean, it isn’t obvious to me.

          As far as who cares, well, sometimes I just like to sharpen my math skills, and this is a case where I could do that. I agree with you in applied settings it seems like a generally bad idea…

          But I’ve also set up situations where I might have two parameters and it’s the ratio of these parameters that determines the behavior of some function. I’d like to know if I’m accidentally creating an ill-posed model that only works because sampling doesn’t get deep into the tails. So I think the question is something worth knowing the answer to.

          Incidentally, doing some work with maxima gives me numerical answers that contradict the idea that there’s no expectation. But I’m still working on it ;-)

        • Daniel:

          Forget about the Cauchy. Just consider Y/X, where X and Y are independent, and there are some positive values of delta and epsilon so that Pr(|X| < delta) = epsilon, and X has an approximate uniform distribution in the range (-delta, delta). Then E(Y/X) will be undefined. After all, that's what's making the Cauchy blow up, when X ~ normal(0, 1). All the rest of the distribution is bounded so is irrelevant for the purpose of determining whether E(Y/X) exists.

        • Andrew,

          > “I became aware of this marginal cost-effectiveness ratio about 20 years ago and at the time was thinking of writing a paper to explain the problem. Maybe I still should . . .”

          You **definitely** should. This “incremental cost-effectiveness ratio” is used by health authorities all around the world to make decisions about the use of new health technologies. It is even compared to some informal standard, such as “don’t use if ICER > $30,000/QALY”…

          Demonstrating the problems tied to this use with

          * enough mathematical rigor as to be indisputable, **AND**

          * in a language simple enough for decision makers (who often have to take off their shoes in order to count beyond 10…)

          is something that should be done “in the public interest”. It is also quite hard to do (I was struggling with the “rigor” aspect…).

          The problem is compounded by the fact that most health technology “advances” are small, and hard to “demonstrate” by NHST (a problem you have yourself demonstrated, IIRC…).

        • Emmanuel:

          I know I have forgotten a lot about this, but I recall the ICER being supplanted by Net Monetary Benefit: “The use of NMB scales both health outcomes and use of resources to costs, with the result that comparisons [can be made] without the use of ratios (such as in the ICER).” https://www.yhec.co.uk/glossary/net-monetary-benefit/

          But specifying a more sensible parameter of interest is just the start of the real challenges.

          From my unpublished 2008/9 poster:
          Health care economic evaluations necessarily involve bivariate functions of costs and effects and ideally the use of bivariate rather than univariate models. We anticipate that there are five key components that need to be resolved for any perceptive modeling of costs and benefits using non-randomized data. 1. Assuming the “least wrong” bivariate distribution? 2. Specifying, “least inaccurately”, the systematic effects of covariates? 3. Getting “least confounded” comparisons? 4. Choosing amongst the “least invalid” and “least uninteresting” average treatment effectiveness (ATE) estimates? 5. Displaying all the components of the modeling “least opaquely”?

          I could send you the old poster but maybe the references will suffice:
          O’Hagan A and Stevens JW. A framework for cost-effectiveness analysis from clinical trial data. Health Econ 2001
          Nixon RM and Thompson SG. Methods for incorporating covariate adjustment, subgroup analysis and between-centre differences in cost-effectiveness evaluations. Health Econ 2005
          Berger JO, Liseo B and Wolpert R. Integrated Likelihood Methods for Eliminating Nuisance Parameters. Statistical Science 1999
          Rubin DB. Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies JASA 1979
          Imbens G. Better LATE Than Nothing: Some Comments on Deaton (2009) and Heckman and Urzua (2009). http://www.economics.harvard.edu/faculty/imbens/files/bltn_09apr10.pdf

        • I think the issue with incremental cost effectiveness is really interesting but overnight I’ve been thinking about the ratio of normals issue… so I’ll post that at the bottom of the page.

          Meantime, Emmanuel Charpentier, if you email me I’d be happy to hear more about the issue, how it’s used, and so forth, and maybe give some ideas about how to compare it to the measure Keith mentions, and/or alternative measures. I’m very much in favor of Keith’s point about the cost and benefit being bivariate, and we should quantify them separately, but ultimately when it comes to decision making we need a univariate utility, and how to construct one is an interesting question that really needs further explication to decision makers, I think.

        • Also, considering the delta(Cost)/delta(QALY) issue… Suppose we *know* that delta(QALY) is positive for some reason, but we don’t have a good characterization of its exact size. Suppose that our real object of interest is delta(QALY_AVG) where QALY_AVG is the average QALY in a large population. We arrive at this from some Bayesian analysis. The average is of course a sum of many individual contributions, each of which may be quite non-normal, but summed over say 10,000 or 1M people it is very reasonable to model our posterior distribution for delta(QALY) as

          normal(QM,sd) truncated to [0.001,inf]

          because we know it’s positive somehow. Furthermore suppose QM/sd is not enormous, like say 8 or so (we think the average delta(QALY) is 8 years +- 1). The fact that the distribution is truncated makes a difference of about 6e-16 in the normalization constant, which is for all purposes machine epsilon, so there’s no calculable difference in the normal density.

          What kind of distribution does the delta(cost)/delta(QALY) have, and how does its MCMC sample average relate to anything? Note how if it “truly” has a normal distribution, the sample average doesn’t relate to a “true” average because there doesn’t exist a true average as Andrew has managed to convince me.

        • “I know I have forgotten a lot about this but I recall ICER being supplanted by Net Monetary Benefit”

          Correct me if I’m wrong, but pharma has been operating on a primitive criterion for rather a long time: “if my stuff saves your life, you pay me your NPV of future earnings”. Pharma has been chasing Orphan Drug status for at least a decade, seeking extended economic rents. IOW, real decisions are being made with a hammer and anvil.

      • > It’s not clear to me why you’d compute (delta Cost) / (delta QALY), [ … ]

        That’s not clear to me either, but this “incremental cost-effectiveness ratio” IS the Holy Grail of medico-economics nowadays. Notwithstanding the fact that it behaves like the Holy Hand Grenade of Antioch…

        It has a point, however: a differential indicator is usually to be preferred to a difference of indicators when there are many sources of variation (in other words, it’s easier to know whether A is “better” than B than to know whether either one is “good”).

        • Well, lots of people are confused about mathematical modeling and stats… so it could just be crap. Suppose delta(cost) = 0: then this ratio is entirely insensitive to improvements in QALY… You could triple people’s lifespan at zero cost and it wouldn’t detect anything.

          as far as I’m concerned, total bogusness

      • (delta Cost)/(delta QALY) is used because QALYs are only unique up to increasing linear transformations. Comparing u(x)/Cost(x) and u(y)/Cost(y) is meaningless because the preferences represented by the utility function u() are also represented by any increasing linear transformation of u(), say v(), but the differences (and order) of the u()/cost() ratios are not preserved by the v()/cost() ratios.

        • I don’t follow, sure you can transform the QALY by multiplying by a positive number (increasing linear transformation) and still have a meaningful quantity, but as long as you’re transforming both QALY values equivalently, this doesn’t matter:

          C*Q(a)/cost(a) > C*Q(b)/cost(b) implies Q(a)/cost(a) > Q(b)/cost(b)

        • Yes, multiplying by a positive number does not change anything. But v(x) = a + b*u(x), where b > 0, is an increasing linear transformation of u(). v() and u() represent the same preferences, but u(x)/cost(x) > u(y)/cost(y) does not, in general, imply v(x)/cost(x) > v(y)/cost(y) if a != 0. QALYs are measured on an interval scale, not a ratio scale.

          “also delta(cost)/delta(QALY) is not even a well defined number since it’s possible to have delta(QALY) = 0, same for delta(cost) so you can’t even work with delta(QALY)/delta(cost)”

          Are you saying that no ratio, whose denominator can be zero, can be of any use? It is not hard just to check what the costs are if delta(qaly)=0.

          “cost is an always positive quantity, so QALY/cost is a well defined number.”

          No, qaly/cost is not well defined, in the sense that comparing such ratios is meaningless. (Also, I am not sure that cost is always a positive quantity. I am no expert in cost-utility analysis, so I don’t know if there are any practical examples of non-positive costs, but I think that theoretically costs could be 0 (for example, for no treatment) or negative (if the treatment option has positive societal side effects).)

        • also delta(cost)/delta(QALY) is not even a well defined number since it’s possible to have delta(QALY) = 0, same for delta(cost) so you can’t even work with delta(QALY)/delta(cost)

          cost is an always positive quantity, so QALY/cost is a well defined number.

  10. I was curious to see what was the origin of the concept of exploring the posterior distribution and I searched some variants in Google Scholar (“posterior distribution exploration”, “explore the posterior distribution”, “exploring the posterior distribution”, “exploration of the posterior distribution”, “posterior distribution is explored”, “exploring posterior distributions”).

    It seems that, leaving aside a couple of earlier occurrences in different contexts, it was Luke Tierney who introduced the term in the literature with his “Markov Chains for Exploring Posterior Distributions” paper (published in 1994 but circulating as a technical report since 1991): https://projecteuclid.org/euclid.aos/1176325750

  11. “Even so, MCMC isn’t magic—those two or three decimal places will be zeroes for tail probabilities.”

    Will they be less than .05? I kid, I kid.

  12. I think you’re not thinking of “exploring the posterior distribution” in the way that is intended (by me, at least, when I use the phrase).

    Let’s look at an “easy” example. We have a distribution with density f(x), and are interested in estimating the expectation of a(x), where a(x) ranges over [0,1], with a moderately small absolute error tolerance – say, 0.01. Given the 1/sqrt(n) convergence rate for Monte Carlo estimates, we need about 10,000 points if we draw them independently from f(x). This is true regardless of how high-dimensional x is, or how complex f(x) is. The distribution could, for example, have 100,000 modes, scattered around a huge, high-dimensional space. But as long as we can draw points independently from this complex distribution, 10,000 points will be enough, despite the fact that we’ll “visit” less than 10% of the modes.

    Now, what about MCMC? 10,000 points won’t be enough if they are highly correlated. Technically, this means if a(x) for these points is highly correlated, but in practice, unless a(x) has some known special structure, to ensure lack of correlation for a(x) we need to limit the dependence between the points x themselves. We need, for example, to be sure that if there are 100,000 modes, with equal masses, we aren’t looking at points from a run in which, given our initial state, some modes had a much higher probability of being visited than others. That is, we need a run that has “explored” the posterior distribution. But if we have explored the distribution, so there is low dependence between points, and hence all modes have nearly equal probability of being visited, then we’re fine – even though most modes will in fact not be visited. This is because when projected to the one-dimensional, bounded range of a(x), these modes become highly overlapped, and it doesn’t matter which of the many similar modes one sees. (Actually, more complex scenarios are possible, but they all work out fine if a(x) is in [0,1] and we have an absolute error tolerance of 0.01.)
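
    A toy R version of the kind of experiment described here, with all the numbers made up: independent draws from an equal-weight mixture of 100,000 well-separated modes, a bounded function a(x), and far fewer than 10% of the modes ever visited.

      set.seed(1)
      K <- 1e5                      # number of well-separated modes, equal weights
      n <- 1e4                      # number of independent draws
      centers <- runif(K, 0, 1e6)   # scattered mode locations
      k <- sample.int(K, n, replace = TRUE)
      x <- rnorm(n, mean = centers[k], sd = 1)

      a <- function(x) plogis(x / 1e5)   # some function with range in [0, 1]
      mean(a(x))                         # Monte Carlo estimate of E[a(x)]
      sd(a(x)) / sqrt(n)                 # standard error well below 0.01
      length(unique(k)) / K              # fraction of modes visited: under 10%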

    • It’s just that subtlety in the use of “explore” that I was trying to get at. That is, how our (MC)MC estimates can converge without getting near most of the posterior probability mass with any of their draws.

      I sort of like the term “surveying” that Carlos Ungil uses above, but that’s far too loaded a term to use in a statistics context.

      The multimodal case is a great thought experiment. With independent draws, the CLT is enough to give us an estimator error scale of sd(Y) / sqrt(N) where sd(Y) is the standard deviation of the variable Y being estimated (for event probabilities, Y is an indicator function). The thought experiment I always see with quadrature (that also applies to simulation-based methods) is of a lake that is relatively shallow except for a big hole in one spot that has half the lake’s water. So as not to step on anyone’s copyright toes, here’s a picture: the water is between the dashed lines, and there’s a boat on top of the water dropping a plumb line to measure depth.

        \---/.
      -------.----------------------------------
             .
      ---------------------   ------------------
                           | |
                           | |
                           | |
                           | |
                           | |
                           | |
                           | |
                           | |
                           | |
                           | |
                           | |
                           | |
                           | |
                           | |
                            -
      

      If Y is depth of lake, this just indicates sd(Y) is high. It doesn’t change the rate of convergence, right?

      As Andrew pointed out, the Markov chain part of MCMC was a red herring. Sorry about that. Maybe another post on geometric ergodicity and MCMC CLTs soon! I’m still trying to get my head around the material and how to present it.

      • Well, I don’t think this situation meets the proper requirements for usual convergence theorems for many things you might want to calculate (like for example if you’re calculating pollutant concentrations). This is because “half the water” is in the one tiny hole. Most of the convergence theorems require that things don’t depend *too much* on any one value.

        • The CLT just requires finite mean and variance and i.i.d. sampling. So this case qualifies, since everything’s finite. Of course, in practice, you won’t be able to estimate the variance of the sampler reliably, so you’ll probably wind up underestimating the standard error and the amount of water in the lake.
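
          A toy R version of the lake, with made-up numbers (a hole covering 0.1% of the surface holds half the water): the CLT applies since everything is finite, but a small sample that misses the hole understates both the depth and its own standard error.

            set.seed(1)
            depth <- function(u) ifelse(u < 0.001, 1000, 1)  # deep hole on 0.1% of the lake
            u <- runif(100)                                  # 100 i.i.d. depth soundings
            mean(depth(u))             # usually about 1, though the true mean is about 2
            sd(depth(u)) / sqrt(100)   # usually about 0, badly understating the error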

      • So this is the first post I have commented on (the first reply above is mine).

        I am now working on my BSc thesis, and in one chapter I explore MCMC methods. So I am really interested in “getting this right”.

        I have read some of Radford Neal’s works, BDA3, some of Michael Betancourt’s preprints, other texts, and now this post and its replies… Based on all of it, I think that something along the lines of “exploring/surveying” the posterior enough to compute “expectations/inferential summaries” is what we’re actually trying to do with MCMC. Would you agree with that?

        Thanks a lot for the discussion, for someone like me that is starting wandering down this road, it’s really helpful!!

  13. > I sort of like the term “surveying” that Carlos Ungil uses above, but that’s far too loaded a term to use in a statistics context.

    How about the term “sampling” :P

  14. Subsequent to the discussion of the ratio of normal variables above:

    https://statmodeling.stat.columbia.edu/2019/03/25/mcmc-does-not-explore-posterior/#comment-1003114

    I think I’m pretty well convinced that if the density of the denominator is nonzero at 0 then the ratio doesn’t have an expected value. But there are *many* applications where we have very normal noise added to a signal, the whole thing is strictly positive, and the ratio is the quantity of interest.

    Here’s an example of the kind of thing I’m thinking of:

    Suppose you have a camera with a grid of 100×100 pixels, and two color filters. The camera is attached to a microscope, and it helps you quantify the red vs green fluorescence in a given mounted slide. The camera has a gain, and a bias, and uses 8bit A/D converters to give pixel values. When a slide is placed under the microscope it takes red and green images, sums up the value of the pixel brightness in each of the 10,000 pixels to get an R and G value, and then calculates R/G and reads it out on a display.

    R/G has the advantage that it is scale free, it doesn’t matter what the gain is or whatever, which is especially important because we don’t know what any given value means relative to some precisely calibrated brightness (like 1 watt/m^2 or something). We don’t have a precisely calibrated reference brightness, nor do we actually care. The actual quantity of interest is the ratio (ie. how many times brighter is the Red channel compared to the Green?)

    Some facts about an apparatus like this:

    1) We assume it can never actually read 0: it measures the sum of quantities between 0 and 255, using a 64-bit accumulator that is guaranteed not to overflow or wrap around. It adds 1 so the quantity is strictly positive, and then calculates the ratio using 64-bit floating point. We are happy to approximate 64-bit floats as real numbers. ;-)

    2) There are 10,000 of these pixels and they have some kind of noisy readout which is very well approximated by independent and identically distributed random variables that are each strictly bounded between 1 and 255. As such it has *excellent* normality properties. Let’s say the noise (what you get when you just take a picture of a blank slide) appears completely linear on a q-q plot out to 7 standard deviations.

    3) The designer of the apparatus was unfortunately told to give a readout of the ratio, so the only thing we get out of this thing is some kind of R/G value rather than the individual R and G values.

    4) The R and G signals are large relative to the mean value of the noise. So for example suppose the noise is well characterized as normal(10,1) (maybe even set there by a calibration step) and typical R and G values are of the order of say 100, so that the quantity we get displayed is actually (R-10)/(G-10) which we can say is something like Normal(100*Q,1) / Normal(100,1) and the quantity of interest is Q the ratio of the TRUE mean R to the TRUE mean G. Furthermore if G-10 < 1 the machine simply remeasures until it isn't. So the problem of our Normal(100,1) variable getting to be near zero is completely eliminated.

    The question is: what is the likelihood function p(Measurement | Rtrue,Gtrue) under something like these assumptions?

    Finally, suppose that the machine takes N of these (R-10)/(G-10) values and reports only the sample average? Clearly, if it weren't for the truncation of the distribution of R-10 and G-10 at 1, *the average of this quantity wouldn't exist* because there would be some *tiny* probability of getting infinity. However, because 1 is *very far* from the typical signal size of 100 on the scale of the noise sd=1 the truncation of this distribution causes an adjustment to the density function which is *incalculable in our machine precision* and therefore we *will* calculate with the unaltered normal density.

    Finally, we imagine that the images get dimmer and dimmer, so that the signal approaches pure noise from the left Normal(x,1)/Normal(y,1) for x and y both positive but between say 1 and 10. In this regime, how well does the instrument measure when it takes a sample average of N samples for N from say 1 to 10?
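
    For anyone who wants to play with this, here is roughly the simulation I have in mind, sketched in R with all the numbers made up as above (num and den stand for R - 10 and G - 10):

      readout <- function(r_sig, g_sig, N) {
        num <- rnorm(N, r_sig, 1)             # R - 10: signal plus unit noise
        den <- rnorm(N, g_sig, 1)             # G - 10
        while (any(den < 1)) {                # machine remeasures until G - 10 >= 1
          bad <- den < 1
          den[bad] <- rnorm(sum(bad), g_sig, 1)
        }
        mean(num / den)                       # the machine reports only this average
      }

      set.seed(1)
      readout(200, 100, 10)  # bright signal: true ratio 2, readout lands close to 2
      readout(4, 2, 10)      # dim signal: true ratio 2, readouts scatter widely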
