By default we use central posterior intervals. For example, the central 95% interval is the (2.5%, 97.5%) quantiles.

But sometimes the central interval doesn’t seem right. This came up recently with a coronavirus testing example, where the posterior distribution for the parameter of interest was asymmetric so that the central interval is not such a reasonable summary. See Figure 1 of this paper with Bob Carpenter.

Instead, sometimes it can make sense to use a shortest probability interval (similar to the highest posterior density interval), as discussed in this paper with Ying Liu and Tian Zheng.

The brute force approach to computing a shortest probability interval is to compute all the intervals of specified coverage and take the shortest. For example, if you have 1000 simulation draws and you want the 95% interval for a particular quantity of interest, z, you sort the 1000 simulations of z, compute 50 intervals, (z_1, z_951), (z_2, z_952), . . ., (z_50, z_1000), and you take the interval that’s the shortest.

There’s one additional twist: if the quantity of interest is bounded, you want to allow the possibility of the interval to go all the way to the boundary. So if there’s a lower bound a and an upper bound b, you also consider the intervals, (a, z_950) and (z_51, b) as candidates.

Here’s R code:

spin <- function(x, lower=NULL, upper=NULL, conf=0.95){ x <- sort(as.vector(x)) if (!is.null(lower)) { if (lower > min(x)) stop("lower bound is not lower than all the data") else x <- c(lower, x) } if (!is.null(upper)) { if (upper < max(x)) stop("upper bound is not higher than all the data") else x <- c(x, upper) } n <- length(x) gap <- round(conf*n) width <- x[(gap+1):n] - x[1:(n-gap)] index <- min(which(width==min(width))) x[c(index, index + gap)] }

This particular code is kinda hacky in that it does not account for fencepost errors, but it conveys the general idea.

It works! But there is a concern that the procedure has unnecessary Monte Carlo error, in that we’re computing the minimum width of these empirical intervals. It seems like some smoothing should improve things.

In the above-linked Liu et al. paper, we present a pretty complicated procedure to compute a shortest probability interval efficiently. It works well in the paper, but I don’t really trust it in general. I think we should have something more transparent and reliable.

The short version of this post is that it would be good to have the shortest posterior interval as an option, not for Stan’s automatic output---I think by default the central interval is fine---but as an available option when users want it because it does come up.

The longer version is that I think there’s a research opportunity to come up with a more efficient shortest posterior interval computation that’s also clean and robust.

Perhaps it could be couple with the command ’emp.hpd’ from the R package ‘TeachingDemos’, which seems to work in cases where spin() fails:

library(TeachingDemos)

set.seed(123)

data <- rbeta(10,0.5,1)

summary(data)

spin(data)

emp.hpd(data)

Fjrubio:

I would not use the emp.hpd() function as written because it does not allow the user to include boundaries. But they seem to be doing something more sensible with the discreteness. As your example shows, the spin() function fails when you give it just 10 iterations. It should be able to do something in such cases!

Indeed, sorry for not explaining this clearly. I meant coupling the two functions in order to allow for the inclusion of boundaries. This is, calculating the intervals with emp.hpd() when spin fails, or just adapting the third option in spin().

Of course, one could argue that calculating hpd intervals with small samples may not be of interest. In fact, emph.hpd also fails for, e.g., n=5. I did not find this problem with samples sizes larger than 20.

Fjrubio:

The right thing would not be to use the two functions but rather to write one function that allows for boundaries and can handle small samples using interpolation. This should not be hard to do.

I totally agree!

Technically neither function produces the HPD if the distribution is highly bimodal (or tri-modal, etc.). I have tried thinking a few times about the best approach in these cases, I have a couple of ideas, but not any that seem worth working out the code for. Also, I don’t think that I have ever had a real case with strong enough bimodality that the single interval was not the answer.

That being said, you could call emp.hpd as `emp.hpd( c(lower, x, upper) )` to do the same thing (without the checks). Or the author/maintainer of emp.hpd and the TeachingDemos package could be easily persuaded to add similar code that is in the above spin function to emp.hpd.

I have also wondered if there would ever be a practical use for the longest intervals.

Greg:

1. Regarding your first paragraph, that’s why I call it “shortest posterior interval” or spin, not “highest posterior density” or hpd.

2. In any case I think it’s worth writing a new function that does interpolation. If people are really gonna be computing 95% intervals from 20 simulation draws, then we should be returning something reasonable (or an error message saying that not enough information is available).

OK, the version of emp.hpd in TeachingDemos on github has been updated to include optional ‘lower’ and ‘upper’ arguments like your code above. I also added a check for the combination of confidence level and sample size that could cause problems and have it issue a warning then return the range of values (100% C.I.).

Thanks!

What’s wrong with the name “highest (posterior) density interval/region”?

Ah, the problem is with multi-modal distributions where a highest-density interval with a given coverage may not exist… but then maybe using an interval doesn’t make much sense anyway.

I’m mildly surprised by “We do not attempt a conclusive justification of HPD intervals here; we merely note that in the pre-simulation era such intervals were considered the standard, which suggests to us that the current preference for central intervals arises from computational reasons as much as anything else.”

I always thought the argument for HPD (in particular, against central regions) was Bayesian coherence, i.e. that when the distribution is asymmetric there can be (usually are?) pairs of points with the same probability density at opposite ends of the distribution where one is included and the other is excluded from the central distribution. I assumed that the reason for the changing preferences was decreasing levels of Bayesian purity …

Ben:

It’s hard to come up with a coherent justification for hpd because the density depends on the parameterization. The hpd or shortest posterior interval for log(theta) will not be the logarithm of the hpd or shortest posterior interval for theta.

Essentially everything than can be said about a probability density depends on the parametrization… except p-values!

Carlos:

No, the central probability interval doesn’t change with a (monotonic) transformation.

I don’t disagree: (p-values, 0.025 and 0.975 percentiles) are invariant under a continuous, monotonous transformation.

But when you say in your post that in that example “the posterior distribution for the parameter of interest was asymmetric so that the central interval is not such a reasonable summary” that asymmetry depends on the parametrization. If the HPD is problematic because it depends on the parametrization, maybe the central interval will also be problematic because it won’t be a reasonable summary for alternative parameterizations.

If “sometimes it can make sense to use a shortest probability interval” does that mean that sometimes the shortest probability interval does make less sense than the central interval? If that’s not the case, and the shortest probability interval is always preferable to (or as good as) the central interval, that would provide a justification for it.

Carlos:

Yes, it has something to do with the zero boundary as well. Another example where this came up (actually, our motivating example) was the group-level sd parameter tau in the 8 schools model in BDA.

Andrew, would you say that HPD is a better default then? I have never reported one, and in most applications would guess it would make little difference, but when/if convenient tools to compute reliably appear, would be good to consider if it’s a better default option…

I’m not sure if you mean “Yes, the shortest interval is sometimes worse” or “yes, the shortest interval is always better” (or neither).

By the way, from that covid paper: “The asymmetric posterior distribution with its hard bound at zero suggests that the usual central 95% interval will not be a good inferential summary. Instead we use highest posterior density or shortest posterior interval, for reasons discussed in Liu, Gelman, and Zheng (2015). The resulting 95% interval for π is (0, 1.8%), which is much different from the interval (1.1%, 2.0%) reported by Bendavid et al. (2020a).“

If you had used the central interval (0.1%, 1.9%), or whatever it was, instead of (0%, 1.8%) it doesn’t seem that it would have been a noticeable change. They are much closer to each other than to the interval (1.1%, 2.0%). But I agree that the HPD is more satisfactory (even though the hard bound at zero is parametrization-dependent).

This is why I don’t like SPIN or HPD. At least the central intervals don’t change under monotonic transforms.

P.S. I originally entered the above, but must not have submitted; I think I lost it when I edited Daniel Lakeland’s post to insert pre tags. Yes, WordPress is the worst.

Bob:

Yeah, Ying, Tian, and I thought a lot about this when writing our paper. On one hand, the central interval is easy to calculate and explain, and it’s invariant to monotone transformations. On the other hand, there are cases (such as inference for the prevalence parameter in the coronavirus testing example) where I think the central interval is not appropriate. The deeper issue here is that there’s no Bayesian justification for interval estimation or interval summaries at all! So it’s all about communication.

“So it’s all about communication”

I know this would never fly, but it seems like it would be a good idea to simply plot the posterior and ditch reporting intervals altogether. Intervals of any kind can still encourage dichotomous thinking. Looking at a plot might help people appreciate the uncertainty in the estimate better than do intervals of arbitrary cut points. If the idea is communication, a plot seems to communicate far more than intervals.

Actually this is almost always what I do. There are a LOT of density plots in most of my output.

If people aren’t familiar with the idea, it encourages them to ask rather than assume and this usually leads to them having a better understanding too.

Hmm, I did this for a client on a side contract job as well, and it did start a good discussion.

I’m just not sure it would fly with most journal editors and reviewers (that I deal with)… Maybe I will try that on the next paper and see what happens.

In fact, it seems like plots should be the default way to look at it and intervals should be the exception/side-info. I always look at the plots myself, but I never put them in the paper. That seems dumb.

“If people aren’t familiar with the idea, it encourages them to ask rather than assume and this usually leads to them having a better understanding too.”

Good point.

I’m teaching myself Julia these days, which I like a lot. In fact I’m working on abandoning R entirely.

Here in Julia I create a function called intlenmaker that takes some data, and then returns a function that takes a percentile and returns the interval length for the quantiles (p, .95+p) …. I then call univariate a optimization on that function to find the optimum constrained to 0,.05

Oooh… ugly. The original had indentation. and the blog is eating stuff after less than signs… yeesh…

here is the code wrapped in pre tags:

using Optim, Distributions, Random

function intlenmaker(data)

function(x)

if (x > 0.05)

return Inf

else

(a,b) = quantile(data,(x,.95+x))

return(b-a)

end

end

end

data = rand(MersenneTwister(234),Gamma(3,1.0/2),150)

optimize(intlenmaker(data),0.0,.05, Brent())

I hate you WordPress

Oh well, as long as we’re settling for ugly code, here’s a version that allows you to specify the coverage you want, and then finds the highest probability density interval of that coverage… I call it for 95% and for 75%

using Optim, Distributions, Random

function intlenmaker(data,cov)

function(x)

if (x > 0.05)

return Inf

else

(a,b) = quantile(data,(x,cov+x))

return(b-a)

end

end

end

data = rand(MersenneTwister(234),Gamma(3,1.0/2),150)

optimize(intlenmaker(data,.95),0.0,.05, Brent())

optimize(intlenmaker(data,.75),0.0,.25, Brent())

ack… bug… I left in .05 instead of replacing by cov in the if statement:

using Optim, Distributions, Random

function intlenmaker(data,cov)

function(x)

if (x > cov)

return Inf

else

(a,b) = quantile(data,(x,cov+x))

return(b-a)

end

end

end

data = rand(MersenneTwister(234),Gamma(3,1.0/2),150)

optimize(intlenmaker(data,.95),0.0,.05, Brent())

optimize(intlenmaker(data,.75),0.0,.25, Brent())

which gives the following two results:

julia> optimize(intlenmaker(data,.95),0.0,.05, Brent())

Results of Optimization Algorithm

* Algorithm: Brent’s Method

* Search Interval: [0.000000, 0.050000]

* Minimizer: 3.020134e-03

* Minimum: 2.997589e+00

* Iterations: 35

* Convergence: max(|x – x_upper|, |x – x_lower|) optimize(intlenmaker(data,.75),0.0,.25, Brent())

Results of Optimization Algorithm

* Algorithm: Brent’s Method

* Search Interval: [0.000000, 0.250000]

* Minimizer: 7.382550e-02

* Minimum: 1.636529e+00

* Iterations: 26

* Convergence: max(|x – x_upper|, |x – x_lower|) <= 2*(1.5e-08*|x|+2.2e-16): true

* Objective Function Calls: 27

haha, still a bug… It should say

if(x > 1-cov)

I retroactively added the pre tags for you. But yes, WordPress is the worst. What I hate is that it doesn’t respect characters within pre tags—you still have to use “&” for an ampersand.

Thanks Bob.

In Julia, on my older Haswell computer, the optimization ran so fast I had no reason to even think about it. So I went back and benchmarked it with the “using BenchmarkTools” stuff:

julia> @benchmark optimize(intlenmaker(data,.95),0.0,.05, Brent())

BenchmarkTools.Trial:

memory estimate: 48.58 KiB

allocs estimate: 75

————–

minimum time: 124.127 μs (0.00% GC)

median time: 144.595 μs (0.00% GC)

mean time: 161.186 μs (4.05% GC)

maximum time: 14.365 ms (98.69% GC)

————–

samples: 10000

evals/sample: 1

So it takes about 140 microseconds to figure out the hpd interval from 150 samples using optimization. The number of optimization steps was on the order of 20-30 in the case above.

If I make a much bigger problem… 1500 samples:

julia> @benchmark optimize(intlenmaker(data,.95),0.0,.05, Brent())

BenchmarkTools.Trial:

memory estimate: 499.61 KiB

allocs estimate: 87

————–

minimum time: 2.224 ms (0.00% GC)

median time: 2.315 ms (0.00% GC)

mean time: 2.426 ms (1.05% GC)

maximum time: 8.606 ms (70.65% GC)

————–

samples: 2053

evals/sample: 1

Which is still really fast.

I’m working on something with a couple of colleagues and we’ve just started using Julia for some things. The motivation was to provide a tool so the client can explore some parts of parameter space by moving a slider and seeing how the numbers change. Generating “the numbers” requires 5000 or so samples from a distribution and then calculating some summary stats on the samples; in Julia this is fast enough that if you move the slider slowly the updating is very smooth. Of course, once we’re using it for this, we’re thinking “why not use it all the time”?

Indeed. ALL … THE … TIME

It’s not just that Julia is fast (basically on the same order of magnitude of C/C++ sometimes faster sometimes slower)

Julia is a LISP dialect at heart which makes it HUGELY expressive (it uses Julia itself as its macro-language so you can fully meta-compute for example), but it’s also designed for high performance and easy interfacing to other languages, which means it plays nicely with others, unlike most implementations of say Common Lisp. And it’s got a matlabish syntax so it looks like normal computing to the non-lispers.

Literally the *only* downside to Julia at this time that I’m aware of is that it compiles things just in time and that can take a while. If you have a small program that’s supposed to execute in a few seconds or milliseconds, Julia will likely add somewhere between 1 and 30 seconds of start-up time to that depending on which libraries you bring in with “using …”

You can get around this using the PackageCompiler.jl which bundles in all those libraries in a precompiled static form, if you’re going to be re-running this “quick script” over and over, that’s the way to go.

I plan to attend Julia con which is an online remote conference this year. Hope to see you there Phil.

Daniel:

We have Stan.jl, so you can access all the power of Stan from Julia!

Yes, this is great. The entire ecosystem of Bayesian modeling tools and data processing tools in Julia is quite impressive.

I’m really rooting for Julia to become competitive with R and Python in the data science arena. It’s fast, elegant, and has some super packages like Flux (deep neural networks). I wouldn’t be able to use it in most of my engagements, though — not quite enough pieces in place to allow me the flexibility I’d have with R or Python — and I’m really wanting that to change. I really feel like Julia is in some sense the successor to R.

Please continue to share Julia successes here!

Wayne, it’s close. What are the things you think keep you from “being able to use it in most of my engagements”?

How can you make this more efficient? Sorting is O(N * log N) for N draws and everything else is O(N).

While you can’t bring the overall complexity down, you can attack the constant factor by not sorting the whole array. If you want a 95% interval, then you only need the top 10% and bottom 10% of draws which you can do by keeping two sorted stacks and updating those. It’s still O(N * log N), because 10% is a fraction of N, not a constant size, but the log N is now log 0.2 N, but it’ll be faster.

Bob:

For that matter, I don’t actually know how the quantile() function in R works. Does it do a full sort and then grab the order statistics, or does it do something more clever?

The quantile function in R uses partial sorting (at least in some cases, I did not look at all of the code to verify). Here is an example of partial sorting:

> set.seed(2020)

> x xs head(xs, 10)

[1] 1 2 3 4 5 6 9 7 8 15

> tail(xs, 10)

[1] 94 88 90 93 95 96 97 100 99 98

We can see that once the final 5th and 95th elements are sorted to the correct positions it does not bother sorting the rest. This saves some time along the lines that Bob described.