Boostrapping your posterior

Posted on August 9, 2016 9:05 AM by Andrew

Demetri Spanos writes:

I bumped into your paper with John Carlin, Beyond Power Calculations, and encountered your concept of the hypothetical replication of the point estimate. In my own work I have used a similarly structured (but for technical reasons, differently motivated) concept which I have informally been calling the “consensus posterior.”

Specifically, supposing a prior distribution for the true effect size D, I observe the experimental data and compute the posterior belief over values of D. Then I consider the “replication thought experiment” as follows:

1. Assuming my *posterior* as the distribution for true value of D …

2. Consider the posterior distribution that another researcher would obtain if they performed the identical experiment, assuming they hold the same prior but that the true distribution of D is as my posterior distribution

3. I then get a distribution over posterior distributions that I think of as the set of beliefs other researchers might hold, given they start from the same assumptions and have the same experimental capability that I do. Then I can calculate various point values from this “consensus” distribution. As I’m sure is clear to you, the consensus distribution is always much wider and less spiky than even the semi-smoothed distributions arising from a weakly-informative prior.

4. In my field (automated control systems, distributed computing, sensor fusion, and some elements of machine learning), I have found this provides a much more stable and well-conditioned signal for triggering automated sensor-driven behaviors than conventional techniques (even conventional Bayesian techniques). We are especially sensitive in our work to multimodal distributions and especially relative magnitudes of local peaks (because we often want to trigger a complex response or set of responses, not just get one average-case point estimate).

5. perhaps the most important

Now, having just read your paper, the variant I describe above seems “obvious,” so I wonder if you have thought on this subject before and could point me to any additional resources. Or, perhaps, if you see some fatal flaw that I have missed I would appreciate being told about that as well. “Works in the field” doesn’t necessarily mean “sound in principle,” and I would like to know if I am treading somewhere dangerous.

My reply: This could work, but much depends on how precise is your posterior distribution. If your data and your prior are weak, you could still have problems with your distribution being under-regularized (that is, being too far from zero).

8 thoughts on “Boostrapping your posterior”

Tom Passin on August 9, 2016 11:45 AM at 11:45 am said:

It seems to me that this is essentially a way of smoothing your posterior distribution. You could have smoothed it using, say, LOWESS, or smoothing splines, or some other method. However, you are smoothing it by basically simulating experimental runs. In the end, your smoothed distribution will have the same problems – if there are any – as the X-smoothed one (where X might be LOWESS, for example). E.g., if there is a spurious bulge in your real measured distribution, it will be there (though smoothed) in the final result.

Would you explain why you find your procedure better than just smoothing the experimentally-derived distribution? Is it just that there are too few points in the latter, or something else?

Reply ↓
- Daniel Lakeland on August 9, 2016 1:10 PM at 1:10 pm said:
  
  the concern here seems to be that the model produces too-precise estimates of the parameters. That would seem to be the case when the assumptions in the likelihood are not being met, specifically when your data collection process collects data that isn’t from the full range of the possibilities, or isn’t independent. It seems like you could incorporate that knowledge in the likelihood. This would give you the regularization you want, and still maintain a consistent logical basis for the model (ie. still satisfy Cox axioms).
  
  For example, suppose you’re collecting data from a group of objects where you believe that your data collection process has some bias, and has some selection effects that reduce the variation. in essence maybe your data are sort of
  
  m ~ normal(mu,sigma)
  
  But instead of collecting from that full range, you’re collecting from a subset that is sort of
  
  x ~ normal(mu+b,k*sigma) where k in [0,1]
  
  so, specify your prior for mu, sigma, specify a prior for b the bias, and k the reduction of the spread, and then fit your model to infer something about the full population parameters mu,sigma
  
  I think similarly, you can do something to relax the independence. Suppose for example that you’re making measurements but you know that the measurement instrument occasionally measures not some new object but one of the old objects you’ve already seen (or a new object, but the measurement is highly related to some existing measurement, you just don’t know which one). Suppose you know about how often this occurs. let’s say with frequency f.
  
  If there are N measurements, and about Nf of them could be predicted nearly exactly if you just knew which of the other measurements to compare it to, you could work out some probabilistic calculations on this assumptions. I suspect, though I haven’t worked it out, that for a normal model, you might get away with forming the independent likelihood, and then raising it to a power N(1-f)/N = 1-f
  
  Another thing you might get away with is simply assigning Nf of these data at-random a flag “not independent” and then construct a likelihood based on just the independent ones. Do this a largish number of times, and do inference on each subset, combining the inferences.
  
  Things of this basic nature, that is, modifying the likelihood to take into account imperfections of the data collection process that lead to excess precision are under-utilized in typical practice.
  
  Reply ↓
  - Daniel Lakeland on August 9, 2016 3:02 PM at 3:02 pm said:
    
    Thinking about the math here, suppose p(D1,D2,D3,…,Dk | Dk+1,Dk+2,…,Dn,Model,Params) = 1 for some appropriate set of orderings (that is, you could predict k of the data values perfectly if you knew the other data values). Then you can write your likelihood in terms of the Dk+1…Dn and the uncertainty in the ordering… and you could approximate the likelihood in terms of an average across some randomly selected subset of the orderings…. and you’ll probably wind up with something along the lines of a bootstrapping of the likelihood by sub-sampling the data. I’m a bit distracted by some other stuff going on to work out the math, but I hope the direction to go in seems clear.
    
    If you can’t predict the k data values perfectly, you can still potentially use a tightly constrained distribution of some sort. The specifics depend on what the non-independence you’re expecting looks like. For example, you might be able to do a regression on the k…N values, and then predict the D1–Dk values as very tightly constrained by that regression or something
    
    Reply ↓
Alex on August 9, 2016 12:00 PM at 12:00 pm said:

Efron has a couple related papers on bootstrapping Bayes estimates:

Bayesian inference and the parametric bootstrap: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3703677/
Frequentist accuracy of Bayesian estimates: http://onlinelibrary.wiley.com/doi/10.1111/rssb.12080/full

Reply ↓
- Andrew on August 9, 2016 2:56 PM at 2:56 pm said:
  
  Alex:
  
  Regarding Efron and bootstrap, see this paper by Aki and me, Bootstrap averaging: Examples where it works and where it doesn’t work.
  
  Reply ↓
Juho Kokkala on August 9, 2016 2:37 PM at 2:37 pm said:

Funny observation: the process described in the post results in the same distribution for the parameter, as Bayesian inference with the likelihood changed so that we are the fictitious researcher (that is, pretend our observed D is not from the original model, but a replication simulated in this manner). So, this is interpretable as a modification of the likelihood, although a rather strange one.

Proof sketch: write the joint distribution of everything as a “chain” p(theta)p(D|theta)p(theta_samplefromposterior | D) p(D_rep | theta_samplefromposterior) p(theta_samplefromconsensus | D_rep). Reverse the direction of the chain. As the joint distribution of every consecutive pair of a parameter and data is the same, there is a symmetry implying p(theta_samplefromconsensus | D) (as a function of both the event and the condition) looks exactly like p(theta | D_rep).

Reply ↓
- Corey Yanofsky on August 10, 2016 8:53 AM at 8:53 am said:
  
  Shorter proof sketch: the expanded model is a stationary Gibbs sampler; any two pairs of random variables separated by the same number of steps have the same joint distribution.
  
  Reply ↓
Fernando on August 10, 2016 1:40 PM at 1:40 pm said:

Interpreting this as replication, the critical assumption is about the use of identical procedures. At this point we are just talking about statistical replication.

However, most replication debates are about questions re skill, procedures, methods being same or not. This requires model expansion, so we model observer, and procedural variation, alongside the model of Nature we care about.

My attempt at this here http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2496670

Reply ↓

Statistical Modeling, Causal Inference, and Social Science

Boostrapping your posterior

8 thoughts on “Boostrapping your posterior”

Leave a Reply Cancel reply