Circling back to an old Bayesian “counterexample”

Hi everyone! It’s Dan again. It’s been a moment. I’ve been having a lovely six-month-long holiday as I transition from academia to industry (translation: I don’t have a job yet, but I’ve started to look). It’s been very peaceful. But sometimes I get bored, and when I get bored and the weather is rubbish I write a blog post. I’ve got my own blog now where it’s easier to type maths, so most of the things I write about aren’t immediately appropriate for this place.

But this one might be.

It’s on an old example that long-time readers may have come across before. The setup is pretty simple:

We have a categorical covariate x with a large number of levels J. We draw a sample of N data points by first sampling a value of x from a discrete uniform distribution on [1,…,J]. Once we have that, we draw a corresponding y from a normal distribution with a mean that depends on which category of x we drew.
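To make the setup concrete, here is a minimal simulation sketch. The specific values of J, N, the category means, and the noise scale are arbitrary choices of mine for illustration, not numbers from the example.

```python
import numpy as np

rng = np.random.default_rng(0)

J = 1000                              # number of categories: assumed large
N = 100                               # sample size: much smaller than J
mu = rng.normal(0.0, 1.0, size=J)     # unknown per-category means (simulated here)
sigma = 1.0                           # noise scale (assumed known, for simplicity)

x = rng.integers(0, J, size=N)        # covariate drawn uniformly over the J levels
y = rng.normal(mu[x], sigma)          # y_i ~ Normal(mu[x_i], sigma)

print(f"{len(np.unique(x))} of {J} categories observed")
```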

Because the number of categories is very large, for a reasonably sized sample of data we will still have a lot of categories where there are no observations. This makes it impossible to estimate the conditional means for each category. But we can still estimate the overall mean of y.

Robins and Ritov (and Wasserman) queer the pitch by adding to each sample a random coin flip with a known probability (that differs for each level of x) and only reporting the value of y if that coin shows a head. This is a type of randomization that is pretty familiar in survey sampling, and the standard solution is also pretty familiar: the Horvitz-Thompson estimator is an unbiased estimator of the population mean.
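Continuing the little simulation above, a sketch of the censoring step and the Horvitz-Thompson estimator might look like this; the range I draw the reporting probabilities from is an arbitrary choice.

```python
pi = rng.uniform(0.1, 0.9, size=J)    # known reporting probability for each level of x
r = rng.binomial(1, pi[x])            # coin flip: y_i is reported only when r_i = 1

# Horvitz-Thompson estimator of E[y]: each reported y_i is weighted by the
# inverse of its known reporting probability; unreported points contribute 0.
ht = np.sum(r * y / pi[x]) / N
print("Horvitz-Thompson estimate:", ht)
```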

All well and good so far. The thing that Robins, Ritov and Wasserman point out is that the Bayesian estimator will, in finite samples, often be massively biased unless the sampling probabilities are used when setting the priors. Here is Wasserman talking about it. And here is Andrew saying some smart things in response (back in 2012!).

I read this whole discussion back in the day and it never felt very satisfying to me. I was torn between my instinctive dislike of appeals to purity and my feeling that none of the Bayesian resolutions were very satisfying.

So ten years later I got bored (read: I had covid) and decided to sketch out my solution using, essentially, MRP. And I think it came out a little bit interesting. Not in a “this is surprising” sense, or even as a refutation of anything anyone else has written on this topic. But more as an example that crystallizes the importance of taking the posterior seriously when you’re doing Bayesian modelling.

The resolution essentially finds the posterior for all of the mean parameters and then uses that as our new information about how the sample was generated. From this we can take our new joint distribution for the covariate, the data, and the ancillary coin and use it to estimate the average of an infinite sample. And, shock and horror, when we do that we get something that looks an awful lot like a Horvitz-Thompson estimator. But really, it’s just MRP.
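As a very rough illustration of what post-processing the posterior could look like, here is a toy sketch under independent normal priors on the category means. This is only my reading of the recipe, not the construction from the linked post; the prior scale tau is an arbitrary assumption, and it reuses the simulated y, x, r, J, sigma from the sketches above.

```python
# Toy sketch only: form a conjugate posterior for each category mean, impute the
# censored y_i from it, and average, giving posterior draws of the complete-sample mean.
tau = 10.0                                         # assumed prior sd on each category mean
post_var = np.full(J, tau**2)
post_mean = np.zeros(J)
for j in range(J):
    obs = y[(x == j) & (r == 1)]
    if obs.size:                                   # categories with at least one reported y
        post_var[j] = 1.0 / (1.0 / tau**2 + obs.size / sigma**2)
        post_mean[j] = post_var[j] * obs.sum() / sigma**2

draws = []
for _ in range(2000):
    mu_draw = rng.normal(post_mean, np.sqrt(post_var))            # draw the category means
    y_full = np.where(r == 1, y, rng.normal(mu_draw[x], sigma))   # impute the censored y_i
    draws.append(y_full.mean())                                   # complete-sample mean

print("posterior mean and sd of the complete-sample mean:",
      np.mean(draws), np.std(draws))
```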

If you’re interested in the resolution, the full post isn’t too long and is here. (Warning: contains some fruity language). I hope you enjoy.

9 thoughts on “Circling back to an old Bayesian “counterexample””

  1. Hi, Dan!

    I hope the covid didn’t cause you to lose your sense of smell—I’ve heard that’s really annoying. Also, I didn’t know you had a blog. I will add it to the links here.

    I like your linked post and I encourage everyone to read it. I like your point that “If you’re going to make Bayes work for you, think in terms of observables (eg the mean of the complete sample) rather than parameters.” Also I have a few comments:

    1. You say, “We are still true committed subjective Bayesians.” I don’t think you need to say “subjective,” as the choice of the prior is no more subjective than the choice of the data model, and I don’t hear classical statisticians talking about their subjective choice of the Poisson model or the logistic transformation or whatever other conventional choice they are making that day. For more ranting on that, see this paper with Hennig.

    2. Please please please don’t talk about “a random coin flip with a known probability (that differs for each level of x)”! You can load a die but you can’t bias a coin.

    3. Inference with survey weights can be really hard. I guess it depends on the application, but sometimes it seems that theoretical statisticians (Basu excepted) don’t really get it: they think that inverse-probability weighting is some kind of magic trick, without realizing that (a) unbiased don’t mean jack if your variance is high enough, (b) we almost always use ratio estimates which are just about never unbiased, and (c) the weights or sampling probabilities or whatever aren’t really known anyway, they’re just modeled like everything else. It’s turtles all the way down.

    A consequence of point 3 is that it can be hard to develop a Bayesian method that is competitive—if it is being asked to compete with the in-practice-nonexistent classical inverse probability weighting approach that has zero bias and reasonable variance.

    To put it another way, if the classical method was all that, then applied Bayesians like me wouldn’t be working our butt off trying to construct a Bayesian approach; we’d just stick with the wonderful thing that already works. But it doesn’t, so we do.

    And constructing reasonable nonparametric Bayesian approaches is a struggle. Yajuan, Natesh, and I have this cool 2015 paper but the method’s too complicated for practical use. Some more recent work with Yajuan and others is here and here. I’d like to believe that we’ll someday reach the end of this particular tunnel.

    • “You can load a die but you can’t bias a coin.”

      However, if it is thrown and allowed to bounce, it can have a stable probability of heads that is not close to 1/2, and it is easy to alter this probability by shaving the edges of the coin to different angles.

    • Thanks Andrew! I got lucky and kept my sense of taste (although some people might argue that I lost that years ago).

      1. Yes. I was mostly taking the piss out of Robins/Ritov/Wasserman for making that distinction. I don’t think there is such a thing as a true/committed/subjective Bayesian. Just people who do their best to use their tools. And yes, I agree with that paper you wrote with Hennig!

      2. This is how I know you’re not a cricket fan. You won’t believe what a guy can do with a piece of sandpaper he smuggled in his underwear.

      3. Yeah. I very much agree. Weights are great when they work, but especially in this sort of context where most of the groups are empty, it’s hard to imagine them being a good idea! And I think if you dig into the variance of the Bayesian estimator you see that. To be honest, I don’t think that the approach that I outlined is a good one. It just matches the target. I think that the sheer awkwardness of it should make people question whether that target was a good thing to match. The emphasis on the lack of any type of a priori smoothness in the group means is particularly awkward and I would definitely be very very skeptical if someone tried to model that!

      And yeah – I always read the papers Yajuan writes with great interest! It’s a fascinating topic. And I definitely wasn’t trying to add to it! I mostly just wanted to lay out how you could post process your posterior to get to the traditional estimator. I very much hope people will work out how you can use the posterior to get to something a lot better!

  2. Andrew’s point 3 is salient, one should not forget Basu’s elephants when considering Horvitz-Thompson. I am surprised that there is no mention of Sims, since he played a major role in the original discussion.

  3. “Then we can ask ourselves a much more Bayesian question: What would the average in our sample have been if we had recorded every y_i?”

    “But, of course, that isn’t actually the quantity that I’m interested in. I’m interested in that quantity averaged over realisations of r.”

    I don’t understand. What if I’m interested in what the uncensored average in a completely new sample would be, meaning new covariates x_i?
