Erik van Zwet writes:
I saw you re-posted your Bayes-solves-multiple-testing demo. Thanks for linking to my paper in the PPS! I think it would help people’s understanding if you explicitly made the connection with your observation that Bayesians are frequentists:
What I mean is, the Bayesian prior distribution corresponds to the frequentist sample space: it’s the set of problems for which a particular statistical model or procedure will be applied.
Recently Yoav Benjamini criticized your post (the 2016 edition) in section 5.5 of his article/blog “Selective Inference: The Silent Killer of Replicability.”
Benjamini’s point is that your simulation results break down completely if the true prior is mixed ever so slightly with a much wider distribution. I think he has a valid point, but I also think it can be fixed. In my opinion, it’s really a matter of Bayesian robustness; the prior just needs a flatter tail. This is a much weaker requirement than needing to know the true prior. I’m attaching an example where I use the “wrong” tail but still get pretty good results.
In his document, Zwet writes:
This is a comment on an article by Yoav Benjamini entitled “Selective Inference: The Silent Killer of Replicability.”
I completely agree with the main point of the article that over-optimism due to selection (a.k.a. the winner’s curse) is a major problem. One important line of defense is to correct for multiple testing, and this is discussed in detail.
In my opinion, another important line of defense is shrinkage, and so I was surprised that the Bayesian approach is dismissed rather quickly. In particular, a blog post by Andrew Gelman is criticized. The post has the provocative title: “Bayesian inference completely solves the multiple comparisons problem.”
In his post, Gelman samples “effects” from the N(0,0.5) distribution and observes them with standard normal noise. He demonstrates that the posterior mean and 95% credible intervals continue to perform well under selection.
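Gelman’s setup is easy to reproduce. Here is a minimal sketch in Python (the original demo is in R; the seed, sample size, and conjugate-normal shortcut are my own illustrative choices, not taken from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
tau = 0.5
theta = rng.normal(0, tau, n)      # true effects from the N(0, 0.5) prior
y = theta + rng.normal(0, 1, n)    # observed with standard normal noise

# Conjugate posterior: theta | y ~ N(shrink * y, shrink), shrink = tau^2 / (tau^2 + 1)
shrink = tau**2 / (tau**2 + 1)
half = 1.96 * np.sqrt(shrink)
lo, hi = shrink * y - half, shrink * y + half

# Select only the "interesting" intervals, the ones excluding zero; coverage
# among the selected stays close to 95% because the analysis prior is the truth here.
sel = (lo > 0) | (hi < 0)
coverage = float(((theta > lo) & (theta < hi))[sel].mean())
```

Selection does not hurt: conditional on any event defined by the data, the credible intervals retain their 95% coverage, which is the point of the original post.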
In section 5.5 of Benjamini’s paper the N(0,0.5) is slightly perturbed by mixing it with N(0,3) with probability 1/1000. As a result, the majority of the credible intervals that do not cover zero come from the N(0,3) component. Under the N(0,0.5) prior, those intervals get shrunken so much that they miss the true parameter.
It should be noted, however, that those effects are so large that they are very unlikely under the N(0,0.5) prior. Such “data-prior conflict” can be resolved by having a prior with a flat tail. This is a matter of “Bayesian robustness” and goes back to a paper by Dawid which can be found here.
Importantly, this does not mean that we need to know the true prior. We can mix the N(0,0.5) with almost any wider normal distribution with almost any probability and then very large effects will hardly be shrunken. Here, I demonstrate this by using the mixture 0.99*N(0,0.5)+0.01*N(0,6) as prior. This is quite far from the truth, but nevertheless, the posterior inference is quite acceptable. We find that among one million simulations, there are 741 credible intervals that do not cover zero. Among those, the proportion that do not cover the parameter is 0.07 (CI: 0.05 to 0.09).
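Zwet’s robustness experiment can be sketched as follows (his simulations are in R; this Python version uses my own seed and a smaller sample size, so the counts are only in the same ballpark as his, not identical):

```python
import numpy as np
from math import erf, sqrt

Phi = np.vectorize(lambda v: 0.5 * (1 + erf(v / sqrt(2))))
rng = np.random.default_rng(1)
n = 200_000

# Benjamini's perturbed truth: N(0, 0.5) contaminated with N(0, 3) w.p. 1/1000
wide = rng.random(n) < 1e-3
theta = np.where(wide, rng.normal(0, 3.0, n), rng.normal(0, 0.5, n))
y = theta + rng.normal(0, 1, n)

# Zwet's deliberately wrong-tailed analysis prior: 0.99*N(0,0.5) + 0.01*N(0,6)
pis, taus = np.array([0.99, 0.01]), np.array([0.5, 6.0])
shrink = taus**2 / (taus**2 + 1)              # per-component shrinkage factor
marg_sd = np.sqrt(taus**2 + 1)                # marginal sd of y per component
logw = np.log(pis) - np.log(marg_sd) - 0.5 * (y[:, None] / marg_sd) ** 2
w = np.exp(logw - logw.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)             # posterior mixture weights

def post_cdf(t):
    """CDF of the posterior (a two-component normal mixture) evaluated at t."""
    return (w * Phi((t[:, None] - shrink * y[:, None]) / np.sqrt(shrink))).sum(axis=1)

F0, Ft = post_cdf(np.zeros(n)), post_cdf(theta)
sel = (F0 < 0.025) | (F0 > 0.975)             # central 95% interval excludes zero
miss = (Ft < 0.025) | (Ft > 0.975)            # ...and fails to cover theta
miscoverage = float(miss[sel].mean())
```

The flat-tailed mixture prior barely shrinks extreme observations, so the selected intervals miss the truth far less often than under the pure N(0,0.5) prior.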
The point is that the procedure merely needs to recognize that a particular observation is unlikely to come from N(0,0.5), and then apply very little shrinkage.
My own [Zwet’s] views on shrinkage in the context of the winner’s curse are here. In particular, a form of Bayesian robustness is discussed in section 3.4 of a preprint of myself and Gelman here. . . .
He continues with some simulations that you can do yourself in R.
The punch line is that, yes, the model makes a difference, and when you use the wrong model you’ll get the wrong answer (i.e., you’ll always get the wrong answer). This provides ample scope for research on robustness: how wrong are your answers, depending on how wrong is your model? This arises with all statistical inferences, and there’s no need in my opinion to invoke any new principles involving multiple comparisons. I continue to think that (a) Bayesian inference completely solves the multiple comparisons problem, and (b) all inferences, Bayesian included, are imperfect.
I suspect there is a widespread fantasy and aching desire not to have responsibility for the model assumptions adequately representing and connecting to the world we are trying to learn about.
Fake data simulation from joint model A and then analyzing the fake data assuming model B is a great way to learn what we can about how important it is to get various aspects of model A not too wrong in model B.
In general, organisms that don’t adequately represent their environment so that they eat food or avoid their prey just don’t survive.
Oops! In general, organisms that don’t adequately represent their environment so that they hunt their prey and avoid their predators just don’t survive.
I just don’t buy the part
>it’s the set of problems for which a particular statistical model or procedure will be applied.
Each data analysis needs its own prior. We don’t say things like “this measurement error model is going to be used very often: by a factory making widgets that have a diameter of 24mm ± 0.02mm, and also by a satellite measurement system detecting the height above sea level of various lakes, where it will be measured in hundreds of meters plus or minus tens of meters; plus I’m going to be measuring athletes’ skeletal dimensions, where the femurs are about 50cm ± 8cm, and the astrophysics people will be measuring the distance to the moon in hundreds of millions of meters… so the prior on the mean should be this weird mixture model.”
It’s Andrew’s way of imagining priors. It makes sense in some scenarios, so its failing is not that it’s wrong, rather that it’s not general enough. Still, if you can’t see what a non-frequency prior is in general, then it’s better than nothing. And definitely better than dreaming up some fictitious scenario in which the prior is magically a frequency.
Daniel > Each data analysis needs its own prior.
OK, but recall all the prior and likelihood need to be jointly considered discussions.
From Andrew’s quote it would seem the collective he defines is all applications for which you would specify the same prior and likelihood. There the _counterfactual_ repeated performance may, and usually does, differ for each point of the parameter space, but the average with respect to the prior seems a worthwhile characterization of how well we can expect to do.
Note that this is Andrew quoting me quoting Andrew!
Frequentists consider the performance of their procedures under repeated sampling where the parameter is kept fixed. Similarly, Bayesians can consider the performance of their procedures under repeated sampling where the parameter is sampled from the prior. In particular, posterior probabilities can then be interpreted as frequencies, so in that sense “Bayesians are frequentists”.
This is probably not relevant when a data analysis is a one-off. However, I think it is relevant in the context of many similar analyses. That’s also when multiple testing might be considered to be a problem.
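The sense in which posterior probabilities are frequencies can be checked directly. A sketch under an assumed N(0,1) prior and unit-variance noise (the 70–80% band is an arbitrary choice of mine):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
n = 1_000_000
theta = rng.normal(0, 1, n)       # a fresh parameter from the prior each time
y = theta + rng.normal(0, 1, n)   # one noisy observation per parameter

# Conjugate posterior: theta | y ~ N(y/2, 1/2), so P(theta > 0 | y) = Phi(y / sqrt(2))
Phi = np.vectorize(lambda v: 0.5 * (1 + erf(v / sqrt(2))))
p_pos = Phi(y / sqrt(2))

# Wherever the posterior says "70-80% sure theta > 0", theta really is
# positive about 75% of the time: the posterior probability is a frequency.
band = (p_pos > 0.7) & (p_pos < 0.8)
freq = float((theta[band] > 0).mean())
```

This calibration holds only because the parameters really are drawn from the assumed prior, which is exactly Erik’s framing of the prior as the frequentist sample space.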
Erik > This is probably not relevant when a data analysis is a one-off.
I think the primary focus of scientific statistical logic is to discern as best we can from a sample in hand what _would_ be repeatedly observed if we knew the true prior and likelihood. Whether it is one off or not.
This counterfactual collection of re-observing is implied in any probability model and why we can simulate from them. The joint prior and likelihood is a representation of a possible world where simulation generates a valid sample from that possible world.
What we learn from the possible samples (by simulation or math) needs to be transported to the actual world (we can’t do anything else) which highlights the unavoidable need to have responsibility for the model assumptions adequately representing and connecting to the world we are trying to learn about.
For those who may be interested, there is some discussion of these developing ideas here: The Logic of Statistics https://www.youtube.com/watch?v=FqE4ROHBKpY
> I think the primary focus of scientific statistical logic is to discern as best we can from a sample in hand what _would_ be repeatedly observed if we knew the true prior and likelihood.
Unless you mean something else by “scientific statistical logic” I don’t see why the primary focus of statistical inference would be the sampling properties of what could have been observed – instead of what can be concluded from what was actually observed. That’s the relevant question in practice. The former subsidiary question may be of interest – and help in the analysis of the latter – but it’s hardly the primary concern.
> instead of what can be concluded from what was actually observed
Does what was learned not imply what would be repeatedly observed?
I am just cashing out what was learned, say the posterior probability, in terms of what would repeatedly be observed by sampling from that posterior.
> Does what was learned not imply what would be repeatedly observed?
I don’t know. I could make a prediction of what may be observed if data is collected again given what I know now but it’s not clear to me what it means “what would be repeatedly observed if we knew the true prior and likelihood”. What true prior?
Say that I want to know the proportion of houses with radon issues in a town. I have some prior distribution for that number. I check a number of houses taken at random. I end with some posterior distribution for the proportion of houses with radon issues. What is the true prior? How is it related to the sample at hand of “x houses out of n had too high levels of radon”?
And in any case, I wouldn’t say that predicting what would be observed if data were collected again is the primary focus. That’s usually not the reason why the whole analysis was carried out in the first place.
The true prior referred to in my comments is simply the one you have assumed. The comments are about the need for the assumptions to adequately represent and connect to the world. _If they do_ the posterior provides an adequate sample of what would be repeatedly observed. _If they don’t_ all bets are off.
> _If they do_ the posterior provides an adequate sample of what would be repeatedly observed.
Depends on what you mean by “would be repeatedly observed”. There is a definite – but unknown – proportion of houses in the town with high levels of radon, and if we perform batches of a dozen tests again and again and again in this town, the number of positives per batch will be distributed following some distribution conditional on that actual number. It won’t be distributed averaging over the prior. That prior-averaged distribution represents our best guess, but it is not a property of the world.
One may also make the “would be repeatedly observed” be about hypothetical observed repetitions in hypothetical towns different from the one in hand. I don’t think that should be our primary focus.
I’d like to comment on no free lunches, and that “robustifying” has a cost, too. In your example, you demonstrated good overall calibration of the intervals, but the calibration is not actually uniformly good. For example, when the absolute value of the observation is in the range [3,4], less than 3% of the intervals exclude the true value (i.e., the intervals are too long), and when the absolute value of the observation is larger than 4.5, more than 12% of the intervals exclude the true value (i.e., the intervals are too short). Also, for those thetas that come from the original normal(0,0.5), the addition of the slab doubles the type-M error for the cases with intervals not overlapping 0. I would even say that if some change in the model doubles the type-M error, it’s not a robust change.
Going beyond this example, my thoughts on robust priors and models: I used to recommend t and Cauchy priors as robust, but the problems were that 1) they too often made the inference unstable (less robust), as the thick tail gave too much mass to infeasible values, and 2) the robustness could hide errors in the data. I found that thinner-tailed priors 1) gave more stable estimates, and 2) made it easier to detect conflicts between the prior and data, which led to increased understanding of the phenomenon and improved models (or finding errors in the data collection). I know Andrew has noticed the same, as he’s also switched from Cauchy priors to (half-)normals.
Also since I see you are following the discussion, I want to mention that I’m a big fan of your work on priors!
I agree there’s no free lunch. If the prior used to compute the posterior isn’t the same as the one generating the data, then the performance under repeated sampling will suffer.
I don’t have much experience with robust Bayes. I do think a separate mixture component to capture “outliers” can be useful. In particular, you get the posterior probability of that component which can help flag a potential problem.
(And thanks for the nice compliment!)
Under a purist Bayesian philosophy, each analysis should affect the prior of subsequent analyses or, equivalently, all the analyses should be considered at once for a joint posterior.
Well, it’s true that (prior(u) * L1(u|x1)) * L2(u|x2) = prior(u) * (L1(u|x1) * L2(u|x2))
Written as just generic functions f0(u) * f1(u) * f2(u)
Now in some cases it is hard to know what should be considered a prior versus a likelihood, e.g. data augmentation https://statmodeling.stat.columbia.edu/2019/12/02/a-bayesian-view-of-data-augmentation/
or when one becomes aware of (observes) others’ priors (data for you?) for the same data set – Bayesian Predictive Decision Synthesis https://arxiv.org/abs/2206.03815
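Keith’s associativity identity is easy to verify with a tiny conjugate example (a sketch; the Beta-binomial setup and the particular counts are mine, chosen only for illustration):

```python
# Conjugate Beta-binomial updating: the posterior after seeing batch x1 and
# then batch x2 equals the posterior after seeing all the data at once,
# because prior(u) * L1(u|x1) * L2(u|x2) can be grouped either way.
def update(a, b, successes, trials):
    # Beta(a, b) prior + binomial data -> Beta(a + s, b + (n - s)) posterior
    return a + successes, b + (trials - successes)

a0, b0 = 1, 1                                        # flat Beta(1, 1) prior
sequential = update(*update(a0, b0, 3, 10), 7, 20)   # update on x1, then on x2
joint = update(a0, b0, 3 + 7, 10 + 20)               # update on all data at once
# sequential == joint == (11, 21)
```

The two routes agree exactly, which is all the identity claims; the hard part, as the comment notes, is deciding which factor to call the prior.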
That’s true if each analysis is targeting the same draw from the prior. But I don’t think it’s true when each analysis targets a new draw.
Well, it’s true that (prior(u) * L1(u|x1)) * L2(u|x2) = prior(u) * (L1(u|x1) * L2(u|x2))
Written as just generic functions f0(u) * f1(u) * f2(u), in some cases it is hard to know what should be considered a prior versus a likelihood, e.g. data augmentation, or when one becomes aware of (observes) others’ priors (data for you?) for the same data set.
When I wrote “each analysis targets a new draw from the prior” I meant that we have (in your notation)
(u1,x1) from pi(u1) * L(u1|x1) where we observe x1 and want to estimate u1
(u2,x2) from pi(u2) * L(u2|x1) where we observe x2 and want to estimate u2
Erik – then you have zero replication – only one sample for each value of the parameter??
Sorry Keith, the second line should be:
(u2,x2) from pi(u2) * L(u2|x2) where we observe x2 and want to estimate u2
Note that x1 and x2 could each be vectors of iid observations. The point is that each ui is a new sample from the prior. So, in particular, x1 doesn’t have any information about u2. Therefore the posterior of u1 should not be used as a prior to be combined with x2 to estimate u2.
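Erik’s point can be checked numerically. A sketch (the N(0,1) prior, unit noise, seed, and sample size are my own illustrative choices): reusing the posterior of u1 as the prior for a fresh draw u2 degrades interval coverage, while the original prior keeps it at 95%.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
# Two separate problems: u1 and u2 are independent draws from the N(0,1) prior
u1 = rng.normal(0, 1, n)
x1 = u1 + rng.normal(0, 1, n)
u2 = rng.normal(0, 1, n)
x2 = u2 + rng.normal(0, 1, n)

z = 1.96
# Correct analysis of u2: prior N(0,1), one observation x2 -> posterior N(x2/2, 1/2)
m, s = x2 / 2, np.sqrt(1 / 2)
cov_right = float(np.mean(np.abs(u2 - m) < z * s))

# Mistaken analysis: reuse the posterior of u1, N(x1/2, 1/2), as the prior for u2
# -> posterior N((x1 + x2)/3, 1/3), since the precisions 2 and 1 add
m_bad, s_bad = (x1 + x2) / 3, np.sqrt(1 / 3)
cov_wrong = float(np.mean(np.abs(u2 - m_bad) < z * s_bad))
```

Since x1 carries no information about u2, chaining the posteriors produces overconfident, mis-centered intervals, exactly as the comment says.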