Tai Huang writes:
I am reading this paper [Why we (usually) don’t have to worry about multiple comparisons, by Jennifer, Masanao, and myself]. I am trying to figure out how to do multiple comparisons correctly under Bayesian inference for A/B/C testing. For the traditional t-test approach, a Bonferroni correction is needed to adjust the alpha value.
I am confused with your suggestion of not worrying about multiple comparisons. For example,
– For A/B testing, if P(A>B) > 0.975, I declare A wins.
– For A/B/C testing, does “No Correction Needed” mean that I can still use 0.975 to compare and get the same type 1 error rate as in the A/B testing case?
My reply:
The published version of this paper is here.
The short answer is that I think it’s a mistake to “declare A wins” if Pr(A>B) > 0.975. The problem is with the perceived need for a deterministic conclusion. I don’t think type 1 error rates are relevant for reasons discussed here and here.
I can see that some readers might think my answer is “cheating”—my way around the type 1 error calibration problem is to say that I don’t care about type 1 error—but I’m serious. Here’s something I wrote about type 1 errors etc. back in 2004, one of our earliest blog posts.
Sometimes we have to make a decision and take a discrete action. In that case, decide on a utility function, use your MCMC samples to calculate the expected utility of each action, and then do whatever is best.
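Here’s a minimal sketch of that recipe (the posterior draws, the switching cost, and the utility function are all made up for illustration):

```python
import numpy as np

# Hypothetical posterior draws for the conversion rates of A and B,
# standing in for MCMC output; the counts here are made up.
rng = np.random.default_rng(1)
theta_A = rng.beta(45 + 1, 955 + 1, size=4000)
theta_B = rng.beta(38 + 1, 962 + 1, size=4000)

# A made-up utility: payoff proportional to the conversion rate of the
# variant we ship, minus a fixed switching cost if we ship B.
switch_cost = 0.001

def expected_utility(action):
    if action == "ship_A":
        return np.mean(theta_A)
    return np.mean(theta_B) - switch_cost

actions = ["ship_A", "ship_B"]
best = max(actions, key=expected_utility)
print({a: round(expected_utility(a), 4) for a in actions}, "->", best)
```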
Why wouldn’t we want Pr[A > B] to be calibrated just like any other posterior expectation? I don’t see a calibration problem in theory, though it may be tough in practice to get tail probabilities with good relative error using MCMC.
Is it that you only care about calibration for particular predictive quantities of interest and Pr[A > B] isn’t of interest?
I’m working on a paper on differential expression of gene splice variants across treatment and control groups. The field has set the task up as null hypothesis significance testing, where the null is that there is less than a 5% difference in expression between treatment and control conditions (the 5% isn’t related to a p-value threshold). So the paper evaluates Type 2 (false negative) error rates at a fixed Type 1 (false positive) rate and also calculates area under the ROC curve (for simulated data, of course; nobody has the ground truth for the genome in all these experimental conditions, which is why they’re doing experiments). I think it’d be much more useful to report the estimates and uncertainty and get a better measure of overall calibration using something like cross-validation of predictive log density. The bind we’re in now is that we’re going to have to report Type 2 error rates at a fixed Type 1 error rate and then area under the ROC curve. The irony here is that what they’re doing with these results is sorting the genes by p-value and then looking at the ones with the lowest p-values for wet-lab followup. So the goal is really ranking, which can be done directly with Bayesian approaches (I have an example in my repeated binary trial case study for Stan).
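Roughly, the ranking idea looks like this (a toy sketch with made-up posterior draws standing in for whatever the real splice-variant model would produce):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical posterior draws of the expression difference for a few genes.
draws = {
    "geneA": rng.normal(0.12, 0.05, 4000),
    "geneB": rng.normal(0.02, 0.03, 4000),
    "geneC": rng.normal(-0.09, 0.06, 4000),
}

# Rank by the posterior probability that the difference exceeds the
# 5% threshold in absolute value, rather than by p-value.
prob_big = {g: np.mean(np.abs(d) > 0.05) for g, d in draws.items()}
for gene, p in sorted(prob_big.items(), key=lambda kv: -kv[1]):
    print(gene, round(p, 3))
```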
Bob:
I do want Pr(A>B) to be calibrated. But this has nothing to do with multiple testing corrections or hypothesis testing. I want Pr(A>B) to be calibrated in the Bayesian way: averaging over all the cases when Pr(A>B) = 0.2, say, A should actually be greater than B 20% of the time.
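As a quick check of that kind of calibration, here’s a toy simulation (a conjugate beta-binomial A/B setup chosen purely for convenience; all the numbers are made up). When the model matches the generative process, cases where the posterior Pr(A>B) is near 0.2 should have A actually beating B about 20% of the time:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sims = 200, 10000
pred, truth = [], []

for _ in range(sims):
    # Draw the "true" rates from the prior, then simulate data from them.
    theta_A, theta_B = rng.beta(1, 1, size=2)
    y_A = rng.binomial(n, theta_A)
    y_B = rng.binomial(n, theta_B)
    # Conjugate posteriors; estimate Pr(A > B | data) by Monte Carlo.
    post_A = rng.beta(1 + y_A, 1 + n - y_A, 500)
    post_B = rng.beta(1 + y_B, 1 + n - y_B, 500)
    pred.append(np.mean(post_A > post_B))
    truth.append(theta_A > theta_B)

pred, truth = np.array(pred), np.array(truth)

# Within each bin of predicted Pr(A > B), the empirical frequency of
# "A really is better" should roughly match the predicted probability.
bins = np.minimum((pred * 5).astype(int), 4)
for b in range(5):
    mask = bins == b
    print(f"predicted Pr(A>B) in [{b/5:.1f}, {(b+1)/5:.1f}): "
          f"A actually > B {truth[mask].mean():.2f} of the time")
```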
That’s not the “Bayesian way”. It’s the “I started out as a frequentist, switched to Bayes, but can’t quite let go of frequentism” way, or the ISOAAFSTBBCQLGOF way for short.
The Bayesian way would be something more like “Pr(A greater than B | evidence+assumptions) = .2 means 20% of the possibilities left unspecified by evidence+assumptions make A greater than B”.
To illustrate the difference, suppose we’re trying to predict which days it’s going to rain. Here are some perfectly calibrated probabilities determined by analyzing historical frequencies:
1Feb 2Feb 3Feb 4Feb
Pr(rain) .25 .25 .25 .25
outcome rain dry dry dry
Alternatively, I go out and create physics-based weather models, using distributions to model any parameter I’m uncertain about (one of my stat thesis advisors spent his career doing just that). I then run simulations using different values for those parameters and let Pr(rain|model) be the percentage of simulations that give rain for that day.
Note: This probability has nothing to do with the frequency with which it will rain over a number of different days; rather, it is the percentage of unknowns that led to rain for that specific day. This view of probability is Laplace’s old “ratio of favorable cases to all cases” and not a frequentist one. Based on this, I produce:
1Feb 2Feb 3Feb 4Feb
Pr(rain) .9 .1 .1 .1
outcome rain dry dry dry
The former is “calibrated” in the sense you describe, while the latter is not. Yet if we use the prediction rule “predict rain when odds favor rain”, the “calibrated” version will be wrong 25% of the time over the next four days, while the latter will be wrong 0% of the time!
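For what it’s worth, that comparison is easy to reproduce; the script below just encodes the two four-day forecasts and outcomes given above and applies the stated prediction rule:

```python
# The two four-day forecasts from the comment, plus the observed outcomes.
calibrated = [0.25, 0.25, 0.25, 0.25]   # historical-frequency forecast
physics    = [0.90, 0.10, 0.10, 0.10]   # "percentage of unknowns" forecast
rained     = [True, False, False, False]

def error_rate(probs):
    # Prediction rule: "predict rain when odds favor rain" (p > 0.5).
    predictions = [p > 0.5 for p in probs]
    return sum(pred != obs for pred, obs in zip(predictions, rained)) / len(rained)

print("calibrated forecast error rate:", error_rate(calibrated))  # 0.25
print("physics-model forecast error rate:", error_rate(physics))  # 0.0
```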
If you think about it, the “calibration” you describe is a long-held prejudice that isn’t backed up by any analytical evidence or requirement from the mathematics itself.
I share this attitude towards calibration. I actually use a similar forecasting example in my book. I learned it from Jaynes, I am pretty sure.
Hello McElreath,
Jaynes would have pointed out that the latter is “more informative” than the calibrated version, and backed it up by computing the entropy of both, with the latter having smaller entropy.
Then he would have said the calibrated version isn’t “wrong”, just appropriate for the low informational state where all you know is some historical frequencies, as is typical of the pre-DNA genetics stuff Fisher liked to work on.
The latter isn’t “right”, it’s just appropriate when you know more; for example, when you have at hand all of physics as background knowledge, the way Jeffreys did in geophysics.
At the end of the day though, the calibration Andrew mentioned isn’t some kind of universal Bayesian requirement.
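The entropy comparison is easy to check, too; here’s a quick sketch computing the total Bernoulli entropy (in bits) of the two four-day forecasts above:

```python
import math

def bernoulli_entropy(p):
    # Entropy in bits of a single rain/dry forecast with P(rain) = p.
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

calibrated = [0.25, 0.25, 0.25, 0.25]
physics    = [0.90, 0.10, 0.10, 0.10]

print("calibrated:", sum(bernoulli_entropy(p) for p in calibrated))  # about 3.25 bits
print("physics:   ", sum(bernoulli_entropy(p) for p in physics))     # about 1.88 bits
```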
Any blogs/books/papers for someone who wants to understand Bayesianism from this perspective?
The Rosenkrantz edited volume of Jaynes’ papers is what I found most rewarding. But it’s not one long argument, so requires some work. Jaynes’ Probability Theory book is more organized.
My own book is rather Jaynesian, with a strong maximum entropy perspective in some chapters. But it is also strongly pragmatic and makes some compromises.
There was an old blog by entsophy that was great. But I don’t think it is archived anywhere.
A calibrated posterior is like an unbiased estimate. All else being equal, a calibrated posterior is better than an uncalibrated one. Without calibration, you can definitely be gamed (by arbitrage). But it’s not the only consideration, as the population average prediction brought up by Anonymous demonstrates.
The other side of the coin is sharpness. Sharpness is like precision (inverse variance) in estimation. Predictions of 90% and 10% are sharper than predictions of 25%. Sharpness can be measured information-theoretically through entropy. The forecasting literature goes over many notions of calibration, and discusses the case Anonymous brings up of long-term averages being probabilistically calibrated.
As with estimation, it’s possible to trade sharpness for calibration. And it may turn out that a nearly calibrated forecast with much greater sharpness is more useful for some purposes.
Decision making typically involves some notion of probability. I don’t just “predict rain when odds favor rain” and act on that. It may be worth carrying an umbrella even if there’s only a 30% chance of rain. Or consider the decision to get a flu shot. I don’t look at my odds of getting the flu and say, hey, it’s only a 10% chance, so I predict I won’t get the flu, so no shot. Similarly, if you are going to make book on a football game or act as a market maker in the stock market, calibration is important.
There’s a whole literature on evaluating classifiers based on a notion of “proper scoring rule.” Log loss is one, and it’s closely related to calibration.
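As a concrete illustration (using the four-day rain example from the earlier comment, not real data), log loss rewards the sharper forecast here:

```python
import math

calibrated = [0.25, 0.25, 0.25, 0.25]
physics    = [0.90, 0.10, 0.10, 0.10]
rained     = [True, False, False, False]

def log_loss(probs):
    # Average negative log probability assigned to what actually happened.
    return -sum(math.log(p if obs else 1 - p)
                for p, obs in zip(probs, rained)) / len(rained)

print("calibrated:", round(log_loss(calibrated), 3))  # about 0.562
print("physics:   ", round(log_loss(physics), 3))     # about 0.105
```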
Anon:
Calibration is a mathematical property of the posterior distribution, averaging over the generative model. We want all probabilities to be calibrated, in the sense that miscalibration implies a model failure. Calibration is not enough, though: it is well understood that a model can be calibrated but not useful, if other useful information is available that is not included in the model.
Andrew, the problem anonymous is bringing up isn’t solved by “averaging over the generative model”, it’s a model-theoretic one, in the sense of mathematical model theory.
There are at least two models of probability. One, for example, is the model in which probability describes the properties of infinite sequences of numbers… If I have a “true” random number generator that gives me bytes (numbers between and including 0 and 255), then over millions of draws from this random number generator, if I get 128 about 33% of the time, the probability of getting 128 is very close to 0.33 *by definition of the mapping between the concept of “probability” and “frequency in the long run”*.
So by making a one-to-one correspondence between a concept in probability theory and a concept in the world or another area of mathematics (computing theory), we build “a model” of probability theory.
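A throwaway simulation of that mapping, assuming a uniform byte generator (so the long-run frequency of any particular byte is about 1/256; the 33% above is just a hypothetical number):

```python
import numpy as np

rng = np.random.default_rng(42)
draws = rng.integers(0, 256, size=2_000_000)  # stand-in for a "true" byte generator

# Under the frequency interpretation, the probability of 128 *is* this
# long-run relative frequency (about 1/256, i.e. roughly 0.0039, for a
# uniform generator).
print((draws == 128).mean())
```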
But this is not a unique mapping, just as in geometry “line” is not a unique concept: we can identify the word “line” with, say, a great circle on a sphere, and then *there are no parallel lines*, for example.
We can also map the concept of probability to “information we have about what is likely to be true”. So if I say that it’s 100% true that every human currently in the US is between 0 and 4 m tall, and within this range we assign the average height a normal(1.58, 0.03) distribution, it does *not* mean any of the following:
1) in 95% of worlds we might live in, if we were to average all the people in the US we would get a number between 1.52 and 1.64
2) In the future, 95% of the time that we do a census of all people in the US, we will get a number between 1.52 and 1.64
3) Averaged across countries that are similar enough to the US 95% of the countries have people who average between 1.52 and 1.64 m tall…
or any of those concepts that involve the words “of the time”.
This is not just a little problem; this is at the heart of the problem of communicating Bayesian statistics… Because a Bayesian result saying that the posterior distribution of the average height in the US is approximately normal(1.58, 0.03) should tell you something about the world… and if you misinterpret what it tells you about the world, then how can you use Bayesian statistics?
So what does the Bayesian result tell you? The best way I’ve come up with to think about it is this:
Among all the options I am willing to consider based on my modeling assumptions and knowledge of the world, 95% of the available information points to the average height being between 1.52 and 1.64 m
There is *no sense in which* this is calibratable in terms of frequency… There just isn’t a frequency involved here. It is a generalization of set inclusion and exclusion: the indicator function, which is 1 where the values are “included in the set I am willing to consider” and 0 elsewhere, has been generalized to a smooth function f(x) that is non-negative and integrates to 1… and regions have the property that the *degree to which you would consider the possibility* is the integrated weight integral(f(x), x in region).
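To put a number on that, here’s the integrated weight over the region in the running example (the normal(1.58, 0.03) weight function and the interval 1.52 to 1.64 m; scipy is assumed to be available):

```python
from scipy.stats import norm

# Degree to which the interval [1.52, 1.64] is "included", i.e. the integral
# of the weight function f(x) = normal(1.58, 0.03) over that region.
weight = norm.cdf(1.64, loc=1.58, scale=0.03) - norm.cdf(1.52, loc=1.58, scale=0.03)
print(round(weight, 3))  # about 0.954, i.e. roughly the "95%" in the comment
```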
There are *multiple reasons why* you might assign probability in certain ways… Some of the reasons can be because of observed frequencies… but in the end… the probability is not a measure of frequency it’s a measure of “set inclusion”
To read a Bayesian result, a useful way to phrase it is:
If we are willing to include values x in our decision making processes according to a weight of inclusion p(x) prior to seeing data, and we have a predictive model which is willing to include predictions according to p(data | x) then after collecting our data we should be willing to include x values in our decision making processes according to p(x | data) = normalize {p(data | x) p(x)}
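Read that way, the update is just re-weighting and renormalizing. A toy grid version of the recipe, with made-up numbers (7 successes in 10 binomial trials):

```python
import numpy as np

# Candidate values of x we are willing to consider, with prior inclusion weights p(x).
x = np.linspace(0, 1, 101)          # e.g. candidate success rates
prior = np.ones_like(x) / len(x)    # start by including all candidates equally

# A predictive model p(data | x): here, 7 successes in 10 trials.
likelihood = x**7 * (1 - x)**3

# p(x | data) = normalize{ p(data | x) p(x) }
posterior = likelihood * prior
posterior /= posterior.sum()

print("posterior weight on x > 0.5:", round(posterior[x > 0.5].sum(), 3))
```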
“Some of the reasons can be because of observed frequencies… but in the end… the probability is not a measure of frequency it’s a measure of “set inclusion””
Hmm, not totally convinced of that.
Kolmogorov himself noted in his On Tables of Random Numbers the contribution of von Mises
“…the basis for the applicability of the results of the mathematical theory of probability to real ‘random phenomena’ must depend on some form of the frequency concept of probability, the unavoidable nature of which has been established by von Mises in a spirited manner.”
As well as in his Foundations of the Theory of Probability
“In establishing the premises necessary for the applicability of the theory of probability to the world of actual events, the author has used, in large measure, the work of R. v. Mises”
Justin
Justin,
Von Mises was a physicist (and brother of the famous Austrian economist Ludwig von Mises) who believed probabilities only applied to (1) systems that could be repeated, and (2) where the frequency histograms created are stable. He was very insistent about the latter. This would narrow the applicability of probability theory to far less than 1% of its current range.
But optimistically, I think this could be the basis for a Frequentist-Bayesian compromise. Frequentists can run free in that tiny fraction of cases von Mises would agree are appropriate, and Bayesians can have the rest.
As part of the truce, Bayesians will be good enough sports not to mention that they can handle the stable-frequencies cases as well.
Kolmogorov built *one particular* model of probability (in the model theory sense) basing it on the notion of “games of chance” or “random phenomenon” as described by Von Mises.
But it’s not the only model of probability. So it’s important to realize what you *mean* when you use probability. And, if you choose as your model of probability the properties of “random sequences” then BY DEFINITION Bayesian probability would be “wrong”.
On the other hand, if you choose as your model of probability the Cox version… then it measures a “degree of plausibility” and BY DEFINITION the frequency version would be “wrong” when your information is different from just “how often did something happen historically”. For example, if I measure 100 randomly selected people’s heights and then say that Joe in particular has a height given by normal(mu,s) while also acknowledging that Joe has some dwarfism genes… well I’d be wrong, because it’s known that people with those genes definitely don’t have heights similar to randomly selected people… even if I don’t have a proper randomly selected sample of people with those genes.
So, it’s important to recognize what the mapping between probability numbers and concepts in the world are… otherwise you’ll get very confused.
And, it’s worth mentioning again that *it’s not unique*. Kolmogorov is *right* (about the properties of random number sequences) while at the same time *so is Cox* (about the behavior of plausibility measures)
Daniel, it might not matter much to statisticians but I think one should distinguish between mathematical models of probability and interpretations of it. For example, in principle at least, both Bayesians and frequentists could use either the Kolmogorovian or the algebraic model.
Relatedly, What is probability?