Forget about multiple testing corrections. Actually, forget about hypothesis testing entirely.

Tai Huang writes:

I am reading this paper [Why we (usually) don’t have to worry about multiple comparisons, by Jennifer, Masanao, and myself]. I am searching how to do multiple comparisons correctly under Bayesian inference for A/B/C testing. For the traditional t-test approach, Bonferroni correction is needed to correct alpha value.

I am confused with your suggestion of not worrying about multiple comparisons. For example,

– For A/B testing, if P(A>B) > 0.975, I declare A wins.

– For A/B/C testing, does “No Correction Needed” mean that I can still use 0.975 to compare and get the same type 1 error rate as in A/B testing case?

My reply:

The published version of this paper is here.

The short answer is that I think it’s a mistake to “declare A wins” if Pr(A>B) > 0.975. The problem is with the perceived need for a deterministic conclusion. I don’t think type 1 error rates are relevant for reasons discussed here and here.

I can see that some readers might think my answer is “cheating”—my way around the type 1 error calibration problem is to say that I don’t care about type 1 error—but I’m serious. Here’s something I wrote about type 1 errors etc. back in 2004, one of our earliest blog posts.
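
To make this concrete, here is a minimal sketch (in Python, with made-up counts, and using a simple independent Beta-binomial setup rather than the multilevel model we recommend in the paper) of reporting the posterior for an A/B/C comparison directly instead of declaring a winner at a cutoff:

import numpy as np

rng = np.random.default_rng(1)
arms = {"A": (1000, 58), "B": (1000, 50), "C": (1000, 47)}  # (trials, successes), hypothetical

# Beta(1, 1) prior gives a conjugate Beta posterior for each arm's rate.
draws = {k: rng.beta(1 + s, 1 + n - s, size=20000) for k, (n, s) in arms.items()}

# Report posterior comparisons as probabilities, not verdicts.
print("Pr(A > B) =", np.mean(draws["A"] > draws["B"]))
print("Pr(A > C) =", np.mean(draws["A"] > draws["C"]))
stacked = np.column_stack([draws["A"], draws["B"], draws["C"]])
p_best = np.bincount(stacked.argmax(axis=1), minlength=3) / len(stacked)
print("Pr(arm is best):", dict(zip("ABC", p_best.round(3))))

The same Pr(A > B) and Pr(best) summaries fall out whether there are two arms or ten; the point is to carry the uncertainty forward into the decision rather than thresholding it away.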

37 thoughts on “Forget about multiple testing corrections. Actually, forget about hypothesis testing entirely.”

  1. Why wouldn’t we want Pr[A > B] to be calibrated just like any other posterior expectation? I don’t see a calibration problem in theory, though it may be tough in practice to get tail probabilities with good relative error using MCMC.

    Is it that you only care about calibration for particular predictive quantities of interest and Pr[A > B] isn’t of interest?

    I’m working on a paper on differential expression of gene splice variants across treatment and control groups. The field has set the task to be null hypothesis significance testing where the null is that there is less than 5% difference in expression between treatment and control conditions (the 5% isn’t related to a p-value threshold). So the paper evaluates Type 2 (false negative) error rates at a fixed Type 1 (false positive) rate and also calculates area-under-the-ROC-curve (for simulated data, of course—nobody has the ground truth for the genome in all these experimental conditions—that’s why they’re doing experiments).

    I think it’d be much more useful to report the estimates and uncertainty and get a better measure of overall calibration using something like cross-validation of predictive log density. The bind we’re now in is that we’re going to have to report Type II errors at a fixed Type I error rate and then area-under-the-ROC curve. The irony here is that what they’re doing with these results is sorting the genes by p-value, then looking at the ones with the lowest p-value for wet-lab followup. So the goal is really ranking, which can be done directly with Bayesian approaches (I have an example in my repeated binary trial case study for Stan).

    • Bob:

      I do want Pr(A>B) to be calibrated. But this has nothing to do with multiple testing corrections or hypothesis testing. I want Pr(A>B) to be calibrated in the Bayesian way: averaging over all the cases when Pr(A>B) = 0.2, say, A should actually be greater than B 20% of the time.
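
      Here’s a quick simulation sketch of that property (everything here is made up; the data are simulated from the same Beta(2,2) prior that the posterior uses, which is when this averaging works out exactly):

      import numpy as np

      rng = np.random.default_rng(2)
      n, sims = 50, 5000
      stated, happened = [], []
      for _ in range(sims):
          a, b = rng.beta(2, 2, size=2)                # "true" rates drawn from the prior
          ya, yb = rng.binomial(n, a), rng.binomial(n, b)
          post_a = rng.beta(2 + ya, 2 + n - ya, 4000)  # conjugate posteriors
          post_b = rng.beta(2 + yb, 2 + n - yb, 4000)
          stated.append(np.mean(post_a > post_b))
          happened.append(a > b)
      stated, happened = np.array(stated), np.array(happened)

      # Within each bin of stated Pr(A > B), A should beat B at about that rate.
      for lo in np.arange(0, 1, 0.2):
          m = (stated >= lo) & (stated < lo + 0.2)
          if m.any():
              print(f"stated {lo:.1f}-{lo + 0.2:.1f}: empirical {happened[m].mean():.2f}")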

      • That’s not the “bayesian way”. It’s the “I started out as a frequentist, switched to bayes, but can’t quite let go of frequentism way”, or ISOAAFSTBBCQLGOF way for short.

        The Bayesian way would be something more like “Pr(A greater than B| evidence+assumptions) = .2 means 20% of the possibilities left unspecified by evidence+assumptions make A greater than B”.

        To illustrate the difference suppose we’re trying to predict which days it’s going to rain. Here’s some perfectly calibrated probabilities determined by analyzing historical frequencies:

                  1Feb  2Feb  3Feb  4Feb
        Pr(rain)  .25   .25   .25   .25
        outcome   rain  dry   dry   dry

        Alternatively, I go out and create physics-based weather models, using distributions to model any parameter I’m uncertain about (one of my stat thesis advisors spent his career doing just that). I then run simulations using different values for those parameters and let Pr(rain|model) be the percentage of simulations that give rain for that day.

        Note: This probability has nothing to do with the frequency with which it will rain over a number of different days, but rather is the percentage of unknowns that led to rain for that specific day. This view of probability is Laplace’s old “ratio of favorable cases to all cases” and not a frequentist one. Based on this, I produce:

                  1Feb  2Feb  3Feb  4Feb
        Pr(rain)  .9    .1    .1    .1
        outcome   rain  dry   dry   dry

        The former is “calibrated” in the sense you describe, while the latter is not. Yet if we use the prediction rule “Predict rain when odds favor rain,” the “calibrated” version will be wrong 25% of the time over the next four days, while the latter will be wrong 0% of the time!

        If you think about it, the “calibration” you describe is a long-held prejudice that isn’t backed up by any analytical evidence or requirement from the mathematics itself.
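
        Here’s the same point as a tiny calculation (the numbers are just the ones from the table above):

        import numpy as np

        outcomes = np.array([1, 0, 0, 0])              # rain, dry, dry, dry
        climatology = np.array([.25, .25, .25, .25])   # matches the historical rain frequency
        physics = np.array([.90, .10, .10, .10])       # day-specific, more informative

        predict_rain = lambda p: (p > 0.5).astype(int) # "predict rain when odds favor rain"
        print("error rate, climatology:", np.mean(predict_rain(climatology) != outcomes))  # 0.25
        print("error rate, physics:    ", np.mean(predict_rain(physics) != outcomes))      # 0.0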

        • I share this attitude towards calibration. I actually use a similar forecasting example in my book. I learned it from Jaynes, I am pretty sure.

        • Hello McElreath,

          Jaynes would have pointed out that the latter is “more informative” than the calibrated version and backed it up by computing the entropy of both, with the latter having smaller entropy.

          Then he would have said the calibrated version isn’t “wrong,” just appropriate for the low informational state where all you know is some historical frequencies, as is typical of the pre-DNA genetics stuff Fisher liked to work on.

          The latter isn’t “right”; it’s just appropriate when you know more, for example when you have all of physics at hand as background knowledge, the way Jeffreys did in geophysics.

          At the end of the day though, the calibration Andrew mentioned isn’t some kind of universal Bayesian requirement.

        • The Rosenkrantz edited volume of Jaynes’ papers is what I found most rewarding. But it’s not one long argument, so requires some work. Jaynes’ Probability Theory book is more organized.

          My own book is rather Jaynesian, with a strong maximum entropy perspective in some chapters. But it is also strongly pragmatic and makes some compromises.

          There was an old blog by entsophy that was great. But I don’t think it is archived anywhere.

        • Replying to Richard McElreath about entsophy blog.

          It looks like a couple of dozen posts from Entsophy are archived on the Wayback machine:

          https://web.archive.org/web/*/http://www.entsophy.net/blog/*

          From the default sorting, some are at the start of the list and some are at the end. When you click onto an entry, you’ll need to look for snapshots that date before around June 2014, as that looks like when the blog got hacked.

        • A calibrated posterior is like an unbiased estimate. All else being equal, a calibrated posterior is better than an uncalibrated one. Without calibration, you can definitely be gamed (by arbitrage). But it’s not the only consideration, as the population average prediction brought up by Anonymous demonstrates.

          The other side of the coin is sharpness. Sharpness is like precision (inverse variance) in estimation. Predictions of 90% and 10% are sharper predictions than 25%. Sharpness can be measured information theoretically through entropy. The forecasting literature goes over many notions of calibration, and discusses the case Anonymous brings up of long-term averages being probabilistically calibrated.

          As with estimation, it’s possible to trade sharpness for calibration. And it may turn out that a nearly calibrated forecast with much greater sharpness is more useful for some purposes.

          Decision making typically involves some notion of probability. I don’t just “predict rain when odds favor rain” and act on that. It may be worth carrying an umbrella even if there’s only a 30% chance of rain. Or consider the decision to get a flu shot. I don’t look at my odds of getting the flu and say, hey, it’s only a 10% chance, so I predict I won’t get the flu, so no shot. Similarly, if you are going to make book on a football game or act as a market maker in the stock market, calibration is important.

          There’s a whole literature on evaluating classifiers based on a notion of “proper scoring rule.” Log loss is one, and it’s closely related to calibration.
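
          As a small illustration (reusing the made-up rain numbers from the example upthread; the log score is just one proper scoring rule, and Bernoulli entropy is one way to measure sharpness):

          import numpy as np

          outcomes = np.array([1, 0, 0, 0])
          climatology = np.array([.25, .25, .25, .25])
          physics = np.array([.90, .10, .10, .10])

          def log_score(p, y):
              # mean log predictive density of the observed outcomes (higher is better)
              return np.mean(np.log(np.where(y == 1, p, 1 - p)))

          def mean_entropy(p):
              # average Bernoulli entropy of the forecasts (lower means sharper)
              return np.mean(-(p * np.log(p) + (1 - p) * np.log(1 - p)))

          for name, p in [("climatology", climatology), ("physics", physics)]:
              print(name, "log score", round(log_score(p, outcomes), 3),
                    "entropy", round(mean_entropy(p), 3))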

        • The Pr(rain) = 0.25 model is well-calibrated in the small, but not in the large. Calibration in the small helps you understand how well your model predicts probabilities across the range of predicted risk, and if your model is calibrated at a broad range of the (0,1) domain, I’m pretty sure it would do much better than the dumb base rate model. To me, this is a straw man argument against the importance of calibration.

      • Anon:

        Calibration is a mathematical property of the posterior distribution, averaging over the generative model. We want all probabilities to be calibrated, in the sense that miscalibration implies a model failure. Calibration is not enough, though: it is well understood that a model can be calibrated but not useful, if other useful information is available that is not included in the model.

        • Andrew, the problem anonymous is bringing up isn’t solved by “averaging over the generative model”, it’s a model-theoretic one, in the sense of mathematical model theory.

          There are at least two models of probability. One for example is the model of probability in which probability describes the properties of infinite sequences of numbers… If I have a “true” random number generator that gives me bytes (numbers between and including 0 and 255), then over millions of draws from this random number generator, if I get 128 about 33% of the time, the probability of getting 128 is very close to 0.33 *by definition of the mapping between the concept of “probability” and “frequency in the long run”*

          So by making a one-to-one correspondence between a concept in probability theory and a concept in the world or another area of mathematics (computing theory), we build “a model” of probability theory.

          But, this is not a unique mapping, just like in geometry “line” is not a unique concept, we can identify the word “line” with say the great circle on a sphere, and then *there are no parallel lines* for example.

          We can also map the concept of probability to “information we have about what is likely to be true”. So if I say that it’s 100% true that every human currently in the US is between 0 and 4m tall, and within this range we assign normal(1.58, 0.03), it does *not* mean any of the following:

          1) in 95% of worlds we might live in, if we were to average all the people in the US we would get a number between 1.52 and 1.64

          2) In the future, 95% of the time that we do a census of all people in the US, we will get a number between 1.52 and 1.64

          3) Averaged across countries that are similar enough to the US 95% of the countries have people who average between 1.52 and 1.64 m tall…

          or any of those concepts that involve the words “of the time”

          This is not just a little problem. This is at the heart of the problem of communicating Bayesian statistics… because a Bayesian statement that the posterior distribution for the average height in the US is approximately normal(1.58,0.03) should tell you something about the world… and if you misinterpret what it tells you about the world, then how can you use Bayesian statistics?

          So what does the Bayesian result tell you? The best way I’ve come up with to think about it is this:

          Among all the options I am willing to consider based on my modeling assumptions and knowledge of the world, 95% of the available information points to the average height being between 1.52 and 1.64 m

          There is *no sense in which* this is calibratable in terms of frequency… There just isn’t a frequency involved here. It is a generalization of set inclusion and exclusion: the indicator function, which is 1 where the values are “included in the set I am willing to consider” and 0 elsewhere, has been generalized to a smooth function f(x) that is non-negative and integrates to 1… and regions have the property that the *degree to which you would consider the possibility* is the integrated weight, integral(f(x), x in region).

          There are *multiple reasons why* you might assign probability in certain ways… Some of the reasons can be because of observed frequencies… but in the end… the probability is not a measure of frequency it’s a measure of “set inclusion”

        • To read a Bayesian result, a useful way to phrase it is:

          If we are willing to include values x in our decision making processes according to a weight of inclusion p(x) prior to seeing data, and we have a predictive model which is willing to include predictions according to p(data | x), then after collecting our data we should be willing to include x values in our decision making processes according to p(x | data) = normalize {p(data | x) p(x)}
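
          A bare-bones numerical version of that statement (a grid of candidate x values and made-up binomial data, just to show the reweighting):

          import numpy as np

          x = np.linspace(0.001, 0.999, 999)     # values we're willing to consider
          prior = np.ones_like(x)                # equal weight of inclusion before data
          prior /= prior.sum()

          n, k = 10, 7                           # hypothetical data: 7 successes in 10 trials
          likelihood = x**k * (1 - x)**(n - k)   # p(data | x), up to a constant

          posterior = prior * likelihood         # p(x | data) = normalize{p(data | x) p(x)}
          posterior /= posterior.sum()

          print("posterior mean of x:", np.sum(x * posterior))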

        • “Some of the reasons can be because of observed frequencies… but in the end… the probability is not a measure of frequency it’s a measure of “set inclusion””

          Hmm, not totally convinced of that.

          Kolmogorov himself noted in his On Tables of Random Numbers the contribution of von Mises

          “…the basis for the applicability of the results of the mathematical theory of probability to real ‘random phenomena’ must depend on some form of the frequency concept of probability, the unavoidable nature of which has been established by von Mises in a spirited manner.”

          As well as in his Foundations of the Theory of Probability

          “In establishing the premises necessary for the applicability of the theory of probability to the world of actual events, the author has used, in large measure, the work of R. v. Mises”

          Justin

        • Justin,

          Von Mises was a physicist (and brother of the famous Austrian economist Ludwig von Mises) who believed probabilities only applied to (1) systems that could be repeated, and (2) cases where the frequency histograms created are stable. He was very insistent about the latter. This would narrow the applicability of probability theory to far less than 1% of its current range.

          But optimistically, I think this could be a basis for a frequentist-Bayesian compromise. Frequentists can run free in that tiny fraction of cases von Mises would agree are appropriate, and Bayesians can have the rest.

          As part of the truce, Bayesians will be good enough sports not to mention that they can handle the stable-frequencies cases as well.

        • Kolmogorov built *one particular* model of probability (in the model theory sense) basing it on the notion of “games of chance” or “random phenomenon” as described by Von Mises.

          But it’s not the only model of probability. So it’s important to realize what you *mean* when you use probability. And, if you choose as your model of probability the properties of “random sequences” then BY DEFINITION Bayesian probability would be “wrong”.

          On the other hand, if you choose as your model of probability the Cox version… then it measures a “degree of plausibility” and BY DEFINITION the frequency version would be “wrong” when your information is different from just “how often did something happen historically”. For example, if I measure 100 randomly selected people’s heights and then say that Joe in particular has a height given by normal(mu,s) while also acknowledging that Joe has some dwarfism genes… well I’d be wrong, because it’s known that people with those genes definitely don’t have heights similar to randomly selected people… even if I don’t have a proper randomly selected sample of people with those genes.

          So, it’s important to recognize what the mapping between probability numbers and concepts in the world are… otherwise you’ll get very confused.

          And, it’s worth mentioning again that *it’s not unique*. Kolmogorov is *right* (about the properties of random number sequences) while at the same time *so is Cox* (about the behavior of plausibility measures)

        • Daniel, it might not matter much to statisticians but I think one should distinguish between mathematical models of probability and interpretations of it. For example, in principle at least, both Bayesians and frequentists could use either the Kolmogorovian or the algebraic model.

        • There is *no sense in which* this is calibratable in terms of frequency

          So I have to take issue here, on account of my stance that the Bayesian approach subsumes frequency interpretations of probability. Establish any class of exchangeable estimands (and recall independence implies exchangeability) and as a matter of math there is a frequency that can be estimated too. So on its own the estimand you consider cannot be calibrated, but if you’re estimating more than one independent or exchangeable thing then it does make sense to ask if you’re doing a good job overall. If your estimates aren’t calibrated then something is wrong — you have palpably failed to count either or both of Laplace’s “favourable cases” and “all cases” correctly.

        • I’ll have to think about it. You can estimate a frequency, but how could you calibrate your estimate without knowing the “true” frequency? Sometimes in a simulated situation you could. So it’s definitely possible to investigate how well a class of models works when the model is applied to a known frequency-based generative process (which is not necessarily the same as the process assumed by the model).

          I guess it’s just important in my opinion to realize that not every probability *can* be calibrated because it doesn’t always represent repeated anything. If there’s no repetition, then there’s no Kolmogorov/high complexity/random sequence probability to discuss.

        • I don’t think it’s the case that, say, 70% of the predictions to which you give a 70% chance of occurring should in fact come true. This is somehow an additional assumption for Bayes, and it can be an appropriate assumption in some circumstances, but if 100% of your p=0.7 predictions come true, I don’t think that’s evidence that you got it wrong in an important way… unless you make this additional assumption, which is OK.

          There’s a difference between “70% of the cases that I think could be true points to X happening” and “70% of the time that I said ‘70% of the cases that I think could be true points to X happening’, X in fact did happen”

          Not having some information that would lead you to predict p = 1.00 is not a failure of the math, it’s a failure to have excellent information.

          For example: one person observes the output of a cryptographic random number generator from afar and says “there’s a 50% chance the next bit will be a 1”, another person observes the internal state of the electronic circuit and says “there’s a 100% chance the next bit will be a 0”

          both of them are right.

        • I guess you could argue that the person predicting 50% chance of the next bit being a 1, over time, should probably find that 50% of the time the next bit really was a 1, or they are doing something wrong.

          But suppose for example it’s a not very good crypto-random-number generator… it actually produces say 65% 1 values. This is because of some electrical interference that slightly biases the ADC towards producing a 1 value, and a faulty debiasing circuit.

          A person reads the spec sheet on the IC, says “hey this is a pretty well designed IC, it should produce 50% 1 values” and operates from the state of information “the spec sheet is accurate”… they make the prediction “50% of bits are 1”

          Yes, their state of information is inaccurate compared to the real-world situation… But their prediction of 50% of bits being 1 *is an accurate reflection of their state of information*. It’s just not an accurate reflection of the actual state of the world.

        • > if 100% of your p=0.7 predictions come true, I don’t think that’s evidence that you got it wrong in an important way

          Is there any outcome that would provide evidence that you got it wrong in an important way?

        • Just think of doing a simulation for a posterior probability distribution on a set of indicator variables stratified by their posterior expectations. Assessing calibration is just checking whether the realized values of the indicator variables look like a typical realization from the posterior predictive distribution. I think this is worth worrying about because of the consequences of miscalibration for Bayesian decision theory and its injunction that we should minimize posterior expected loss. If the actual loss function (insofar as that exists) isn’t in the typical set of the posterior for loss functions, then we’ve got problems.
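
          Something like this toy version (all numbers simulated; the realized indicators are generated from the stated probabilities here, so the check should pass):

          import numpy as np

          rng = np.random.default_rng(3)
          p = rng.uniform(0, 1, size=2000)   # posterior Pr(indicator = 1) for each quantity
          z = rng.binomial(1, p)             # realized indicators (here actually drawn from p)

          for lo in np.arange(0, 1, 0.25):
              m = (p >= lo) & (p < lo + 0.25)
              observed = z[m].sum()
              # posterior predictive distribution of the bin total under the stated probabilities
              sims = rng.binomial(1, p[m], size=(4000, m.sum())).sum(axis=1)
              print(f"bin {lo:.2f}-{lo + 0.25:.2f}: observed {observed}, "
                    f"predictive mean {sims.mean():.1f}, Pr(sim >= obs) = {np.mean(sims >= observed):.2f}")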

        • If you want to predict the frequency of an outcome in a sequence x1, …, xn, then view that real-world sequence as a singular non-repeatable event (which it is in reality) and put a Bayesian distribution on it, P(x1,…,xn). Then use it to generate the distribution for:

          f_x = (1/n)\sum_i \delta(x-x_i)

          That distribution P(f_x) provides the estimate for the frequency and its reasonable error range.

          Trying to do this by hacking/destroying every equation and meaning in statistics so the marginal distribution P(x_1) is roughly equal to this estimate is the root of all evil in statistics.

          And it remains evil even though occasionally P(x_1) is roughly equal to the frequency estimate. It can easily happen (and usually does in real-world scenarios) that as P(x1,…,xn) shrinks around the true sequence, giving you a more accurate estimate of the frequency, P(x_1) gets less “calibrated,” not more.
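
          Here’s a loose numerical sketch of that (the exchangeable beta-binomial form for P(x1,…,xn) is just one convenient assumption):

          import numpy as np

          rng = np.random.default_rng(4)
          n, sims = 100, 20000
          theta = rng.beta(8, 2, size=sims)                  # latent propensity (assumed form)
          seqs = rng.binomial(1, theta[:, None], size=(sims, n))
          f = seqs.mean(axis=1)                              # empirical frequency f_x per sequence

          print("P(f_x): mean %.3f, 90%% interval (%.3f, %.3f)"
                % (f.mean(), np.quantile(f, 0.05), np.quantile(f, 0.95)))
          print("marginal P(x_1 = 1):", seqs[:, 0].mean())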

        • Let me add to that last comment. Ideally we want a P(x1,…,xn|K) that shrinks around the true sequence, but obtaining knowledge K that achieves that may be impossible, or prohibitively expensive, or whatever. So we go with a weaker K’ that’s not as good, but somehow more practical.

          This may have no effect on how well we estimate f_x. Why? because not all “errors” in P(x1,…,xn) matter for P(f_x) and hence estimating the frequency f_x. So in practice, we may deliberately choose a “bad” P(x1,…,xn) that still does a good job of estimating f_x.

          This can be formalized substantially. Any two probabilities P_1(x1,…,xn) and P_2(x1,…,xn) which produce the same P(f_x) will give us the same frequency estimates. So in practice, it’s OK to find a P(x1,…,xn) that’s as “spread out as possible” while still giving the same P(f_x).

          Such a distribution will be the easiest to obtain (requiring the least inputs K) and still give good frequency estimates. What it won’t be is “calibrated” in just about any sense a frequentist would care to dream up.

        • I think Anonymous has useful commentary here. Carlos asks if there is any evidence that would say you got it wrong in an important way.

          Since the Bayesian model is about *what your information tells you* (rather than what really happens) information that shows you didn’t accurately encode what you knew would be useful.

          For example, suppose you read the spec sheet on the crypto RNG IC and it says that in the presence of a strong RF field in the microwave range you will get biased outputs but that the bias will be unknown size and direction… and also you realize that the person doing the tests is doing them next to a microwave oven… Now you have evidence that you encoded your knowledge incorrectly, both the priors and the likelihood are probably wrong relative to what you actually know about the scenario.

        • Corey: also, I’m all for assessing the information content. Surprisal, which is -log(p(outcome)), seems useful here rather than a frequency-based measure.

          If -log(p(outcome)) is very large, particularly relative to the posterior predictive distribution of -log(p(outcome)), then there is something going on in the world that “you don’t know about” (i.e., it’s not built into your model).

          That is *absolutely* a useful thing to check.
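
          A rough sketch of that check, under a stand-in normal model (the model and all the numbers here are placeholders):

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(5)
          mu = rng.normal(0.0, 0.2, size=4000)   # pretend posterior draws for the mean
          sigma = 1.0                            # model's residual sd
          y_obs = 3.1                            # the observed outcome

          def surprisal(y):
              # -log of the posterior predictive density, averaging over posterior draws
              return -np.log(np.mean(stats.norm.pdf(y, loc=mu, scale=sigma)))

          y_rep = rng.normal(rng.choice(mu, 2000), sigma)    # posterior predictive replications
          rep = np.array([surprisal(y) for y in y_rep])
          obs = surprisal(y_obs)
          print("observed surprisal:", round(obs, 2))
          print("Pr(rep surprisal >= observed):", np.mean(rep >= obs))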

        • No one statistic is going to suffice for understanding the fit of your models; -log(p(outcome)) can be high because of simple overconfidence or because even the point predictions are inaccurate. Also, if you are modelling observation noise as IID and are fitting the variance, then -log(p(outcome)) won’t help you diagnose model lack-of-fit; you have to explicitly model the lack of fit with correlated noise (or equivalently, a semi-parametric model).

        • Corey, certainly it’s worth looking at more than one statistic, but I think the general idea of surprisal is worth using more often. For example, suppose you are using Anonymous’ suggestion to look at the posterior distribution over the frequency of occurrence. It will give you a density over the observed frequency. When that is tightly peaked, small deviations from the peak are evidence of a problem; when it’s wide, small deviations are not a problem… How can we decide? Surprisal will work well here.

          I’m not quite sure what your comments about fitting variance mean. If your model can allow for very large variances, then it can make anything not too surprising (or rather, sort of equally surprising). Is that what you are referring to? If so, then I agree. So we need more than surprisal. There is also the concentration of the prediction to consider, for example; for this it seems entropy would be useful. I also think it’s probably useful to consider the surprisal under the prior of typical posterior draws of parameters. This can diagnose posterior/prior disagreement.

        • Daniel, regarding fitting variance, I’m just talking about

          y_i ~ N(expected_y_i, sigma^2)

          with a not too peaked prior for sigma^2 where expected_y_i is some function of predictors and parameters.

          To see what happens to sigma^2 when the function defining expected_y_i is misspecified, check out the decomposition of the sum of squared errors here: https://en.wikipedia.org/wiki/Lack-of-fit_sum_of_squares (keeping in mind that that decomposition is a mathematical identity not particularly tied to any particular approach to statistics). Because the model doesn’t “know” that the function can be wrong, it does what we told it to do and treats model lack-of-fit as equivalent to idiosyncratic noise. This serves as an example of the sort of problem that posterior predictive checking of -log(p(outcome)) can’t diagnose.
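
          A quick simulated illustration of that (not your example; data from a quadratic, modeled as a straight line, with ordinary least squares standing in for the fit with a weak prior on sigma):

          import numpy as np

          rng = np.random.default_rng(6)
          x = np.linspace(-2, 2, 200)
          true_sigma = 0.3
          y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(0, true_sigma, x.size)  # quadratic truth

          X = np.column_stack([np.ones_like(x), x])       # misspecified: intercept and slope only
          beta, *_ = np.linalg.lstsq(X, y, rcond=None)
          resid = y - X @ beta
          print("true noise sd:    ", true_sigma)
          print("estimated 'sigma':", round(resid.std(ddof=2), 3))  # inflated by the lack of fit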

        • Bayes is doing the right thing here. The only way you can diagnose a lack of fit is if you have some prior idea of what your error “should be” which means an informative prior on the sigma. Otherwise your model is saying “y is sort of like expected_y, except the difference could be anything in a huge range”… so big differences aren’t surprising.

          If you have an informative prior on sigma, and compare posterior sigma to prior sigma via surprisal, you’ll find that you’re quite surprised to find the sigma in a different part of the sigma space than what you expected after fitting.

          It is unsurprising that if nothing surprises you, you will not be surprised by anything ;-)

        • Random followup thought.

          As a check on model fit, it could be useful to specify a-priori an entropy in bits that you expect for posterior predictive data (ie. “how tightly should this model predict”). You can then calculate the actual posterior predictive entropy (estimated I guess by taking samples, using the proportional density calculated at the samples, normalizing the density to add to 1, and taking the mean log of that density)

          If your model doesn’t predict as well as you thought, then it may not model the process as well as you thought. At this point you have a principled way to decide whether your model has problems.

          Which is related to but not the same as surprisal of the actual data.

          Since it might be hard to estimate the entropy in bits a priori directly, to help get a handle on this quantity, you could specify a sort of reference prediction error, such as “After fitting I expect this will fit at least as well as normal(true_mean, model_sigma)”, calculate the entropy of that in the same way, and then compare the posterior entropy to this quantity to see if it’s about the same or smaller, indicating that the model is doing around as well as you thought it would…

          This would diagnose lack of sharpness or excessive sharpness, but would not necessarily diagnose bias or shape-misfit. However, bias and shape-misfit can be partially diagnosed by data surprisal… so the combination while maybe not being the be all end all, should probably be a common workflow.
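
          Something like this sketch, under a stand-in normal model (the Monte Carlo entropy estimate is -E[log p(y_rep)] over posterior predictive draws; the reference is the entropy of normal(., model_sigma); all numbers are placeholders):

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(7)
          mu = rng.normal(0.0, 0.5, size=2000)   # pretend posterior draws for the mean
          sigma = 1.0                            # model's residual sd

          def log_pred(y):
              # log posterior predictive density at y, averaging over posterior draws
              return np.log(np.mean(stats.norm.pdf(y, loc=mu, scale=sigma)))

          y_rep = rng.normal(rng.choice(mu, 2000), sigma)        # posterior predictive draws
          entropy_nats = -np.mean([log_pred(y) for y in y_rep])  # H ~ -E[log p(y_rep)]
          ref_nats = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # entropy of normal(., sigma)

          print("posterior predictive entropy (bits):", entropy_nats / np.log(2))
          print("reference normal entropy (bits):    ", ref_nats / np.log(2))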

        • Daniel, we’re more-or-less in agreement; there are some technical details that could be worked out relating to the fact that the model assumes the noise is independent and model lack-of-fit will be detectable as a violation of this assumption, but I will leave it here.
