# “What are some situations in which the classical approach (or a naive implementation of it, based on cookbook recipes) gives worse results than a Bayesian approach, results that actually impeded the science?”

Phil Nelson writes in the context of a biostatistics textbook he is writing, “Physical models of living systems”:

There are a number of classic statistical problems that arise every day in the lab, and which are discussed in any book:

1. In a control group, M untreated rats out of 20 got a form of cancer. In a test group, N treated rats out of 20 got that cancer. Is this a significant difference?
2. In a control group of 20 untreated rats, their body weights at 2 weeks were w_1,…, w_20. In a test group of 20 treated rats, their body weights at 2 weeks were w’_1,…, w’_20. Are the means significantly different?
3. In a group of 20 rats, each given dose d_i of a drug, their body weights at 2 weeks were w_i. Is there a significant correlation between d and w?

I would like to ask: What are some situations in which the classical approach (or a naive implementation of it, based on cookbook recipes) gives worse results than a Bayesian approach, results that actually impeded the science? (No doubt both approaches agree if the 20 rats are replaced by 20000.)

That is, there must be cautionary case studies in which the assumptions of classical statistics were proved not useful for some real experiment. Such case studies in my opinion are invaluable for focusing students’ attention, particularly if they have already been subjected to a cookbook statistics course.

I’ll always answer a question from a physicist! So here’s what I told him:

Yes, I have an example for you. It is a study with n=3000, looking at the attractiveness of parents and the sexes of their children.

The published analysis compared the proportion of girl births among the parents who were labeled “very attractive,” compared to the proportion of girl births of the other parents. The difference was 0.08 with a standard error of 0.03, thus statistically significant.

However, there is a lot of prior information on this topic. It would be implausible for the true difference in the population to be as large as 0.01. A reasonable prior distribution might have a mean of 0 and a standard deviation of 0.003. Under such a prior, the Bayesian inference is that the population difference is very close to 0.

To be precise, the posterior mean is 0.0008 (that is, less than 1/10 of one percentage point, for example Pr(girl) changing from 0.488 to 0.489) with a posterior standard deviation of 0.003. Thus, in the Bayesian analysis, the result is not anything close to statistically significant.
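For readers who want to check the arithmetic, here is a minimal sketch of the calculation, assuming a normal prior and a normal approximation to the likelihood of the estimated difference. The function name `normal_posterior` is just an illustrative label, not something from the published analysis.

```python
import math

def normal_posterior(prior_mean, prior_sd, estimate, std_error):
    """Combine a normal prior with a normal likelihood by precision weighting."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / std_error**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return post_mean, math.sqrt(post_var)

# Prior N(0, 0.003^2); data estimate 0.08 with standard error 0.03.
post_mean, post_sd = normal_posterior(0.0, 0.003, 0.08, 0.03)
print(round(post_mean, 4), round(post_sd, 4))
```

The prior here is about 100 times as precise as the data (the ratio of variances is 0.03^2 / 0.003^2), which is why the estimate is pulled almost all the way back to zero.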

In this case, the bad, non-Bayesian, answer impeded the science, at least in the sense that it resulted in a wrong result being published in a reputable journal (Journal of Theoretical Biology, impact factor 3) and also used as the basis of a pop-science book.

Further background is here.

This is the example that keeps on giving. A wonderful illustration of the principle that God is in every leaf of every tree.

## 51 thoughts on ““What are some situations in which the classical approach (or a naive implementation of it, based on cookbook recipes) gives worse results than a Bayesian approach, results that actually impeded the science?””

1. This would have made a good entry for the Big Bayes stories!

• Yes, but it would fail miserably in the “importance” category . . . More seriously, I do wish I’d included this in Chapter 1 of BDA3, giving it a section title such as Strongly Informative Priors.

• But doesn’t that example bite both ways? If the prior information on the topic was that attractive parents have more girl children, a good single study would impede science more in a Bayesian framework than otherwise.

It’s not win-win is it?

• No, my point is that the classical estimate is just noise here. I think that chasing noise impedes science. I’m not saying that Bayes solves all problems. What I did in the above post was respond to a correspondent who asked for an example. I gave him the cleanest example I could think of, an example where there is a huge amount of prior information relative to the data.

2. Andrew: Post data, one could well find any number of methods do well or better than some other (including those one wouldn’t dream of using), but actually, I’m surprised you’d mention this one because, as I recall, your criticism was based on the analysis being guilty of ignoring (fairly egregious) multiple comparisons, a direct frequentist error-statistical (not an obviously Bayesian) concern. Taking that into account, the result is no longer statistically significant, and error statistical reasoning without a prior precludes the problematic confidence interval construal. Now maybe you’ll say these are not unthinking applications… I believe at least part of the discussion is in this comment (but we may have discussed it elsewhere as well):

http://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/comment-page-1/#comment-13120

• Mayo:

The multiple comparisons issue is there, but that’s not the whole story. As we discussed in our paper (I think), even if the result had been legitimately statistically significant (for example, had the slope of the regression been more than 2 standard errors away from zero), I still wouldn’t believe the claim. So, yes, in this sort of problem, the classical approach of taking the confidence interval and declaring victory when it excludes zero will give worse results and impede the science. Multiple comparisons just makes this worse, by allowing the science to be impeded close to 100% of the time rather than merely close to 5% of the time.
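A rough sketch of the "5% versus close to 100%" contrast, under the simplifying assumption (not made in the comment above) that the comparisons are independent tests of true null effects at the 5% level. The function name is hypothetical.

```python
def prob_any_significant(num_comparisons, alpha=0.05):
    """Chance that at least one of several independent null tests is 'significant'."""
    return 1.0 - (1.0 - alpha) ** num_comparisons

for k in (1, 5, 20, 100):
    print(k, round(prob_any_significant(k), 3))
```

Real multiple comparisons are rarely independent, so these numbers are only a guide to how quickly the chance of finding something to report grows.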

3. Andrew two issues:

1. The example about rats involves (I presume) RCTs, whereas the example you provide appears to be observational. Does this not make a difference with regards to Phil’s question? I.e. whether classical inference for analyzing _experimental data_ is so bad it impedes scientific progress compared to Bayesian inference? (At least that is how I interpret the question).

2. The example you provide is one where the prior is strong, I believe correctly, and uncontroversial. Now I could argue — with a contrived example no doubt — that the Church authorities that condemned Galileo had a very strong (and wrong) prior about heliocentrism. Galileo would have had to run many more experiments in order to convince authorities than he was actually able to do under the Church’s “Bayesian” framework.

My more general point is that to properly answer Phil’s questions we need to look at the operating characteristics of the Bayesian framework over the Classical in the context of experiments, and relate it to some loss function. Anecdotes may be helpful but not determining.

PS I actually think we need to do both types of analysis, the difference being in the kinds of assumptions each makes.

• In a randomized experiment with complete data, I believe that a “classical”, randomization-based approach would always give at least as reliable (or valid) results as a Bayesian or any model-based, assumption-laden analysis.

• Agree. But Bayesian analysis can take it from there i.e. add some assumptions for deeper insights. Hence the value of doing both.

• Mark:

You write, “In a randomized experiment with complete data, I believe that a “classical”, randomization-based approach would always give at least as reliable (or valid) results as a Bayesian or any model-based, assumption-laden analysis.”

I disagree very strongly. In a setting where data are much weaker than prior information, a standard classical analysis will be much worse than a standard Bayesian analysis. Again, consider my example above, which is a random sample survey, which has the same mathematical properties as a randomized experiment with complete data. In this case, the problem with the classical analysis is not any problem with the randomization, rather, the problem is that the data contain very little information compared to what is already known from the scientific literature.

To put it another way: In my example (and others like it), “assumption-laden” is good. You seem to think of assumptions as a bad thing, but “assumptions” is another way of saying “information.” With more information we can give better inferences. With less information we can end up chasing noise, which can indeed waste a lot of people’s time. Or, to put it in the terms you used above, the classical estimate in the presence of lots of noise will be less reliable and less valid than the Bayesian analysis.

If you don’t believe that prior information and assumptions are useful, I challenge you to pull a die out of your nearest Monopoly game, roll it a few times, then tell me the probability you think the next roll will be a “6.” The die rolls are random, but I doubt you’d believe the classical estimate!

• Andrew, I’m sorry but I find that your response doesn’t hold much water, and involves yet another (invalid) assumption, either one that the authors of the paper actually made or one that you ascribe to them (I haven’t read the original paper, but I’m guessing it’s the latter otherwise the paper probably would have never been published in a widely read journal, for reasons that I’ll get to below).

I agree with you that random survey samples have the same mathematical properties as randomized trials with complete data, but only to the extent that they are used to estimate some actual characteristic in some well-defined population for which each member has a known probability of being selected into the sample. (Again, I haven’t read the original paper, but I’ll take your word for it that they used solid probability sampling methods.). Now, paraphrasing your summary of their paper, they estimated that attractive parents (however the hell that was measured, but I’ll give them the benefit of the doubt) *in this specific well-defined population* HAD a much higher proportion of daughters than sons, or something along those lines. I find this neither remarkable nor am I inclined to think that it’s an invalid estimate; I am very surprised that such a finding would be at all news worthy to anybody outside of this immediate population. Unless they make the (invalid) assumption that such a result will continue to hold in this population in the future or that it holds as a law of nature in other populations. That’s where the problem arises, and it has absolutely NOTHING to do with their use of “classical” methods of analysis.

Now, a reasonable scientist might respond something like “but we do studies to try to find something true about the world, we don’t care about gender ratios in this particular (past) population.” I agree. But random sampling by itself can’t get you there (random sampling coupled with strong a priori predictions and replications might). There’s an old saying that “a poor worker blames his tools.”

Yes, I do think assumptions are mostly bad. I don’t value expert opinion as “information” in cases where there’s no strong reason to believe that the future will necessarily be like the past (as Taleb wrote, expert opinion would seem valuable when it comes to judging cattle, less so when it comes to things like human psychology or behavior). In the latter case, I think of expert opinion more as noise which does not necessarily lead to better inferences (unless there’s some kind of annealing that goes on).

As to your dice example, I have no idea where that came from, unless you’re simply criticizing the classical view of probability. Personally, I don’t hold the classical view of probability, I subscribe most closely to Popper’s propensity view that probability is a feature of an experimental setup, not of the device itself (which would allow for things like imperfectly weighted dice or shrewd rollers).

• Mark:

You can follow the link and read the paper. In short: they are using information from a sample to draw inferences about the general population. Even if the sample they were using were a simple random sample, it turns out that there’s a lot more information about their parameter of interest in the prior literature than in the data.

That happens sometimes: even when there is perfect randomization, if sample size is too small for what is being estimated, the data will be less informative than the prior.

• Mark, Which classical test you use depends on an expert’s opinion (regarding the data) doesn’t it? The significance level depends on experts’ opinions doesn’t it? Which measurements were considered, what sample was achieved, … all of these things depend on expert opinions do they not? Your insistence that “assumptions” are “mostly bad” is just a redefining of what assumptions are, I think.

• Rahul:

Monopoly dice are the same as regular dice. Here’s the point: if you roll a die a few times, you will get a few numbers. From these you can do inference about various properties of the die such as the probability the die comes up 6. In such inference, the prior will dominate, and the data from your few rolls will be essentially irrelevant. There is much much more info in the prior (basically, we know ahead of time that there is a very high probability that all six probabilities are very close to 1/6) than in the data. Mathematically this is very similar to the sex-ratio example discussed in the blog post.

I gave this example to respond to a commenter who wrote: “In a randomized experiment with complete data, I believe that a ‘classical’, randomization-based approach would always give at least as reliable (or valid) results as a Bayesian or any model-based, assumption-laden analysis.” In this example, the data are coming in completely randomly (you can’t get much more random than die rolls), but the prior information is strong and nobody would ignore it.
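The die example can be sketched numerically. This is an illustration only: the Dirichlet prior with 100 pseudo-rolls per face and the three observed rolls are assumptions chosen to make the point, not numbers from the discussion.

```python
from collections import Counter

def dirichlet_posterior_mean(rolls, prior_count_per_face=100.0, faces=6):
    """Posterior mean face probabilities under a symmetric Dirichlet prior."""
    counts = Counter(rolls)
    total = prior_count_per_face * faces + len(rolls)
    return {
        face: (prior_count_per_face + counts.get(face, 0)) / total
        for face in range(1, faces + 1)
    }

rolls = [6, 6, 3]                         # hypothetical data: three rolls, two sixes
mle_six = rolls.count(6) / len(rolls)     # classical estimate from the data alone
post_six = dirichlet_posterior_mean(rolls)[6]
print(round(mle_six, 3), round(post_six, 3))
```

The raw proportion says the chance of a 6 is two in three; the posterior mean stays near 1/6 because three rolls carry very little weight against the prior.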

• This kind of concern …

“Galileo would have had to run many more experiments in order to convince authorities than he was actually able to do under the Church’s “Bayesian” framework.”

… really disconcerts me, and I’m glad you brought it up!

How *does* one avoid using their prior to protect an untenable belief? I *think* this is part of the motivation for all of the ‘objective’ and ‘non-informative’ families, but that really doesn’t seem like it strikes quite the right balance. On the other hand, just throwing out priors that lie somewhere in the middle and declaring them as clearly adequate seems less than compelling, but I also have no idea what a formal theory of “optimal prior construction” would look like, or if that’s maybe even oxymoronic somehow.

• From wikipedia with references therein:

“Galileo’s championing of heliocentrism was controversial within his lifetime, when most subscribed to either geocentrism or the Tychonic system.[9] He met with opposition from astronomers, who doubted heliocentrism due to the absence of an observed stellar parallax.[9] The matter was investigated by the Roman Inquisition in 1615, and they concluded that it could be supported as only a possibility, not an established fact.[9][10]”

So it looks like that prior belief was actually based on some kind of evidence and it wasn’t really that hard to convince authorities of the possibility of heliocentrism. Which is especially amazing in this case because Galileo performed precisely zero experiments to verify that the Earth revolved around the sun.

So that busts that little dig at Bayesians all to hell.

• Ok. Priors should encode information/evidence that’s true. That information needn’t always be “information about frequencies”. If you don’t see how to encode more general types of information into a probability distribution, then just stick with p-values and whatnot.

Just like if you don’t see how to use Newton’s Laws to predict the motion of the planets, then go back to the Ptolemaic system. Epicycles worked just fine for a 1000 years. And just like epicycles, P-values and CI’s will work with about the same level of performance as you see today, for the next 1000 years as well.

• Anonymous:

The prior has some information—it is essentially equivalent to some amount of past data. You run the experiment because that gives you more information.

• It only encodes partial information. If you have enough information to deduce the answer you wouldn’t bother with statistics.

For example I know Andrew weighs more than a field mouse and less than a Mack truck. If I restrict the search for a prior on his weight to only those distributions with support within that range, then that successfully encodes that particular piece of non-frequency information. But I still don’t know Andrew’s weight.

• So if I read you both correctly the Bayesian approach is always best bc the prior ALWAYS contains true (if partial) information.

If so I would agree but can you prove that priors are ALWAYS (partially) correct? I think this is what people question. How much does the US spend on foreign aid as % GDP? If you put a prior from the average American on a random sample from US budget lines you might get a lot of bias.

Such prior may not have come from data at all, or the data it came from is not exchangeable with the present population, or may have been collected through a faulty instrument, etc.

But obviously I agree. If the prior is always infallible then use it always.

• “So if I read you both correctly the Bayesian approach is always best bc the prior ALWAYS contains true (if partial) information.”

That’s an insane reading of what we said. Here’s a better version:

“It’s up to you to ensure your priors only encode true information. If it doesn’t then you’re probably screwed.”

How is this even controversial?

• > known to be true

“Known to be true,” strictly speaking, should be read as “tentatively accepted, in the sense that further checking at present will not be productive.”

Priors need to be checked as much as is reasonably possible.

Not only does the world change (and a prior become more wrong) opportunities to better check them can arise.

(A bit surprised to not find a comment on checking priors from Andrew here.)

• “If the prior is known to be true”

That’s the crux of the whole problem.

I once invented a magical fairy friend which had no physical manifestation of any kind, but would somehow say things like “The moon is made of cheese” or “the moon weighs less than the earth”.

This magical fairy has exactly the same characteristics as a probability distribution:

(1) It’s entirely made up by the human mind.
(2) You can’t physically touch it or identify it anywhere in the real world.
(3) It makes claims about the real world which may or may not be true.

• Entsophy

“It’s up to you to ensure your priors only encode true information”

This sounds to me like picking yourself up from your own bootstraps. I thought the whole point here is that truth — even partial truth — is what is under investigation.

I envy your sense of infallibility.

• Every example of human reasoning made by anyone under any circumstances uses one set of information to make claims about the truth/falsity of other information.

• Anonymous:

This is a bit frustrating, but I will try one more time. You asked, “If the prior is known to be true, why run the experiment.” I replied that, you run the experiment because that gives you more information. You then wrote, “So if I read you both correctly the Bayesian approach is always best bc the prior ALWAYS contains true (if partial) information.” No, I did not say the Bayesian approach is always best. Nor did I say that the prior always contains true information. (Actually, I’m not sure what that means.)

• Andrew:

You are correct. Phil is only asking for examples where classical stuff would do worse. That is what you did.

I thought that was too easy and then provided a contrived example for when Bayes might go wrong. Then others commented Bayes can never go wrong (my interpretation) and so on.

Not your fault.

• For an example where routine Bayes goes wrong, see section 3 of this paper.

For an even simpler example where routine Bayes goes wrong, consider this example: we assign a flat noninformative prior to a continuous parameter theta. We now observe data, y ~ N(theta,1), and the observation is y=1. This is of course completely consistent with being pure noise, but the posterior probability is 84% that theta>0. I don’t believe that 84%. I think (in general) that it is too high.

• @Andrew

“For an even simpler example where routine Bayes goes wrong, consider this example: we assign a flat noninformative prior to a continuous parameter theta. We now observe data, y ~ N(theta,1), and the observation is y=1. The data is of course completely consistent with being pure noise, but the posterior probability is 84% that theta>0. I don’t believe that 84%. I think (in general) that it is too high.”

This intrigues me. What do you think the p(theta>0) ought to be? I guess you are saying that you think the “noninformative” prior is unreasonable? An informative N(0,1) prior would take us down to p(theta>0) = 76% — is that too high/low?
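Both percentages quoted above can be reproduced with a short sketch, assuming the model y ~ N(theta, 1) with observation y = 1 and either a flat prior or a N(0, prior_sd^2) prior. The function name and the `prior_sd=None` convention for the flat prior are illustrative choices.

```python
import math
from statistics import NormalDist

def prob_theta_positive(y, prior_sd=None):
    """P(theta > 0 | y) when y ~ N(theta, 1); prior_sd=None means a flat prior."""
    if prior_sd is None:
        post_mean, post_var = y, 1.0
    else:
        post_var = 1.0 / (1.0 + 1.0 / prior_sd**2)
        post_mean = post_var * y          # prior mean is 0, data variance is 1
    return 1.0 - NormalDist(post_mean, math.sqrt(post_var)).cdf(0.0)

flat = prob_theta_positive(1.0)                     # flat prior
informative = prob_theta_positive(1.0, prior_sd=1)  # N(0, 1) prior
print(round(flat, 2), round(informative, 2))
```

With the N(0, 1) prior the posterior is N(0.5, 0.5): the mean moves halfway to zero and the variance halves, and those two changes partly offset each other in the tail probability.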

• Mikkel:

It depends on the context, of course, but I think in most cases that N(0,1) would be better than N(0,infinity), for the usual Lindley-paradox reason that if the correct prior really were N(0,A^2) with some large A, then it would be extremely unlikely to observe |y| to have such a low value as 1.
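One way to see the point about large A is a sketch of the prior predictive calculation, assuming theta ~ N(0, A^2) and y ~ N(theta, 1), so that marginally y ~ N(0, A^2 + 1). The function name is hypothetical.

```python
import math
from statistics import NormalDist

def prob_abs_y_at_most(bound, prior_sd):
    """Prior predictive P(|y| <= bound) when theta ~ N(0, prior_sd^2), y ~ N(theta, 1)."""
    marginal = NormalDist(0.0, math.sqrt(prior_sd**2 + 1.0))
    return marginal.cdf(bound) - marginal.cdf(-bound)

for a in (1, 10, 100):
    print(a, round(prob_abs_y_at_most(1.0, a), 4))
```

As A grows, the probability of seeing |y| as small as 1 shrinks roughly in proportion to 1/A, which is the sense in which an observation of y = 1 is itself evidence against a very diffuse prior.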

• Wow, really?

The N(0,A^2) prior says the parameter is in [-2A,2A] somewhere for some large A

The measured data says it’s in [-1,3] somewhere

The posterior puts these two together and says “the odds are the parameter is greater than zero”.

Are you seriously citing this as an example of Bayes-gone-bad?

• I think one way to look at this debate is that we are all Bayesians. Classical statisticians simply leave the prior implicit (a default prior).

So the debate is not about priors or no priors but about what kind of prior. In this scheme the classical position is to stick with the default prior. The Bayesian position is to use a better prior if you have it. The classical retort then might be to question how come you know you have a better prior for this specific application. And so on.

Since the applicability of one prior or another to this specific application is an assumption, then the natural thing to do, IMO, is to test the robustness of findings to both sets of assumptions. If they differ we can investigate further.

However, to use a Git analogy, I think the Bayesian approach ought to be the main branch. The classical tests etc ought to be a side branch to check robustness.

• A number of folks have replied that you should check your priors, that the prior is intended to capture preexisting knowledge/information, that you should ensure that your prior is a reasonable representation of what is and isn’t known a priori, etc. I think my disconcertment is rooted in reasons for worrying that this sort of construction and checking of priors is more subjective art than objective science, and is possibly a poor representation of the actual information had originally.

To try to be more careful, I think this leaves me with 4 I hope reasonably specific concerns:

– suppose human beliefs are not genuinely expressible as Savage/de Finetti-style subjective probabilities; to be specific, maybe beliefs are better described as convex sets of “possibilities,” with ambiguity aversion to larger such sets, as in a lot of the behavioral decision-making literature. Can forcing people to express their prior beliefs probabilistically then in any reasonable sense be said to accurately model their prior beliefs?

– while remaining agnostic about the particular features of a good representation of human beliefs, suppose that human beliefs, even if in principle probabilistically state-able, are not readily accessible to introspection or to elicitation through some kind of experimental choice between prospects. If this kind of inaccessibility is the case, what exactly are people producing that they call priors, and is it possible that we are just encoding more-or-less arbitrary beliefs, as opposed to genuine information?

– the subjective nature of prior construction seems to provide an easy avenue for essentially whimsical, persistent disagreement and a lack of consensus even among experts considering the same decision problem. Perhaps more to the point: it is in principle possible for us all to experience objectively very similar kinds and quantities of information about some parameter of interest, but — as a result of cognitive biases (particularly “overconfidence”), cultural differences in information processing, what we ate for breakfast on the day of the prior elicitation, etc. — produce very different, strongly stated, and contradictory prior distributions. Maybe the only meaningful answer to this concern is that a slowly growing research literature will, we hope, gradually make us aware of all the most common and problematic cognitive and computational biases that enter into prior construction?

– how do we safeguard against self-and-other-deception in prior elicitation?

p.s. Point of clarification: I am the “disconcerted Phil” above, but I am **not** Phil Nelson from Andrew’s original posting. I’m just a random grad student in the mathematical sciences; I don’t think this led to any confusion above, but it just occurred to me now that it might.

• p.p.s. Sorry for the lengthy reply! Tried to do justice-with-more-words to some of the rather vague “But it’s so subjective!” sense of concern.

• Phil:

Priors are subjective. So are likelihoods. So is the choice of what data to include in your model, etc. We need to work at all of these.

• Phil:

I also find the topic of prior elicitation very interesting in the context of behavioral economics, etc. I think elicitation was a bigger topic in the 80s and 90s when expert systems were big. Now such systems are mostly trained on (big) data.

One thing I recall is experts think intuitively about a problem — apparently that is what becoming an expert involves –, which makes it actually very hard to elicit priors from them (the conditional probabilities are wired in “gut feeling” to put it crudely). There are some interesting techniques to get around this.

A bit pedantic but you might find this book interesting: “Uncertain Judgements: Eliciting Experts’ Probabilities”. Maybe Andrew has other suggestions along these lines.

• Anon,

Thanks, I’ll check that text out! Although my advisor’s steeped in it, I’ve not yet had much reason to very thoroughly explore the elicitation literature.

To expand on the behavioral econ examples: I’m told that approaches have been developed for eliciting priors under assumptions about preferences under risk like those, for example, of original (1979) prospect theory, but I also know that that version of prospect theory assumes the existence of subjective probabilities in the first place, and so it seems the primary problem in using it is to estimate the model in such a way that you can solve for the probabilities that you’ve assumed exist. By contrast, later models — ones that try to explain the Ellsberg paradox — don’t assume the existence of subjective probabilities at all, replacing them instead with less probabilistically coherent ideas. If we were to take these models seriously as a description of a person’s beliefs, it wouldn’t seem that there *is* even a prior to be elicited in the first place. It is almost as if the person performing the elicitation procedure is, in doing so, creating the prior — that the prior was constructed by virtue of the bayesian procedure needing a prior, not because it had any subjective reality from the get-go.

Maybe there’s some coherent way to model that last bit, the “creation of the prior;” I dunno. But it seems to me that assuming that priors are out there in peoples’ heads from the beginning, waiting to be elicited, contradicts a lot of what we’re learning about peoples’ preferences under uncertainty, and that seems like an important worry if we’re going to get people to create priors for us.

• Galileo or one of his students observed the phases of Venus with a telescope before 1611. The observations were incompatible with a geocentric theory. Remarks about this were published in 1613 in the Letters on Sunspots. This information is drawn from Stillman Drake’s Discoveries and Opinions of Galileo which includes the texts of a number of Galileo’s publications including the Letters on Sunspots.

• Anon:

1. The example I gave involved random sampling to draw an inference about the general population. This is mathematically equivalent to using randomized experimentation to draw an inference about alternative treatments.

2. My impression is not that the church authorities said that Galileo’s inference had low probability but rather that they considered his hypothesis to be illegitimate. So, no, I don’t consider their rulings to be an example of Bayesian inference.

Finally, you write that “anecdotes may be helpful but not determining.” I gave an example (also known as an anecdote) because that’s what my correspondent asked for! I do think examples can be helpful; indeed, I’ve put hundreds of them in my articles and books. Other researchers can look at operating characteristics, that’s fine, but I think that my piles of examples have value too!

• With regards to mathematical equivalence:

Agree, but I think there are some nuances. In a randomized experiment I can place a strong prior that, for example, average outcomes would be the same in expectation under the null of no effect _by design_ (i.e. control potential outcomes are same in expectations for rats assigned to treatment and control groups) . IOW the prior might come from the design itself, there are no issues of where the prior information is coming from.

In the random sampling from a population experiment, I can say, by design, that the sample average will equal the population one in expectation. But there is nothing in the design that tells me what the magnitude of that quantity should be. That comes from the prior, and then the issue is where that prior is coming from. Of course, if priors are always infallible then there is no problem.

With regards to the Church:
I agree that it is a very contrived example. But the point was again to raise the question of (a) where priors come from and (b) the possibility that they may be totally wrong. But it appears you and others are suggesting this is impossible. I am legitimately not sure whether this is so by definition, induction, or deduction.

With regards to examples:

Examples are very useful, I agree, but when it comes to universal claims of the sort “priors are always (partially) informative, there is nothing to be lost by using them” I want proof.

• PS I should add that I for one think we should be using a Bayesian approach to experiments (and everything else really). I am actually a great supporter of the approach. But I think Phil raises an important question, and I am trying to understand it, not dismiss it.

Typically we run experiments to relax as many assumptions as possible. That is the point and the evidentiary standard – for better or worse. That is why I would do both types of inference, otherwise people will always come at you and say “what happens if you analyze it with a different prior?” etc. and if inferences change substantially, then we get into debates like this one about the nature of the prior.

So I am not advocating Fisher vs Bayes. I am just trying to understand the controversy.

4. My vague impression is that R.A. Fisher’s approach was hugely useful in the short run, by providing cookbook methods that most people with 3-digit IQs could follow. In the long run, however, his methods, precisely because they were good enough for government work (and many other kinds of work), seem to have retarded the growth of statistical sophistication. Our culture really should be farther ahead by now in our understanding of statistical thinking.

• The obverse is that the ubiquitous availability of powerful point-and-click statistical software has made most people with 3-digit IQs reach way beyond their highest level of incompetence.

Our culture needs to get back to basics, before it dwells on sophistication.
