# “Bayesians (quite rightly so according to the theory) . . .”

Stephen Senn writes, “Bayesians (quite rightly so according to the theory) have every right to disagree with each other.”

He could also add, “Non-Bayesians (quite rightly so according to the theory) have every right to disagree with each other.”

Non-Bayesian statistics, like Bayesian statistics, uses models (or, if you prefer, methods). Different researchers will use different models (methods) and thus, quite rightly so according to the theory, have every right to disagree with each other.

## 50 thoughts on ““Bayesians (quite rightly so according to the theory) . . .””

1. Well, taken another way, Bayesians actually can never disagree with one another — if they accept that all inferences are conditional. Every Bayesian has to agree with the results of a particular application of Bayes’ theorem, right? What they can disagree on is what to condition on, but then Senn’s statement is incorrect, because it is not “according to the theory”. But what to condition on is actually outside the theory proper, and this yields an analogous sore spot for all kinds of inference, as you point out.

2. Senn has also stated:

“2 necessary conditions for a prior distribution 1) you’d bet on it before having any data 2) no amount of data would cause you to change it”

which is as pure an illustration as you’ll ever find of the old adage:

“When a Frequentist talks about Bayesian Statistics, you’ll learn a great deal about Frequentists and less than nothing about Bayesian Statistics”

– Is it not illegitimate to say, ‘Having seen the data, I’d actually choose something different if I went back and didn’t have the data again’? (Or is it less hypothetical: ‘I will actually choose something different’)
– If so, are priors really describing prior belief or just being used as a device to work out which prior you really want and then run your analysis?
– When / how might data lead you to change your prior – if there was a strong disagreement between them? Would a revised prior always be closer to the data?

• I think you can learn a lot about so-called ‘Bayesians’ depending on how they react to this statement of mine regarding prior distributions. To the extent determined by the model the prior distribution and the data are exchangeable. So changing the prior having seen some data is like changing some of the data having seen the rest of the data, something a Bayesian ought to find peculiar, and De Finetti, amongst others, has pointed out is not allowed. Of course this should not be confused with updating the prior to a posterior. That is quite another matter but not the point here.

• Stephen:

What’s parallel in the posterior distribution is not “prior distribution” and “data” but rather “prior distribution” and “likelihood.” When you talk about “changing some of the data” you should talk about “changing the likelihood,” which can be done by changing the model of the data. As I’ve written a zillion times, I think that recognition of the (partial) arbitrariness of the prior distribution should be paired with recognition of the (partial) arbitrariness of the likelihood or data model.

• Thank you Stephen and Andrew.

Of course, I wasn’t thinking about changing the likelihood on seeing the data (I imagine De Finetti also was not). Stephen, can you point me to a reference for De Finetti?

Andrew, would you agree with Stephen’s second statement if seeing the data did not lead you to change your model? Or, when you talk about ‘(partial) arbitrariness of the prior distribution’, are you also saying you would regard it as legitimate to change the prior even if you didn’t change your model? I can’t imagine this going down well with (say) clinical trial regulatory bodies!

Thanks again.

• Tim:

I think the prior distribution is part of the model. In basic statistical theory, Bayesian or otherwise, the model is fixed (or it is chosen as part of some larger fixed space of models), but in the real world we recognize that our models are approximations, and it is standard practice to start with a simple model and then add complexities to the extent these are forced upon us from the data or from our later goals. We also do the opposite sometimes, starting with a complex model and simplifying it for reasons of computation or communication. We’re always going back and forth between models, data, and uses of the fitted model.

• Mayo:

As always, I think it’s fine to change any aspect of a model. There’s nothing special about the prior, and I am uncomfortable with discussions that present the prior as subjective and the data model as God-given.

• I don’t know what type 2 rationality is, but Bayesian style reasoning clearly tells you that if you have any uncertainty at all about what model to choose, then after seeing how well a model fits you should sometimes go back and choose one of the other plausible models:

http://models.street-artists.org/2016/04/01/how-updating-the-model-prior-or-likelihood-or-both-can-be-interpreted-as-a-necessary-consequence-of-bayesian-reasoning/

The only difference is that rarely are we able to formalize the probability functions involved. Still, we use plots and statistical calculations that describe the goodness of fit, and “gut instinct” that acts like a probability function over the results of those plots and procedures, to sometimes reject the idea that the first thing we wrote down was the right thing. Changing your model after seeing that it fits poorly is required by Bayesian ideas of rationality.

• Chapter 11 of the second volume of de Finetti covers this. See in particular page 211 “Nothing can oblige one to replace one’s initial opinion, nor can there be any justification for such a substitution.”

• De Finetti is not the be-all and end-all of Bayesian statistics. Have you read Jaynes yet?

• Andrew,
The qualification I gave, ‘exchangeable to the extent described by the model’, is important. A simple example of what I mean is given by using a beta-conjugate distribution for the estimate of a binary probability. If this is informative you can replace it by two things: an uninformative paleo-prior and some subsequent pseudo-data. Given the model, these pseudo-data are exchangeable with the real data that you then collect, and you can’t have it both ways. You can’t use the pseudo-data as a stabilising influence for most real data sets you might collect but reject them for some cases you don’t like. If you do, the whole set-up was wrong in the first place. This was the essence of my comment on Gelman and Shalizi some while back
http://bacbuc.hd.free.fr/WebDAV/data/DOM/BayesTheo/Senn-BJMSP2013.pdf
and is also covered in my blog-post on Deborah Mayo’s site.

This comment should not be taken as a denigration of applied Bayesian analysis but rather as a criticism of the argument that a theory of how to remain perfect is necessarily the best recipe for becoming good.

However, I agree with you entirely that the choice of likelihood is also (partly) arbitrary, although in frequentist accounts design is closely related to the model used or to be used and this helps in aligning likelihood and data.
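The beta-binomial point about pseudo-data can be sketched in a few lines. This is only an illustration under stated assumptions: the "paleo-prior" is taken to be the uniform Beta(1, 1), and the informative prior Beta(4, 8) and the counts are made-up numbers, not anything from the comment.

```python
def update_beta(a, b, successes, failures):
    """Conjugate update: Beta(a, b) prior plus binomial counts gives a Beta posterior."""
    return a + successes, b + failures

# Hypothetical numbers, chosen only for illustration.
informative_prior = (4, 8)   # Beta(4, 8)
paleo_prior = (1, 1)         # assumed uninformative starting point: uniform Beta(1, 1)
pseudo_data = (3, 7)         # 3 pseudo-successes, 7 pseudo-failures: Beta(1, 1) -> Beta(4, 8)
real_data = (12, 5)          # 12 observed successes, 5 observed failures

# Route 1: informative prior updated with the real data.
posterior_direct = update_beta(*informative_prior, *real_data)

# Route 2: paleo-prior updated with pseudo-data and real data pooled together.
pooled = (pseudo_data[0] + real_data[0], pseudo_data[1] + real_data[1])
posterior_pooled = update_beta(*paleo_prior, *pooled)

print(posterior_direct, posterior_pooled)
```

Both routes give the same Beta posterior, which is the sense in which, given the model, the pseudo-data are treated exactly like real observations.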

• Stephen:

1. Regarding your first point: Yes, information is information. If you put information into your analysis, and the results do not make sense, then this implies some problem somewhere. “Not making sense” is another way of saying there is a contradiction with some prior information (your belief as to what “makes sense”), and this suggests there should be some realignment, some reassessment of your prior belief or some change in the model (which could be a recognition of unforeseen problems with your data), or some combination of these.

2. You write, “in frequentist accounts design is closely related to the model used or to be used.” Traditionally, yes, for example in Snedecor and Cochran and other works of that vintage. But I think that in more recent textbooks and more recent practice, design is not taken so seriously, perhaps because statistical models are typically used outside the world of clean designed surveys and experiments. My impression is that data models and likelihoods are typically chosen based on convention and are often not seriously examined.

• Stephen:

My 2005 paper on Anova is relevant to this discussion. See in particular Figure 1 of that paper.

• Stephen, in the blog post you link to you wrote:

> The problem is thus that prior distributions (…) are infinitely informative about themselves.
and
> the prior distribution (…) has information equivalent to having see an infinity of other drugs and that is the problem.

I agree that “to deal with that requires a higher level of the hierarchy”, but I don’t think you can say that the prior is “infinitely informative about itself.” The level of certainty about the prior is simply not determined and it could be anything. In that additional level you could have a very concentrated distribution of priors or even a degenerate distribution where the prior is defined exactly. But the very same prior could also be the result of averaging over a very broad distribution of potential priors.
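A toy version of this point, as a sketch only: the two-valued parameter and all the probabilities below are hypothetical, chosen so that a degenerate hyper-level and a broad one average to exactly the same prior.

```python
from fractions import Fraction as F

def marginal_prior(hyper):
    """Average a list of (weight, prior) pairs into a single marginal prior."""
    out = {}
    for weight, prior in hyper:
        for theta, p in prior.items():
            out[theta] = out.get(theta, 0) + weight * p
    return out

# Hypothetical prior over a parameter that is either "low" or "high".
target = {"low": F(1, 2), "high": F(1, 2)}

# Hyper-level A: degenerate, the prior is known exactly.
degenerate = [(F(1), target)]

# Hyper-level B: broad, two very different candidate priors, equally weighted.
broad = [
    (F(1, 2), {"low": F(9, 10), "high": F(1, 10)}),
    (F(1, 2), {"low": F(1, 10), "high": F(9, 10)}),
]

print(marginal_prior(degenerate) == marginal_prior(broad))
```

The marginal prior alone does not reveal which hyper-level produced it, so it cannot by itself say how certain one is about the prior.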

• Senn your comment tells me a great deal about you. In particular, you seem to think people’s philosophical stylings are more important than what the equations actually say. So let’s take a look at the equations and see what they do irrespective of anyone’s philosophy, or opinions, or beliefs, or wishes, or hopes, or dreams.

I stress that for everything that follows the only things I’m using are the basic sum and product rules of probability theory, on which all statisticians agree.

Suppose a biologist has a new fly with one of potentially two mechanisms controlling its life span. Mechanism M1 allows it to live 1-7 days. Mechanism M2 allows it to live 7-14 days. You’re interested in estimating a parameter lambda related to its metabolism. Then given the data you get:

P(lambda|Data) = P(lambda, M1|Data)+P(lambda, M2|Data)
=P(lambda|M1, Data)P(M1|Data) + P(lambda|M2, Data)P(M2|Data)
=P(Data|M1, lambda)P(lambda|M1)P(M1|Data)/P(Data|M1)+P(Data|M2, lambda)P(lambda|M2)P(M2|Data)/P(Data|M2)

So if the Data looks like LIFE_SPAN = 1 day, 2 days, 3 days and so on, you’ll get P(M1|Data)=1 and P(M2|Data)=0. If the Data looks like 14 days, 8 days, … then P(M1|Data)=0 and P(M2|Data)=1. Thus depending on how the data works out this reduces to,

P(lambda|Data)= P(Data|Mi, lambda)P(lambda|Mi)/P(Data|Mi)

Where which Mi (=M1 or M2) is chosen depends on the Data. Thus the form of the prior P(lambda|Mi) is decided after seeing the Data.
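The mixture formula above can be checked numerically. The sketch below rests on assumptions that are not in the comment: lambda lives on a three-point grid, lifespans are whole days, and under each mechanism the lifespan is the lower bound plus a binomial count with success probability lambda. The priors are also made up. Only the sum and product rules are taken from the derivation.

```python
from math import comb

LAMBDAS = [0.2, 0.5, 0.8]  # hypothetical grid for lambda

# Hypothetical mechanisms: lifespan = lo + Binomial(n, lambda).
MODELS = {
    "M1": {"lo": 1, "n": 6, "prior": [0.5, 0.3, 0.2]},  # lifespans 1..7, P(lambda|M1)
    "M2": {"lo": 7, "n": 7, "prior": [0.2, 0.3, 0.5]},  # lifespans 7..14, P(lambda|M2)
}
MODEL_PRIOR = {"M1": 0.5, "M2": 0.5}

def likelihood(data, model, lam):
    """P(Data | model, lambda); zero if any lifespan is impossible under the model."""
    m = MODELS[model]
    p = 1.0
    for days in data:
        k = days - m["lo"]
        if k < 0 or k > m["n"]:
            return 0.0
        p *= comb(m["n"], k) * lam**k * (1 - lam) ** (m["n"] - k)
    return p

def posterior(data):
    """Return P(M | Data) and the model-averaged P(lambda | Data)."""
    joint = {
        name: [likelihood(data, name, lam) * pl * MODEL_PRIOR[name]
               for lam, pl in zip(LAMBDAS, MODELS[name]["prior"])]
        for name in MODELS
    }
    evidence = sum(sum(v) for v in joint.values())
    model_post = {name: sum(v) / evidence for name, v in joint.items()}
    lam_post = [sum(joint[name][i] for name in MODELS) / evidence
                for i in range(len(LAMBDAS))]
    return model_post, lam_post

def posterior_single_model(data, model):
    """P(lambda | Data, model), using only that model's prior on lambda."""
    w = [likelihood(data, model, lam) * pl
         for lam, pl in zip(LAMBDAS, MODELS[model]["prior"])]
    total = sum(w)
    return [x / total for x in w]

short_lived = [1, 2, 3]
model_post, lam_post = posterior(short_lived)
print(model_post, lam_post, posterior_single_model(short_lived, "M1"))
```

With lifespans of 1, 2 and 3 days the M2 term drops out, and the averaged posterior for lambda matches the one computed from the M1 prior alone. Swapping in lifespans such as 14 and 8 days selects the M2 prior instead.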

I stress this is just one example of hundreds I could create and uses nothing but the same basic equations used to derive Bayes theorem. So you see it doesn’t matter how many De Finetti quotes you come up with or your silly understanding of Bayesians. Whenever there’s doubt about the model, anyone who is Bayesian enough to use Bayes Theorem the way Bayesians do will sometimes correctly pick priors after seeing the data.

Those are the mathematical facts.

• You could use a simpler example: a bucket of paint which is either blue or red. You want to determine the colour. Given the colour of your hand after you put it in the bucket: p(paint|hand)=p(hand|paint)p(paint)/p(hand)
With a reasoning like yours, you can distinguish the two scenarios M1=”red paint” or M2=”blue paint” depending on the colour of your hand to use either
p(paint|hand)=p(hand|paint,”red paint”)p(paint|”red paint”)/p(hand|”red paint”)
or the “blue paint” alternative. So if you see your hand red, you can use the prior “the paint is red” and otherwise “the paint is blue”.

It is also easy to prove that in general instead of doing posterior = likelihood * prior
you can do posterior = modified_likelihood * modified_prior_correctly_picked_after_seeing_the_data
where modified_prior_correctly_picked_after_seeing_the_data = likelihood * prior
and modified_likelihood = 1

All of which are mathematical facts that I’m pretty sure Stephen Senn understands.

• Carlos: the thing is, it’s often the case that before seeing any of the data you don’t know what you’re going to be analyzing, and after seeing just a little of the data, you know a lot about what you’re analyzing. That is, by just knowing the subject that your data is relevant to, you can pick specific types of models. Sticking with the idea of paint:

A person sends a canvas to you painted by famous painter Foo who always paints portraits in oil, and landscapes in Acrylic. They want you to use fancy radiographic techniques to figure out what this painting is painted over….

The canvas arrives wrapped in packing material and with the distinct smell of oil paints. You can now bring to bear on this project a diary you have by painter Foo which lists every person he ever painted a portrait of.

You arrive at this prior over possible subjects by “looking at the data” (namely, smelling the oil). The fact that it can be put into a bigger model where you might have had a landscape and you might have had a portrait MEANS that “smelling the painting” and THEN choosing your prior over subjects actually IS consistent with Bayesian reasoning. It is, in fact, the whole topic we just got finished discussing. You condition your prior on “smelling oil” and you also condition your likelihood on “smelling oil” and you move on to trying to match the radiographic results to photographs of the people this person painted.

• Daniel, as you said we have just discussed this issue and we agree, so I don’t know what your point is. What you describe is consistent with Bayesian reasoning and is covered, I think, by Senn’s caveat to his remark about modifying priors: “Of course this should not be confused with updating the prior to a posterior.”
Because that’s what’s going on when you condition on part of the data: you get a posterior combining “part of the data” with “prior”, and this posterior becomes “new prior” that you combine with “rest of the data” to get to the final posterior (and the result is the same that you would get combining all the data with the prior in a single step).
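The claim that two-step and one-step updating agree can be verified exactly on a small example. This is a sketch under assumptions: the three-point grid for the success probability, the flat prior, and the 0/1 data are all hypothetical, and exact fractions are used so the comparison involves no rounding.

```python
from fractions import Fraction

def bayes_update(prior, data):
    """Update a discrete prior over a success probability theta with 0/1 outcomes."""
    weights = {}
    for theta, p in prior.items():
        like = Fraction(1)
        for x in data:
            like *= theta if x == 1 else 1 - theta
        weights[theta] = p * like
    total = sum(weights.values())
    return {theta: w / total for theta, w in weights.items()}

# Hypothetical flat prior on three candidate values of theta.
prior = {Fraction(1, 4): Fraction(1, 3),
         Fraction(1, 2): Fraction(1, 3),
         Fraction(3, 4): Fraction(1, 3)}
data = [1, 1, 0, 1, 0, 1]

# Condition on part of the data, then use that posterior as the new prior.
two_step = bayes_update(bayes_update(prior, data[:2]), data[2:])

# Condition on all the data in a single step.
one_step = bayes_update(prior, data)

print(two_step == one_step)
```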

3. This is taken entirely out of context so as to miss the reference. Senn was referring to something that nearly all Bayesians appear to agree on (alert me to any exceptions, I know just 1): that p-values “exaggerate” the evidence against the null. My post on this with links to Senn* is here:
http://errorstatistics.com/2016/01/17/p-values-overstate-the-evidence-against-the-null-legit-or-fallacious/

Of course the argument rests on assuming a p-value smaller than your chosen posterior in the null is exaggerating the evidence against (other versions are based on Bayes factors), and Senn’s point is that it’s only with the spiked priors to point nulls that the “exaggeration” appears. With a less biased prior, as Casella and R.Berger (1987) show, the difference evaporates.
Senn remarks: “to therefore dismiss P-values
‘…would require that a procedure is dismissed because, when combined with information which it doesn’t require and which may not exist, it disagrees with a procedure that disagrees with itself.’
http://errorstatistics.com/2015/12/12/stephen-senn-the-pathetic-p-value-guest-post-3/

*I also link to Casella and Berger(1987) and Berger and Sellke (1987)
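One way to see the contrast numerically, as a sketch rather than a reproduction of either paper: assume unit-variance normal data, a point null theta = 0 with prior mass 0.5 and a N(0, tau^2) prior on the alternative for the spiked case, and a N(0, tau^2) prior with no spike for a one-sided null theta <= 0. The values z = 1.96, n = 1 and the two tau settings are illustrative assumptions.

```python
from math import erf, exp, sqrt

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_sided_p(z):
    return 2.0 * (1.0 - normal_cdf(abs(z)))

def posterior_null_spiked(z, n, tau, prior_null=0.5):
    """P(theta = 0 | data): point mass on 0, theta ~ N(0, tau^2) otherwise."""
    shrink = n * tau**2 / (1.0 + n * tau**2)
    # Bayes factor for the null: ratio of the two marginal densities of the sample mean.
    bf01 = sqrt(1.0 + n * tau**2) * exp(-0.5 * z**2 * shrink)
    post_odds = prior_null / (1.0 - prior_null) * bf01
    return post_odds / (1.0 + post_odds)

def posterior_nonpositive_smooth(z, n, tau):
    """P(theta <= 0 | data) under theta ~ N(0, tau^2) with no spike."""
    shrink = n * tau**2 / (1.0 + n * tau**2)
    return normal_cdf(-z * sqrt(shrink))

z = 1.96
p_two = two_sided_p(z)
p_one = 1.0 - normal_cdf(z)
spiked = posterior_null_spiked(z, n=1, tau=1.0)
smooth = posterior_nonpositive_smooth(z, n=1, tau=1000.0)

print(p_two, spiked)   # spiked prior: posterior null probability far above the p-value
print(p_one, smooth)   # diffuse smooth prior: posterior close to the one-sided p-value
```

Under these assumptions the gap between p-value and posterior appears with the spiked prior and essentially disappears with the diffuse smooth one.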

4. It is often practically impossible for non-statistical collaborators to fully understand the judgment calls that need to be made in selecting one statistical approach over another. This means that if a manuscript only presents the favored approach of a particular statistician, the scientific conclusions drawn risk sensitivity to which statistician happens to be on the research team.

One way to overcome this is to perform sensitivity analyses, pursuing a variety of approaches corresponding to the reasonable range of opinion among statisticians, but in many situations this range is simply too wide to be adequately explored. Another is to simply view the research as a joint product of the statistical and non-statistical members’ intellectual contributions, and hence view such sensitivity as on par with the sensitivity to which PI happens to be leading the project. Still, many non-statistical collaborators seem to view differences of opinion between statisticians as differences in competence, and it is often difficult to convince them otherwise.

• > many non-statistical collaborators seem to view differences of opinion between statisticians as differences in competence
Agree – that may largely arise from a view that there are known and agreed ways to convert uncertainty to complete certainty (at least in statistical output/conclusions) by competent statisticians. Especially if folks have had that one introductory stats course.

Some general thoughts on how methods are/should be chosen here – http://www.stat.columbia.edu/~gelman/research/published/authorship2.pdf

• Sometimes it actually is a difference in competence. The problem is that the ‘non-statistical members’ will have no way of assessing even the sign of the competence.

5. Andrew,
Thanks. I will check your 2005 paper out. In the meantime do you have some sort of blog nursery that the toddler trolls could be invited to play in while the grown-ups have a conversation?

• Stephen:

This is the blog nursery! If you want to see real trolls, check out the comments at Marginal Revolution!

• Stephen:

The blog nursery may well be a necessary evil for the wider discussion of the ideas in your accusation of modelling being a murder and Andrew’s ANOVA paper.

One of the many thorns you deal with that bugs me the most is random effects being put on mu in meta-analysis. I lose that argument with other statisticians most of the time. I read Andrew’s ANOVA paper 6+ times when I was doing my thesis, and my guess is that it’s one of his least influential papers.

Just ignore any comment that seems not worth responding to.

• Keith,
It is nice to see Nelder recognised in Andrew’s paper. Do you know if there is any package in R that has the equivalent of the GenStat approach of 1) declaring the block structure 2) declaring the treatment structure 3) declaring the design matrix and then 4) having the ANOVA form follow automatically?

I will need to devote more time to Andrew’s paper, but recent work I have been doing on n-of-1 trials has made me appreciate how strong the distinctions can be between a) establishing if there was a treatment effect in the patients studied and b) predicting what it might be in future patients. Fixed-effects meta-analysis is relevant for the former and random-effects for the latter.
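A small sketch of the two summaries, with loud caveats: the comment does not name an estimator, so the DerSimonian-Laird moment estimate of between-study variance is swapped in here as one common choice, and the four effect estimates and their variances are made-up numbers.

```python
def fixed_effect(estimates, variances):
    """Inverse-variance weighted mean and its variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    return pooled, 1.0 / sum(weights)

def dersimonian_laird_tau2(estimates, variances):
    """Moment estimate of the between-study variance, truncated at zero."""
    weights = [1.0 / v for v in variances]
    pooled, _ = fixed_effect(estimates, variances)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    return max(0.0, (q - (len(estimates) - 1)) / c)

def random_effects(estimates, variances):
    """Random-effects mean, its variance, and the estimated tau^2."""
    tau2 = dersimonian_laird_tau2(estimates, variances)
    pooled, var = fixed_effect(estimates, [v + tau2 for v in variances])
    return pooled, var, tau2

# Hypothetical per-patient (or per-study) effect estimates and sampling variances.
estimates = [0.1, 0.5, 0.9, 0.3]
variances = [0.01, 0.02, 0.015, 0.01]

fe_mean, fe_var = fixed_effect(estimates, variances)
re_mean, re_var, tau2 = random_effects(estimates, variances)

# Rough variance for the effect in a new patient: uncertainty in the mean plus tau^2.
future_var = re_var + tau2
print(fe_mean, fe_var)
print(re_mean, re_var, tau2, future_var)
```

The fixed-effect variance describes the average effect in the units actually studied, while the random-effects quantities carry the extra between-unit variance that matters when predicting for a new one.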

• We’re all trolls Stephen, the only difference is you waste taxpayers money being wrong, and I waste my free time being right.

• No. An important difference is that I don’t hide behind anonymity but you are too cowardly not to. Prove me wrong by admitting who you are or prove me right by posting another rude anonymous comment.

• Why the obsession with identification? Does the man matter or the ideas?

I’ve gotten excellent suggestions & responses from totally anonymous posters on USENET, stackexchange etc.

OTOH, my Facebook feeds are full of non-anonymous idiots spouting drivel. I strain to think how it would help me if I knew “Laplace”‘s real name.

• I think it is possible to provide comments without being insulting. However, since I am forthright, but (I maintain) not insulting I have always signed my reviews for statistics journals (I have done many, many reviews over many years) where the journals have allowed me to do so. Statistics in Medicine is such a journal and it’s a policy of which I approve.

• My name is Joseph Wilson. I was a Captain in the Marine Corps and spent 3 years in the Sunni Triangle of Iraq. For example in 2004 I was on the South West side of Fallujah from the First Battle of Fallujah up to the second one. Now I’m not the most courageous Marine that ever lived, but I’m endowed with about a thousand times more courage than you.

Look back through my comments. I stuck to the technical facts. You’re the clueless idiot making rude comments.

• I knew as soon as he adopted the new disguise, was warned because of a previous, nontrivial, mishap. The truth is I was actually relieved to learn he hadn’t died or been taken away (after that cartoon and all). Unfortunately, he presents an obstacle to my commenting on this blog. (I came today because Senn mentioned he replied.) Please don’t respond.

• Don’t let me stop you from commenting Mayo, it’s still marginally a free country (thanks to people like me; academics enact bans against free speech seemingly every chance they get).

Seriously, Senn has no monopoly on being wrong, so speculate away! I swear I won’t be so rude as to provide proofs of counter examples to your claims like I did for poor old “super brave” Stephen.

• A surprising revelation and a perfect put down! I agree with Taleb’s assessment (eg Antifragile) that soldiers deserve (far) more respect than academics. So, Respect!

• Goodness knows I’ve had my differences with Laplace, but the notion that cowardice is among his flaws is laughable to those of us familiar with him. If you have any doubt, I’ll attest to the fact that he is who he claims to be in his comment below.

• Or, y’know, above.

• +1

I don’t agree with Laplace’s posts a lot but I don’t think anonymity has anything to do with the point in question here.

6. This was to reply to Daniel Lakeland, there was no room underneath. Good saw type 3 rationality as informal considerations “coming to the rescue of the formal” to use a Senn phrase. The issue of prior change and model revision in Bayesian inference is clearly one of the topics that the ASA should subject to its “balanced scrutiny” of Bayesian inference, among “other” methods compared to p-value reasoning–assuming they go through with this. It wasn’t listed though in this particular discussion of precautions today: http://errorstatistics.com/2016/04/01/er-about-those-other-statistical-approaches-hold-off-until-a-balanced-critique-is-in/

• There is no question that almost in every case when I sit down to analyze some experimental or observational data, I have some area where I have obvious significant uncertainty *about the model*. People who write models frequently know the feeling. You ask yourself “should I put in a parameter here that represents such and such?” or “do I need to account for the way this data was collected by assuming it has an unknown bias or that measurements are related through time?” or “will it be good enough to impute these missing data points as XYZ?” etc etc.

But, I also usually have information about how certain aspects of the model are likely to affect the way in which the model fits the data. So for example I might model a series of data points as independent draws from a distribution, or I might model them as a time-series. When I go from one model to another, I expect the time-series to improve the fit of the model relative to a metric that compares points that are nearby in time, when the timeseries view is appropriate, and to actually make that fit worse when the timeseries view is inappropriate.
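One metric that "compares points that are nearby in time" is the lag-1 autocorrelation of the residuals left over by the independent-draws model. The sketch below is an illustration under assumptions: the data are simulated, the time-series alternative is taken to be an AR(1) process with coefficient 0.8, and the seed and sample size are arbitrary.

```python
import random

def simulate_ar1(n, phi, rng):
    """Simulate x[t] = phi * x[t-1] + standard normal noise."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def lag1_autocorrelation(series):
    """Correlation between neighbouring residuals after removing the mean."""
    mean = sum(series) / len(series)
    resid = [v - mean for v in series]
    num = sum(a * b for a, b in zip(resid, resid[1:]))
    den = sum(r * r for r in resid)
    return num / den

rng = random.Random(0)  # fixed seed so the sketch is repeatable
iid_series = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ar_series = simulate_ar1(2000, 0.8, rng)

print(lag1_autocorrelation(iid_series))  # near zero: independent draws look adequate
print(lag1_autocorrelation(ar_series))   # large: neighbouring points are related
```

A value far from zero is the kind of "indication that something is wrong" with the independent-draws model that would prompt a move to the time-series view.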

It’s clearly impossible to put a formal probability distribution over every model you might ever consider fitting, even though it’s clearly a finite set as we are finite beings living for a finite time. It’s just too inefficient to think up a bunch of models we currently believe are very low probability (aliens came to earth, the CIA paid the experimenters to lie, the instrument was calibrated in feet but labeled in meters, the PI went on vacation and made up all the data due to extreme funding pressure). But humans have a tendency to be able to pick out a few possibilities and run with them until there is an indication that something is wrong.

The existence of an “indication that something is wrong” implies an expectation of “what looks right”, in other words a Bayesian probability distribution over various metrics for goodness of fit.

The usual route to an analysis is to try out something in the high probability region of model-space, and then see if it makes sense, that is, compare how it fits the data to your expectations for how good a model with that level of complexity should fit the data if it is a correct (or at least a good) model. When the expectations are not met in some way, it indicates DATA which we can use to further refine our inference *about the form of the model*. The concept is clearly Bayesian, even if the mathematics is not carried out formally in a computer.

Can it go wrong? Certainly. Just as any search for truth can go wrong with sufficient reason. But, it’s absolutely meaningless to say that somehow “the prior is infinitely informative about itself” and it can never have a logical reason to change. That simply ignores a whole layer of uncertainty over the model choice which is real and palpable to the people who write the models.

A Bayesian analysis is always conditional on the model being a good approximation. “IF THINGS ARE LIKE M1 then given the data, we discover parameter q is close to 1.0 (or whatever)” but most Bayesians working on realistic problems are not going to be dogmatically SURE that it really is true that “THINGS ARE LIKE M1”.

• Box gave the TL;DR summary years ago: All models are wrong but some are useful. Applies to both Bayesian and frequentist as well as physical models.