Priors are important in Bayesian inference.

Some would even say: “In Bayesian inference you can—OK, you must—assign a prior distribution representing the set of values the coefficient [i.e., any unknown parameter] can be.”

Although priors are put first in most expositions, my sense is that in most applications they are seldom considered first, are checked the least and actually fully comprehended last (or perhaps not fully at all).

It reminds me of the comical response someone gave when asked for difficult directions – “If I wanted to go there, I wouldn’t start out from here.”

Perhaps this is less comical – “If I am going to be doing a Bayesian analysis, I do not want to be responsible for getting and checking the prior. Maybe the domain expert should do that, or I’ll just accept the default priors I find in the examples sections of the software manual.”

In this post, I thought I would recall experiences in building judgement-based predictive indexes where the prior (or something like it) is perhaps more naturally comprehended first, checked the most and settled on last. Here there are no distractions from the data model or posterior, as there usually isn’t any data nor is any data anticipated soon – so it’s just the prior.

Maybe not at the time, but certainly now I would view this as a very sensible way to generate a credible Bayesian informative prior, one that involved intensive testing of the prior before it was finally accepted. Below, I am recounting one particular example of this I was involved in about 25 years ago as a prelude to investigating, in later posts, what might be a profitable (to a scientific community) means to specify priors today.

We did write up the methodology at the time, but I think I can give enough description of it by recalling one of the more interesting applications: developing a predictive index for a children’s aid society so that they could predict whether a judge would find for child neglect (child to be put into foster care). Going to court to get an intervention was very expensive, and the children’s aid society wanted as good a sense as they could get of how likely they were to be successful before proceeding.

The process involved identifying and recruiting a group of experts that adequately spanned the knowledge around child protection and the court’s involvement in it. The group comprised social workers from children’s aid societies as well as lawyers and judges with experience in child protection cases. Prior to the first meeting they were interviewed by a consensus group facilitator to ensure they understood the task and were qualified and willing. In addition, the facilitator tried as well as they could to get each individual expert’s sense of the dimensions (variables) such an index should have, how each dimension should be graded, and how an overall score should be discerned from these. The facilitator then tried their best to form a naive consensus index using all the group members’ individual input as a way to start the first day of a two-day meeting. On the first day, the experts would jointly discuss their views on the facilitator’s admittedly amateur attempt at a consensus index and try to make improvements on it. (The critical work of the facilitator was to ensure the work was jointly and severally done by all experts, without the usual separate cliques of members spontaneously forming to do battle for their superior view of what would be best.)

It was really important to get a credible, even if very tentative, consensus index by the end of the first day, as it was used to generate fake cases overnight for the group members to review and argue over, giving their individual subjective judgements of how likely such a case would actually lead to a judge making a finding for child neglect. The subjective judgement scores would then be compared to the score based on the tentative consensus index. Then all was revised: variables, scoring of the variables, and their subjective scores. This review and scoring of fake cases also helped the group members reflect on how common it was for cases like these to occur and whether there were common cases that did not come up among the fake cases generated. As the meeting progressed, the cases would continually be recalled for further discussion, revision of subjective scores, variables and scoring of the variables, as well as possible modifications to or creation of new fake cases to ensure coverage of all cases likely to arise in practice. A unique incentive was used to get completion of the tentative consensus index by the end of the first day: dinner was provided on site and not served until a tentative consensus index was agreed upon. There was not the same draconian incentive on the second day, and consensus at that point was later to be confirmed by email shortly after the meeting (within a week or so) to allow for second sober thought.

The resulting index (as best I can recall) involved five dimensions of variables – providing shelter, food, clothing, education and home supervision – with levels one to five, one being full provision and five being essentially no provision. The scoring rule was simple: a judge would likely find for neglect if there was a five on any one dimension or three or more fours on any combination of dimensions, and would not otherwise. An apparently sensible and credible predictive index in the eyes of the group of experts that developed it, along with some additional colleagues they could coerce into looking at it. (Unfortunately I have no idea what happened with it.)
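The recalled rule is simple enough to state as code. Here is a minimal sketch – the function name and grade encoding are mine for illustration, not from the original index documentation:

```python
# Five dimensions, each graded 1 (full provision) to 5 (essentially
# no provision): shelter, food, clothing, education, home supervision.
def predicts_neglect(grades):
    """Predict a finding of neglect: any five, or three or more fours."""
    assert len(grades) == 5 and all(1 <= g <= 5 for g in grades)
    return 5 in grades or sum(g == 4 for g in grades) >= 3

# A five on any one dimension triggers the rule...
print(predicts_neglect([1, 2, 5, 1, 1]))  # True
# ...as do three fours in any combination...
print(predicts_neglect([4, 4, 4, 1, 1]))  # True
# ...but two fours and otherwise adequate provision does not.
print(predicts_neglect([4, 4, 3, 2, 1]))  # False
```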

Again, I am not sure if I thought of this as a careful specification and testing of a prior at the time, but I would now. The prior-testing phase was extremely important. Some people who were involved in this type of judgemental predictive index work would regress the finalised subjective scores given to the fake cases against the finished index variables to improve the index weights or coefficients. I was completely opposed to this at the time as it seemed like double use of data – the experts used these scores to mentally revise the weights and then an analyst was re-using them to revise the weights. Not sure what I would do today – maybe just partially revise them using some fractional power of the likelihood?

I found the work very enjoyable and was disappointed it did not continue to be done within my group. It was, though, a very expensive way to generate a prior (say $20,000 plus)!

“I wouldn’t start from here” (joke) “‘Tis the divil’s own country, sorr, to find your way in. But a gintleman with a face like your honour’s can’t miss the road; though, if it was meself that was going to Letterfrack, faith, I wouldn’t start from here.”

So this is my current fairly vague prior on what might be a profitable (to a scientific community) means to specify priors today.

Keith, it’s an interesting scenario, and it goes beyond what is usually thought of as specifying a prior. Your example shows nicely how “prior knowledge” also has to inform the likelihood. To the extent that you used “fake cases” to figure out what the likely outcomes would be, and then back-calculated from that the implied reasonable values for the weights, it’s more or less specifying a joint distribution:

p(Outcomes, Parameters, FakeData | Knowledge)

The form of this probability distribution involves both a choice of something that looks like a likelihood, p(Outcomes | Parameters, FakeData, Knowledge), and something that looks like a prior, p(Parameters | Knowledge, FakeData), as well as a kind of meta-prior, p(FakeData | Knowledge) – which is to say, your knowledge informed your choice of what was realistic FakeData to use in building your model.

So, taken more broadly, the choice of likelihood is itself the application of prior knowledge, and your example illustrates this nicely.

In fact, thinking about this a bit more, here’s an analysis of what was going on:

At its simplest, you could say that your goal was to specify a prior probability over the weights needed in some kind of nonlinear regression over the 5 dimensions of shelter, food, clothing, etc., as well as to specify a mathematical form for this nonlinear regression (a likelihood).

The domain experts had a fairly good sense of what the outcomes would be in any particular case, but very little ability to do mathematical modeling (that is, to translate their internal ability to predict into a formula for the likelihood).

A mathematical modeler could probably pick out some kind of mathematical form like a nonlinear logistic regression, but couldn’t assign numbers to the parameters to make things realistic without help from the experts.

The experts on the other hand, not having mathematical modeling skills, couldn’t just directly assign numbers to the parameters either (no understanding of how to make priors).

Some of the experts however, had an ability to give examples for what *kinds of cases* there were, that is, to sample from a prior over cases.

So, we construct a prior over the regression parameters by doing the following:

Construct a set of training data as follows: repeatedly

1)

a) Choose a case from among the high probability cases. This implies an informed prior over the kinds of cases.

b) Have the experts evaluate the outcome in the case. This implies they already have an ill-specified but fairly specific internal prediction “formula” or “method”.

a and b then turn their internal thoughts into a set of observed predictions (Data! about their Thoughts!).

2) Implicitly choose an enormously broad prior over the parameters for your nonlinear regression (say, uniform(-10^30000000, 10^30000000), a perfectly well defined proper prior that includes every number that could possibly come up in almost any real-world problem).

3) Do Bayesian inference on the parameters in the nonlinear regression by calculating a posterior distribution for the parameters using the fake data and the enormously broad implicit pseudo-prior mentioned above.
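As a concrete sketch of steps 1–3 above – with entirely invented cases, weights, and expert calls standing in for the panel’s work – one could simulate expert judgements on fake cases and fit a logistic regression under an essentially flat prior, where the posterior mode coincides with the maximum-likelihood fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: fake cases (five dimensions graded 1..5) and simulated
# expert calls -- in the real exercise these came from the panel;
# here both the cases and the "true" weights are invented.
X = rng.integers(1, 6, size=(40, 5)).astype(float)
true_w = np.array([0.8, 0.6, 0.5, 0.4, 0.9])
p_expert = 1.0 / (1.0 + np.exp(-(X @ true_w - 9.0)))
y = rng.binomial(1, p_expert)  # expert "would find neglect" judgements

# Steps 2-3: with an essentially flat prior the posterior mode is the
# maximum-likelihood fit; plain gradient ascent on the logistic
# log-likelihood stands in for full posterior inference.
w, b = np.zeros(5), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w += 0.01 * X.T @ (y - p) / len(y)
    b += 0.01 * np.mean(y - p)
```

A real version would carry forward the whole posterior over (w, b) rather than a point fit, and would let the experts inspect the implied predictions case by case.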

In essence you’re turning the problem of specifying a Bayesian model for the process into a Bayesian inference problem on observed example data.

This means your Bayesian model isn’t inference about *what actually happens* (i.e., data observed from court outcomes) but rather is a Bayesian model of *what the experts think will happen* – which is exactly what a prior (and a likelihood!) is supposed to be: a model for what you think is likely to happen in any given outcome.

I think this nicely illustrates what Bayesian inference is about, and how it’s different from a Frequentist analysis in which your goal is to characterize which random number generators from within some pre-defined set might have produced actual observed data from the real world.

An ABC (Approximate Bayesian Computation) scheme for learning an analytical approximation to a model from which you can only simulate data!

Compare this to a scheme where you’re trying to figure out how to predict when a mechanical part will crack based on observed damage. You have a fairly good idea of what kinds of damage are observed in the wild, but you can’t predict failure without running a large PDE simulation for the material. You’d like to predict the probability of failure from some kind of simpler model. So, you generate a bunch of simulated damaged parts, run the PDE simulation on them, get the predicted failure loads, and then use an enormously broad prior on the functional form of (damage, load) -> (failure frequency) and do Bayesian inference on the example data to get a specific range of functions that predict with uncertainty.

In your example, the fake cases and my fake parts play the same role, and the experts’ predictions for court outcomes and my PDE black-box simulator play the same role.

This technique is broadly applicable across a lot of situations. It’s important to distinguish, though, between doing an analysis of what actually occurs (say, having real-world court cases, or real-world failed bridge parts) vs doing an analysis of what some expert predicts will occur (say, the experts’ predictions, or the idealized PDE for idealized materials and loads).

At the risk of complete heresy: the prior distribution can also be thought of as a null hypothesis. If you have the marginal prior for one of the model’s parameters, you can then ask about the “p-value” that the observed data actually came from that prior distribution.

Not to suggest anything like a 5% test, but sometimes this calculation can identify that there is an extremely remote chance that the data could have come from the prior. That should mean that the data itself was surprising in some way. In my experience, this is often due to errors in the data collection rather than in the definition of the prior.

The data doesn’t come “from the prior” but rather from some data prediction distribution given some value of the parameter chosen from the prior.

p(Data | Parameter), and then Bayesian inference proceeds by combining the prior probability and the data probability p(Data | Parameter) p(Parameter)

so it’s possible to choose a posterior sample for the Parameter and then ask what the probability for that was under the prior… and if it’s low, perhaps your prior was based on faulty reasoning.
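A toy version of this check, with hypothetical numbers (a standard normal prior and a posterior that, say, has concentrated around 4):

```python
import numpy as np

rng = np.random.default_rng(1)

# Prior: Normal(0, 1). Suppose the posterior ends up concentrated
# near 4. How much prior mass lies at least that far out?
prior_draws = rng.normal(0.0, 1.0, size=100_000)
posterior_point = 4.0
tail = np.mean(np.abs(prior_draws) >= abs(posterior_point))
print(tail)  # essentially zero: the prior found this region incredible
```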

Keith has linked to some research on “relative surprise” in the past which is more or less using this fact that sometimes the likelihood picks out something that is improbable under the prior.

But at its core, your suggestion of looking at how probable the posterior high probability values are under the prior is a good check on whether your thinking about the problem is realistic, and/or what’s going on with the model.

I’d very much hesitate to say that you should re-concentrate your prior in a different region of space post-analysis, but it might make sense to broaden your prior out a lot if you started concentrated in some region of space and afterwards you realize that you had been overly confident or had forgotten to take something into account.

I work in microbial ecology and we are slowly moving over to Bayesian analysis in our lab. One problem I keep trying to wrap my head around is how to pick our priors and test them before they hit our wall of data.

Picking priors is in practice rarely all that hard. I think people have this impression that there’s some “correct answer” and you need to “get close” to that answer. This is a mistaken belief. The only “correct” answer is the one that encodes what little information you really do have.

Often people have at least an order of magnitude guess or can guess a range for the logarithm (say the parameter is 10^3 to 10^6 so the base 10 logarithm is somewhere in 3..6). If that’s all the info you know, then that’s what you put in your prior.

I can put a prior on your height in cm of exponential(1.0/170) and be sure that your actual height is in the high probability mass region – and I don’t even know who you are, whether you’re male or female, or whether you are or are not an NBA player. My prior covers all NBA players. But I know more than that: I could probably do gamma(3, 3.0/170), since I know you’re not 0, 1, or even 10 cm high, and the tallest person ever was probably what, 300 cm? (Check Wikipedia… Robert Wadlow, 272 cm.)
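Both height priors above can be checked by simulation (note numpy parameterizes by shape and scale, so the rate form gamma(3, 3.0/170) becomes shape 3, scale 170/3):

```python
import numpy as np

rng = np.random.default_rng(2)

# exponential(1.0/170) has mean 170 cm; gamma(3, 3.0/170) keeps the
# same mean but downweights implausibly tiny heights.
expo = rng.exponential(scale=170.0, size=100_000)
gam = rng.gamma(shape=3.0, scale=170.0 / 3.0, size=100_000)

# Robert Wadlow (272 cm) sits comfortably inside both priors' bulk:
print(np.mean(expo > 272))  # roughly 0.2
print(np.mean(gam > 272))   # rarer, but far from ruled out
# The gamma gives much less mass to heights under 50 cm:
print(np.mean(gam < 50), np.mean(expo < 50))
```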

A prior on the diffusivity of a microbial toxin in room temperature water is somewhere between 0 and the diffusivity of the oils in an orange peel in air (which I estimate is probably several m^2/s just based on how fast someone at the other end of a room smells an orange peel that I squeeze).

The fraction of things that have some property is logically bounded to be between 0 and 1, so uniform(0,1) covers it if you know nothing else…

The biggest problem is getting past the idea that there really is a *correct* prior shape and you’re trying to get as close as possible to it. There *is* a correct parameter value, and you’re just trying to make sure your high probability region of the prior includes that value.

This is the most helpful explanation I’ve heard, thank you.

That is a really helpful example of the kind of thought that goes into choosing appropriate priors. But do you (or anyone here) know of any papers or tutorials off the top of your head which explain some specific prior choices in detail? In particular I’m thinking about things like choices about different distributions and some sort of (posterior predictive?) checking procedures for prior sensitivity.

There’s a bunch of stuff about maximum entropy diffusely spread around the literature. The basic idea of maximum entropy is to spread your probability out as much as mathematically possible while retaining certain facts. For example, the maximum entropy prior for a positive parameter, conditional on the average for that parameter being x, is exponential(1/x). The maximum entropy prior for a number, conditional on the average being m and the spread around it being s, where the measure of spread is the standard deviation, is normal(m,s). If instead of the standard deviation you use the mean absolute deviation, then it’s laplace(m,s) (two exponentials back to back).

You can derive a maximum entropy distribution for any set of conditions, but it might be necessary to do so numerically if the conditions are fairly complicated.

In practice, knowing a handful of maximum entropy conditions (like, say, a 1-page table you might find in a textbook or on Wikipedia) and then using them in hierarchical mixtures is probably enough for most purposes. So for example, if you say “well, I know I need a prior where the average is close to m and the standard deviation is close to s,” you can do something like normal(m,s) where m is itself a hyper-parameter with prior uniform(m-k, m+k) for some fixed k, and s is, say, exponential(1.0/styp) where styp is some guess at about how big s should be.
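A quick sketch of sampling such a hierarchical mixture, with made-up numbers m = 100, k = 20, styp = 15, to see the marginal prior it implies:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up numbers: mean believed near m = 100 (give or take k = 20),
# spread believed near styp = 15.
m, k, styp = 100.0, 20.0, 15.0
mu = rng.uniform(m - k, m + k, size=100_000)   # hyper-prior on the mean
s = rng.exponential(scale=styp, size=100_000)  # hyper-prior on the sd
theta = rng.normal(mu, s)                      # implied marginal prior

print(theta.mean())  # near 100
print(theta.std())   # wider than styp: hyper-uncertainty broadens the prior
```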

One question you should ask yourself is something like “how much does it matter?” If you do an analysis and the posterior for a parameter is concentrated somewhere within the high probability region of your prior, over a significantly smaller region of the parameter space than the prior included… it probably doesn’t matter what your prior was, the data is controlling the inference. But if your posterior is concentrated outside the range of what the prior considered, or is not very concentrated and looks a lot like your prior… then you should go back and look at the thinking that went into choosing your prior, and look at the thinking that went into choosing your likelihood, and try to figure out what is going on.

My general inclination is to do a few back of the envelope calculations like those above, pick an appropriate maximum entropy prior from a few that I have memorized, or alter them in some way (I think of the gamma prior as a squeezed version of the exponential prior when I know not only an average but some sort of approximate upper or lower bounds). Then I run my analysis, and if I get something shocking (well outside my prior range) I re-examine things. But then, I also tend to try to build extremely informative and mechanistic likelihood models, and they tend to extract lots of yummy information out of the data, and I tend to use maximum entropy priors which are by mathematical construction pretty broad, so usually the posterior concentrates somewhere reasonable compared to my prior. If you work in a data-poor topic and don’t have much distinguishing power in your likelihood you’ll rely a lot more on a realistic prior.

That’s great. Thanks Daniel, you’re a gentleman!

Thanks — this is informative.

So are you saying the quality of your model is agnostic to the prior choice? If so, why do people sometimes put in significant effort to elicit a good prior by surveying experts from the field, etc.?

Because the data they have, and the model they are using (likelihood) aren’t very informative, so their posterior winds up looking a lot like their prior?

If you’ve got decent data and more importantly a likelihood that actually makes fairly specific scientifically based predictions, then you typically find out a lot and the prior matters less. If you’ve got a poorly specified model and badly thought out experiment, or haphazardly collected observational data, there’s not much you can do to make your likelihood inform your posterior.

The part that people tend to get wrong, in my opinion, isn’t the prior, it’s the model of the world that informs their likelihood, and/or the design of the data collection.

Daniel,

Thank you for taking the time to write your response. This makes a lot of sense to me now. I also feel better about the priors we used in our paper. I did a ballpark guess.

Daniel: Thanks for the interesting comments – I will reply tomorrow.

> Your example shows nicely how “prior knowledge” also has to inform the likelihood.

When I was writing this post, I was thinking there was no data and hence no likelihood, but regressing the finalized subjective scores given to the fake cases against the finished index variables definitely would involve a (fake) likelihood. Here the panel’s work has essentially ruled out likelihoods based on linear models.

> Picking priors is in practice rarely all that hard.

It depends: it’s harder if you depend on others for the knowledge, if there are numerous parameters, if those parameters are entangled, or if there will be less informative (per parameter!) data available. For instance, in a meta-analysis there is seldom much informative data available for between-study parameters (e.g. study quality, effect variance), and in epidemiological studies there is seldom any informative data available for biases that have to be addressed.

I think repeatedly working on similar and repeated observations helps – and my sense is that many statisticians (like me) don’t get to do that.

We do need to have default priors, and in many cases those will suffice. Some have written about challenges and problems with MaxEnt priors, though the ones you talk about here do seem sensible to me as they are reasonably informed.

> then you should go back and look at the thinking that went into choosing your prior, and look at the thinking that went into choosing your likelihood, and try to figure out what is going on.

That makes sense, and I don’t think there will be rules for this – one does worry that _publication hunters_ will adjust the prior to get a most publishable posterior :-). I do think one’s comfort/confidence in the posterior should be downgraded when this sort of adjustment was found to be needed…

> “how much does it matter?”

For the study in hand, that is something to keep in mind. I do think in the longer run we want to get better priors (in application areas) if we can, so that one’s comfort/confidence in the posterior can be higher.

> An ABC (Approximate Bayesian Computation) scheme

To me, this is just conditional probability without Bayes Theorem distractions – you have p(theta) and have observed x, so p(theta|x) are the relevant probabilities (how finely these are approximated is, in principle, not a big deal).

> Bayesian model isn’t inference about *what actually happens* (ie. data observed from court outcomes) but rather is a Bayesian model of *what the experts think will happen*

There is a literature on that – a supra-Bayesian who does a Bayesian analysis of others’ priors. I have just started to look into it, but my sense is that an expert consensus process amongst the providers of the priors is considered to trump the supra-Bayesian.

In terms of the joke, sometimes “you simply can’t get there from here” and other times “it’s just down about any reasonable path you may choose”.

> Picking priors is in practice rarely all that hard.

I should have qualified that as “for someone who is going to use a ‘wall of data'”

When data is scarce, and correct likelihoods aren’t very clear, then yes, you need a lot more work on the prior.

“for someone who is going to use a ‘wall of data’” sounds like a very atypical example. Not representative of most practical problems at all.

In most problems I seem to handle data is almost certainly scarce.

There are lots of areas where the data is so thick that you don’t know what to do with it. The biggest problem is understanding how to use it (i.e. likelihoods/models/theories of the world are thin on the ground): bioinformatics, satellite remote sensing, seismology of small events, the American Community Survey microdata, whatever.

So, it depends a lot on where you work.

When it comes to problems where data is very difficult to get, choosing a prior takes the same kind of thinking as before, but more of it. You’re still trying to represent what knowledge you have, but you have to first think about all the knowledge you have and try to figure out what it would take to represent it. If you give me some kind of example problem you work on I could maybe give you some thoughts about how to approach it.

Let me add a slight tweaking, from my experience with remote sensing data to inform ecosystem models: sometimes there’s a wall of *noisy* data where theories are only moderately resolved on the ground, or in the stage of ‘multiple working hypotheses’. Choices of prior can be quite subtle in this context. But overall, I think your exposition is quite helpful.

Chris

Very informative posting and interesting discussion.

It’s sad that pretty much all the Bayesian papers that I get to review don’t deliver a proper justification of the prior at all.

I’m a bit skeptical about Daniel’s implicit message that within certain constraints posed by knowledge/reality the choice of the prior doesn’t matter that much. Ultimately whether it matters or not is decided by whether different priors that fulfill the same constraints and are therefore consistent with the same knowledge deliver the same or at least reasonably similar results after data.

“If you do an analysis and the posterior for a parameter is concentrated somewhere within the high probability region of your prior, over a significantly smaller region of the parameter space than the prior included… it probably doesn’t matter what your prior was, the data is controlling the inference.”

I wonder whether a theorem to this effect can be proved.

“There *is* a correct parameter value”

– in general, I’m not so sure.

> I wonder whether a theorem to this effect can be proved.

Not sure.

Although there does not seem to be any wide consensus, if the prior is understood as being fallible, then something other than pointing to the posterior as the answer is advisable.

One can calculate predictive coverage given that the prior and data model are correct – this will be p% for intervals that have p% posterior probability. If one separates these intervals by the underlying simulated parameter values, the % will be much less in the tails of the prior and more in the center.

One can calculate predictive coverage given that the prior and data model used to form the interval are incorrect, now simulating from the correct but unknown prior – this will be different from p% for intervals that have p% posterior probability. But then it all depends…
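Here is a minimal conjugate sketch of the first calculation (normal prior, one normal observation, all numbers made up): overall coverage comes out at the nominal level, but coverage conditional on the simulated parameter falling in the prior’s tails is much worse, because the posterior shrinks toward the prior center.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 50_000
theta = rng.normal(0.0, 1.0, size=n)  # parameter drawn from the assumed prior
y = rng.normal(theta, 1.0)            # one observation per draw
post_mean = y / 2.0                   # conjugate posterior: Normal(y/2, 1/2)
post_sd = np.sqrt(0.5)
z = 1.2816                            # 80% central interval half-width in sd units
covered = np.abs(theta - post_mean) <= z * post_sd

print(covered.mean())  # close to 0.80: correct prior gives nominal coverage

tails = np.abs(theta) > 2             # draws from the prior's tails
print(covered[tails].mean())          # well below 0.80: shrinkage hurts tail cases
```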

I wonder whether Daniel would agree that there is a unique “correct prior”. This…

“Picking priors is in practice rarely all that hard. I think people have this impression that there’s some “correct answer” and you need to “get close” to that answer. This is a mistaken belief. The only “correct” answer is the one that encodes what little information you really do have.”

…could be interpreted either way, I guess.

If the “correct prior” is not unique, there is no well-defined correct coverage percentage for anything; but if it is unique, picking it would seem rather hard, as opposed to what Daniel claims.

Properly understood in terms of Cox’s theorem, a prior is a quantitative description of information in your knowledge base. So the “correctness” of the prior is entirely about whether or not it does a good job of describing the knowledge in your knowledge base. Being explicit about the notation helps.

The posterior is: p(ParameterValue | KnowledgeIHaveAtHand) * p(Data | ParameterValue, KnowledgeIHaveAtHand) / Z

where Z is a normalizing constant, the first factor is the prior and the second factor is the likelihood. All of it is conditional on your KnowledgeIHaveAtHand and, for someone else who has different knowledge, will be different.

It’s possible to get the “wrong” prior, but only in so far as it doesn’t actually encode things you know.

So for example, when I say “orange skin juices diffuse through the air really fast, much faster than any bacterial toxin in room temperature water… if I squeeze an orange peel, people start to smell it across a 3 meter room within 1 second or so… so the diffusivity of orange peel oils is something like 9 m^2/s, so the diffusivity of my bacterial toxin is less than that… and logically has to be greater than zero”:

P(DiffusivityOfToxin) = normal(0,9)

This does encode the state of information that I was considering.

Now if I say something like, “really the orange skin oils diffuse a LOT faster than any bacterial toxin in water… possibly 2 or 3 orders of magnitude slower for the toxin… I could try to encode THAT into some prior. maybe I do

p(Diffusivity) = lognormal(log(9e-2), log(10)), truncated to the interval (0, 9]

This is also a correct prior in that it encodes the state of information that I used: that the diffusivity was definitely less than 9, and seemed more likely to be on the order of 9/100, but could be 10 or 100 times bigger or smaller without me being too surprised.
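Sampling this truncated lognormal by rejection makes the encoded information easy to eyeball (the only numbers used are the ones above):

```python
import numpy as np

rng = np.random.default_rng(5)

# lognormal centered at 9e-2 with a factor-of-10 log-scale spread,
# truncated (by rejection) to values no larger than 9.
mu, sigma, upper = np.log(9e-2), np.log(10), 9.0
draws = rng.lognormal(mu, sigma, size=200_000)
draws = draws[draws <= upper]

print(np.median(draws))      # near 9e-2, the "orders of magnitude slower" guess
print(draws.max() <= upper)  # True: truncation respected
```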

Sorry, I was distracted – it should say “P(DiffusivityOfToxin) = uniform(0,9)”, not normal(0,9).

> actually encode things you know.

Well, you can’t encode things you don’t know, but you do know you won’t be right about the reality you are trying to deal with.

So the correct prior is something other than what you encoded, and some consideration of how wrong you might be – and of what would repeatedly happen if you were wrong in this or that way – is, I believe, required.

Without that, you fall into omnipotent self-consistency – perhaps even with arguments that your prior should not be checked.

As David Cox recently put it: “The approach placed prime emphasis on the internal consistency of an individual, You, not on connection with the real world or on communication of conclusions to others.” https://academic.oup.com/biomet/article/103/4/747/2659040/Some-pioneers-of-modern-statistical-theory-a

“There *is* a correct parameter value”

– in general, I’m not so sure.

Well, how about “if your likelihood is a good description of the real world process, then there is a correct parameter value.”

Here is a potentially naive question:

Why don’t Bayesian analysts just work with and present the likelihood rather than the posterior, given that 90% of the complaints about Bayesian methods are about the prior? (Who knows, maybe they do — Bayesian methods are rare in my field and I don’t really understand them).

The only reason I can think of is that the likelihood is not a probability distribution. But you can still get the maximum likelihood estimate of the parameter and an estimate of the uncertainty in it from the likelihood, which is what most people are looking for, yes?

Cory:

What you’re suggesting is equivalent to using a flat prior distribution. There are problems where this will work, and there are problems where it will not work. A key advantage of Bayesian inference is the ability to include prior information. Restricting yourself to flat priors is like fighting with one hand tied behind your back.

But, sure, if maximum likelihood solves all your problems, go for it. I published my first statistics paper in 1990. It was for a problem that maximum likelihood could not solve.

In practice, people often get maximum likelihood to work by restricting their model space to low dimension. This is equivalent to putting a very strong prior on all the other dimensions that were excluded from the model to satisfy the goal of not using something called a prior distribution.

Hey Cory, you can’t marginalize out nuisance parameters without a prior. There are some tricks for dealing with nuisance parameters in a likelihood-based framework like profile likelihood and adjusted profile likelihood but (1) those are aimed at getting valid p-values, pfui, and (2) I’ve never seen them applied to really high-dimensional models nor to hierarchical models.

Cory:

Notwithstanding the comments by Andrew and Corey, I think it is a very good question given (my sense that) most Bayesian analysts just work with flat or near-flat priors in most applications.

One of my definitions of frequentist statistics is trying (desperately) to get by without (explicitly) using a prior.

From this definition one could see most Bayesian analysts are trying (desperately) to get by without using much of a prior.

Without using much of a prior, it would arguably be silly to take posterior probabilities literally or as directly relevant in any way, and some measure of success in repeated use is highly advisable.

So what do Bayesian analysts using flat priors gain?

1. The work (especially mathematical) to get intervals with reasonable frequency coverage can be much less (any Bayesian analysis can have its frequency properties evaluated and claimed as frequentist if the frequencies are uniform, and as close-enough frequentist if not too non-uniform).

2. The near-flat priors might reasonably represent the possible unknowns in a field, or a better sense of this is just lacking at present.

(Here it can be a case of: I don’t have an informative prior now, but I want to work this way for when I do get one.)

3. The folks the work is being done for, for some strange reason, believe an uninformed default-like Bayesian analysis is always superior, and they likely do take the posterior probabilities literally :-(
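Point 1 can be illustrated with a small simulation (my own toy setup, not from the thread: a normal mean with known sigma, where the flat-prior 95% posterior interval coincides with the classical confidence interval, so its long-run coverage can be checked directly):

```python
import numpy as np

# With a flat prior on mu and y_i ~ Normal(mu, sigma), the posterior is
# Normal(ybar, sigma / sqrt(n)), so the central 95% credible interval is
# ybar +/- 1.96 * sigma / sqrt(n) -- identical to the classical interval.
rng = np.random.default_rng(1)
sigma, n, true_mu = 1.0, 20, 3.0
z = 1.96  # approximate 97.5% standard-normal quantile

reps, covered = 2000, 0
for _ in range(reps):
    y = rng.normal(true_mu, sigma, n)
    half = z * sigma / np.sqrt(n)
    covered += (y.mean() - half <= true_mu <= y.mean() + half)

coverage = covered / reps  # should land close to 0.95
```

Here the frequency evaluation is trivial by construction; the point is that the same long-run check can be run on any Bayesian procedure, flat prior or not.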

Now, carrying out a Bayesian analysis using a flat prior as a sensitivity analysis could always be a good idea (I would try to always plot various marginal priors and marginal likelihoods to contrast the contributions).

I think 1 and 2 are good reasons but ideally one should strive for good (rather than just convenient) priors in the long run.

(Reality may not be mean, but I want both hands available to partially block the unexpected hooks.)

Thank you for your many helpful points. A few follow-up questions:

> flat or near flat priors

I will reveal my ignorance by not being completely certain what a “flat prior” means. I *think* it means a uniform distribution over some bounds a and b, and to run any program like Stan or PyMC3, you must specify a and b. And even this is equivalent to a strong assertion that a parameter value outside [a,b] is impossible; i.e., the support of p(theta) will be [a,b]. Is there such a thing as a “flat prior” from -inf to inf?

> Without using much of a prior, it would arguably be silly to take posterior probabilities literally or in any way directly relevant

Well, this is already news to me. I think in some ways Bayesians seem to be trying to have things both ways. The advantage that people claim for Bayesian methods is that you get p(theta) instead of p(data|theta) like in frequentist methods. But then non-experts are expected to know that p(theta) isn’t “real” or “relevant”?

I just tried to explain a simple OLS model like Y ~ A + B to a wet-lab biologist yesterday and it was a total failure. Interactions are completely out. How then can we expect them to understand the results of p(theta) on a non-informative prior?

My question was not really about statistical or mathematical reality, it was about communication and interpretability. I understand that including a prior gives you mathematical properties that allow you to do some extra things. But if the result of those “extra things” is a probability distribution that cannot be taken literally, what advantage does it have over frequentist methods?

Perhaps another way of putting this question is: if you do not have grounds for an informative prior, are there still any advantages to Bayesian methods that compensate for their increased complexity and difficulty in interpretation?

Cory:

A flat prior is a uniform density function over the entire range of parameter space. In Stan, if you specify a data model but no prior distribution, the prior is flat (uniform) by default. It is not necessary to bound parameters in Stan. In general we recommend not bounding parameters unless there is some logical constraint (for example, scale parameters being constrained to be positive, or probabilities being constrained to be between 0 and 1). You can assign a flat prior density over the entire real line, but this can lead to an improper posterior distribution. We discuss this in BDA.

Regarding your wet-lab biologist: What can I say? Lots of wet-lab biologists can use sophisticated statistical methods, or they can talk with someone who knows these methods. Some experiments are so clean that they don’t need any statistics at all. When measurements become noisy and the underlying phenomena become highly variable (yes, this does happen in biology!), then there’s more of a payoff to learning more statistics.
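The “flat prior means the posterior is just the normalized likelihood” point can be seen on a toy grid (a numpy sketch of my own, standing in for what a flat-prior model does; the numbers are made up):

```python
import numpy as np

# One observation y ~ Normal(theta, 1), with a flat (unnormalized) prior on
# a wide grid standing in for "the whole real line".
y = 1.3
theta = np.linspace(-50, 50, 20001)
log_lik = -0.5 * (y - theta) ** 2   # log-likelihood up to an additive constant

prior = np.ones_like(theta)          # flat prior: every value weighted equally
post = np.exp(log_lik) * prior
post /= post.sum()                   # normalize over the grid

lik = np.exp(log_lik)
lik /= lik.sum()                     # the likelihood, normalized the same way
# With a flat prior, the posterior IS the normalized likelihood
```

Make the grid's finite bounds too narrow and the “flat prior” silently becomes an informative statement that theta cannot lie outside them, which is exactly the a-and-b worry raised in the question.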

Beyond this, I recommend you read BDA and some of my applied research papers starting in 1990. Short answer is that if your method works for you, that’s great. I’ve used Bayesian methods for decades for problems where the simple methods don’t work so well. To put it another way, there existed no available methods to solve the problems I’ve worked on.

Finally, regarding “difficulty in interpretation,” try getting people to correctly interpret p-values! Not so easy at all.

Thanks, Andrew. I have been working through your “Multilevel/Hierarchical Models” and it has been tremendously helpful to my research. I have BDA but haven’t gone beyond the first few chapters yet.

I hope you don’t take my comments as criticisms of Bayesian methods or as if I’m trying to say frequentist methods are better. That’s not it at all. One of the reasons I frequent this site is because I see potential value in Bayesian methods at large and Stan and things you are doing in particular.

My questions were asked in the spirit of: “Why is Bayesian analysis done the way it is?” and “Assuming I can become proficient in these methods, is there any hope that I will be able to get wet-lab biologists to understand their results?”

I am glad you have found statistically literate non-statistician scientists to work with. I can only say that I haven’t been so lucky. You are absolutely right about p-values. Although I agree the payoff is there for becoming statistically proficient, I also have to work in the world I live in, which requires statistical results to be communicated in a very simple and clear way to people who find anything beyond a t-test to be quite confusing. I’m in bioinformatics and have no formal statistics training, but I’m trying to learn. You would be amazed (or dismayed) to know how many times I’ve been introduced as a “biostatistician” simply because I know how to use OLS.

Cory Giles:

“Is there such a thing as a “flat prior” from -inf to inf?”

Is there such a thing as an applied problem where it’s impossible to put a hard bound on the size of the parameter?

Put another way, is there an applied problem expressed in dimensionless form by a person who understands what and how to create dimensionless forms (in other words, not someone who is choosing the size of a unit specifically to foil my point) where a uniform prior on +- 10^(10^(10^10)) doesn’t include every conceivable value? Note that IEEE floating point only includes values between something like +- 10^(10^2.5) so my interval is REALLY BIG.

The universe is around 93 billion light years across (10^26 meters), if you measured it in atomic nuclear diameters (order of 10^-15 meters) you’d get a number around 10^41 or 10^(10^1.6)

“But then non-experts are expected to know that p(theta) isn’t “real” or “relevant”?”

Everyone should know that entering a number and then pushing a button at random on a scientific calculator doesn’t make the output number a real or relevant scientific discovery even though the calculator has the word “scientific” written right there on it.

If you don’t think about what or why you’re doing something you shouldn’t expect the result to be meaningful.

That being said, there are lots of problems where the likelihood so dominates the posterior that it doesn’t matter much what you choose as a prior. For example, if I use as my prior for your height in cm an exponential distribution with mean 170 (density proportional to exp(-height/170)), as described elsewhere above, and then I take one measurement with a typical measuring tape, I’ll know your height to within about 2 cm even if I’m sloppy as hell.
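A quick grid computation (my own sketch; the measurement value, the 1 cm tape sd, and the grid bounds are all made up for illustration) shows how little that vague prior matters:

```python
import numpy as np

# Prior: exponential with mean 170 cm (density proportional to exp(-h/170)) --
# an extremely vague statement about a person's height.
# Data: one tape measurement of 178 cm with sd 1 cm.
h = np.linspace(100.0, 250.0, 150001)            # height grid, 0.001 cm steps
prior = np.exp(-h / 170.0)                       # unnormalized exponential prior
measurement, sd = 178.0, 1.0
lik = np.exp(-0.5 * ((measurement - h) / sd) ** 2)

post = prior * lik
post /= post.sum()                               # normalize over the grid

post_mean = (h * post).sum()
# The prior pulls the posterior mean off the measurement by only sd^2/170,
# about 0.006 cm here -- the likelihood completely dominates.
```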

“if you do not have grounds for an informative prior, are there still any advantages to Bayesian methods that compensate for their increased complexity and difficulty in interpretation?”

Bayesian methods are

1) Easy and consistently interpretable (as opposed to “increased difficulty of interpretation”). Specifically, deriving Bayesian probability from Cox’s theorem shows us that Bayesian probabilities are the unique way to generalize “true vs false” to a number involving a degree of credibility. So the Bayesian posterior’s interpretation is always “if you are willing to accept the inputs as good approximations of your understanding about the values of numerical quantities, then the output probabilities are your new approximations to your understanding about those values”

Note the “if”; this is the same “if” as in “if you are willing to accept the definition of a mammal as an animal that grows hair, then the platypus is a mammal.”

2) Increased complexity is exactly what is needed in a situation in which what is going on is in fact a complex thing. The biggest problem with Frequentist methods is often that the models just don’t correspond to anything meaningful. (for example, “testing the hypothesis that the cholesterol reduction values of a drug are normally distributed random values with exactly zero difference in mean between the placebo and the drug”)

3) I take the position that every applied problem I’ve ever heard of could have the above enormously broad prior attached to it, and therefore every “likelihood” based “Frequentist” analysis is in fact just a particular kind of Bayesian analysis. It’s NHST and p values that are “true” frequentist methods.

4) if you actually think for a few minutes and put any valid information at all into your choice of prior you will come up with an interpretable Bayesian posterior. The biggest problem is just that people don’t know what the interpretation is, in the same way as lots of people don’t really understand what a logarithm is. But a logarithm is a perfectly understandable thing.

Consider the alternative to a Bayesian analysis which outputs something along the lines of “The Grobnitz transformed ranked stochastic blarf test fails to reject the hypothesis that the data arise from a …. but the simpler L test does reject the hypothesis p=0.0435 … yet the foo-bar-baz-quux test gives exactly p=0.05…. since we rejected the hypothesis that XXYYZZY earlier, if we assume that Q is equal to its maximum likelihood value then the adjusted GROBNIZ test fails to reject the hypothesis that Z is less than 4 units…”

People who see such things should be asking questions like: “Why did they choose the Grobnitz test, and is its result more or less relevant than the simpler L test? Why don’t all the tests give consistent values? What assumptions underlie the tests, and are those assumptions even approximately true about my scientific process? When we reject XXYYZZY, why is it that we assume Q is exactly equal to its maximum likelihood value? What if it was a little bit different? How much different could it be? How would that affect the GROBNIZ test? If we conducted these tests in a different order, would we get the same or approximately the same answer? If we reject the idea that Z is less than 4 units, what does it mean for my decision making about whether to buy an expensive doctor’s treatment not covered by my insurance, given that I have so much money and am a certain age and have been diagnosed with disease D?”

Bayes has the answer to all of this sort of stuff:

1) It forces us to make the assumptions very clear (Stan is a language for specifying assumptions), and they are assumptions about what numerical quantities seem reasonable to us, and we get to assign those degrees of reasonableness based on whatever information we know. Once we’ve done that initial assignment (heavy lifting outside Bayes required), there is a unique answer to how we should interpret the world under those assumptions (the posterior distribution; Stan gives us samples from it).

2) All the inference happens simultaneously among all the unknowns. The answer is the unique answer which, given the assumptions, matches Boolean algebra in the limit as we approach perfect certainty. There is no dependence on the order in which we carry out inferences or on “special values” such as those picked out by maximization or minimization.

3) Post-inference, making the choice of what to do using Bayesian decision theory guarantees that you are using information about your desires and your knowledge in a consistent way that incorporates your information about all of the possible outcomes. (more heavy lifting outside Bayes required, specifically to quantify the goodness or badness of different outcomes)

The big “problem” then for Bayes is that it requires heavy lifting “outside” of the statistics. In other words, it requires you to think about science, about the way the world works, about what is and is not a “good” outcome (for decision theory), and so forth. It requires you to think. More than that, you have to also have the skill of translating what you think into mathematical language that you can calculate with. Often what is needed are two different people: one who has spent a lot of time thinking about the science (i.e., a good scientist) and one who has a lot of experience listening to the words of scientists and converting them into applied mathematics and computer code (this is the Bayesian statistician, aka mathematical modeler).

The biggest objection to Bayes is that it doesn’t provide a canned set of buttons to push on a calculator. This is also its biggest advantage.

> It forces us to make the assumptions very clear

> The biggest objection to Bayes is that it doesn’t provide a canned set of buttons to push on a calculator

Yes, I believe this is the core of the issue. It is also at the heart of the misunderstandings and complaints about priors. The problems that arise from this are probably more sociological than statistical.

A common concern about Bayesian methods seems to be the fear that “if you allow analysts to bake assumptions into their results, you will get cherry-picked or biased results”. And given that human nature is what it is, and the way we see the “forking paths” problem all the time even with frequentist methods, I believe this is a legitimate concern. If Bayesian statistics became the predominant paradigm, we would likely see people manipulating their priors rather than performing lots of comparisons.

Secondly, I agree that scientifically, making your assumptions clear is the right thing to do. But it also gives grant and paper reviewers a very easy target. Defending a prior would seem to be almost impossible to do (I have never done it) exactly because it is a judgment call and not usually based entirely on data. The “benefit” of frequentist methods in this context is that they too have assumptions, often bad ones, but the assumptions are standardized and therefore harder to manipulate directly. They are generally accepted as a playing field that may not be optimal or even meaningful, but by golly, it’s level.

Finally, going back to my very original post, I think I recall Andrew saying in some post that it would actually be nice if Bayesian analysts would present the likelihood rather than the posterior because it enables meta-analysis. And also, the likelihood is the portion of a Bayesian analysis that is based purely on the data. At the very least it seems to me that the likelihood is equally important to consider and report as is the posterior.

Cory:

You write, “Defending a prior would seem to be almost impossible to do.” I’ve been using prior distributions in published papers for nearly 30 years, and I don’t see it as any more difficult to defend a prior than to defend a data model. I’ve seen a few zillion papers that use linear and logistic regression without offering any defense at all.

There are two problems here:

1) How should we do science? The answer is clear I think. We should think about what is known in the field, integrate our knowledge into a model, make our assumptions clear, and write Bayesian mathematical models to infer the quantities of interest.

2) How should we do academic stuff people call science? Here the question is very similar to “which good looking chimp should I groom so that later I will be able to have sex with powerful chimps within the clan and have relatively more successful chimp babies”. It has unfortunately little to do with (1).

Sorry, my sibling post may have come across as unnecessarily combative. What I am trying to figure out is: how can I practically use Bayesian methods in my research while avoiding the criticisms and misunderstandings that will likely come along with it? How can I, if indeed I can, take advantage of its power while avoiding uninformed complaints from reviewers and blank stares from collaborators? (Obviously these are rhetorical questions).

Cory:

You ask, “how can I practically use Bayesian methods in my research while avoiding the criticisms and misunderstandings that will likely come along with it?” All I can say is that I’ve been doing this for nearly 30 years and nobody seems to complain. Include your Stan code, let people criticize your choices, and use those criticisms to improve your model! Again, there’s nothing special about the prior: all aspects of the data analysis should be open to criticism.

Cory,

I’d add to what Andrew says: You can’t entirely avoid criticisms, misunderstandings, uninformed complaints, and blank stares.

But here are some suggestions for minimizing them and dealing with them:

1. Try to anticipate them — this means being clear in your own mind what you are doing and why.

2. When writing up your results, include explanations of what you are doing and why. If the reviewer or journal says to take these parts out, make them available in an online supplement.

3. Learn from your mistakes: If your reasons for what you are doing don’t convince critics, use their feedback to try to figure out a better way of explaining.

Cory:

I’ll add a few things here.

1. I think you are underestimating the importance of reasons 1 and 2 I gave – especially if the already developed techniques are known not to be adequate and you have to come up with a Giles-transformed ranked stochastic blarf test (at a minimum you will need proofs of first-order asymptotic normality, if not higher order).

2. Re: “likelihood is equally important to consider and report as is the posterior” – I would go further and say the data should be confidentially available to other researchers who agree to maintain any necessary non-disclosure of participants’ data. The likelihood locks you into a data model that may well have been very poorly chosen and kept no matter how much the data complained it was inappropriate.

3. I think it is especially challenging to work in statistics in scientific settings and actually do more good than harm. It is really important to work with thoughtful folks and avoid working with the not-so-thoughtful or _academic gaming_ folks. I have worked with both, mostly regret my work with the latter group, and I was forced out of/left clinical research trying to follow that strategy in the wrong setting.

In one of the last interviews I had, I brought up the trade-off between enabling clinicians to do better research and making them happy. The interviewer thanked me for being candid about this issue and later indicated it helped them decide not to consider me further. In another, I was convinced the group was mostly thoughtful folks, but their funding fell through.

Now, don’t expect thoughtful, very smart folks to understand statistics; it’s your role to enable them to. You might have to work with almost whoever you can until you get yourself in a position to be more selective. Here you can try to minimize the _damage_ they might inflict on science (don’t underestimate their potential to become thoughtful, and practice Martha’s suggestions). There is simply no way to avoid hunting for, and sticking with, a work situation you are comfortable with.

Thanks, Keith, and everyone.

Regarding your point 1, I didn’t ignore your points on this subject, it’s just that I didn’t see them as terribly applicable to my situation. I would be totally incapable of evaluating the properties of a novel statistical test, just as I would at present be incapable of knowing with confidence whether a Bayesian analysis I did was correct. It is for this reason that I only use non-canned methods as a last resort.

I have only invented my own “statistical” test once, under extreme duress, and am far from certain that it is “correct” or has the right properties, although I did my best. The collaborator and reviewers didn’t care, because it returned the results they wanted/expected. Needless to say, telling you about this episode does not fill me with pride.

> I think it is especially challenging to work in statistics in scientific settings and actually do more good than harm.

I know. I often get the impression that the statistician’s purpose is viewed as a pro-forma role: giving the statistical “seal of approval” to whatever story the biologist wants to tell. Unfortunately I have several times had the unpleasant experience of giving a collaborator results that weren’t what they expected and being told some variant of “do it again” (implicitly, until it returns the results they want, although they know enough not to say that aloud). This puts me in a very awkward situation because I am relatively junior. I do try to do the right thing, but the pressure can be large.

> There is simply no way to avoid hunting for and sticking with a work situation you are comfortable with.

Well, I am very interested in applying bioinformatics and statistics to biological problem X. I work at an institution that is in the top 5 studying field X in terms of funding, department size, prestige, etc. They have many important findings regarding X, and it seems to be a good place to study X. There is a wide variety of levels of scientific ethics among the investigators here, but the level of statistical and computational literacy is uniformly low. The most advanced method used by the most statistically sophisticated biologist here is one-way ANOVA. From conferences and papers I get the impression that the situation is not much different at the other top 5 departments, with individual exceptions. I should say in the department’s defense that they are very biologically knowledgeable about X.

I am the closest thing the department currently has to a full-time statistician, which says volumes. At my institute at large there ARE many qualified and intelligent PhD-trained statisticians, but if I go to them for guidance, the response is usually either “I’m too busy” or a resigned and cynical “just use OLS or whatever is simplest, it doesn’t matter what you use, because if they like the results, they will keep them, and if they don’t, they’ll ask you to do it again or find someone else”.

So here I am, trying to do the best I can to get informed and do the right thing, because I really care about problem X and want to make the best of a bad situation.

Cory:

Likely it would be good if you could find someone in the organization who could act as your mentor.

For instance, it’s much better if it’s them who raises possible concerns about what you are being asked to do, rather than you.

Likely better if they’re not a statistician but rather a concerned and respected researcher…

And, if you don’t have a PhD, find a way to get one.

> And, if you don’t have a PhD, find a way to get one.

I do. Although the situation is a bit odd: it is in “molecular biology”, but the actual research and PI was bioinformatics. This explains how I managed to become a bioinformatician without any formal statistics training…

> it’s much better if it’s them who raises possible concerns about what you are being asked to do, rather than you

> Likely better if they’re not a statistician but rather a concerned and respected researcher…

This is good advice and my boss/PI does do this occasionally, and it helps to a point. The limitation is that he knows no statistics either, so sometimes he cannot understand (or I cannot adequately communicate) my concerns. In those cases he tends to side with the collaborator.

What I am still looking for is a statistically literate mentor. I have found one who is knowledgeable and useful at explaining concepts but he is also one of the cynical types I mentioned earlier who has no qualms with cherry-picking in analyses.

Cory,

Do you use Cross Validated (http://stats.stackexchange.com/) or other Stack Exchange forums? This might be a way to get at least some feedback specific to your work and the resistance you find to using good practices.

Just to add my 2p worth, from experiences in a different but related field (I think – I’m in a medical school, but at the opposite end of the spectrum from lab research) … I’d say – don’t get discouraged, you are doing the right things! It may be worth trying to get people to understand what they are doing wrong, and why it matters. Has anything been published on statistical methods in the main journals in the field? If not, that might be an opening – lots has been written in other fields that could be summarised. I guess stressing the positives is the way to go – what can better methodology enable us to do that we can’t do now? (Avoiding errors is probably a large part of the answer.)

It’s not easy – my experiences in a field where lots has been written about statistical issues and Bayesian methods are relatively familiar have not been particularly positive. I suspect there may not be any way to change the scientific practice in a field without upsetting some people – there are some very entrenched attitudes out there, and some people cannot deal with their world view being upset (especially when it is about statistics, which many don’t really want to engage with, even though it is central to their scientific enterprise). (though the hostile statisticians don’t have that excuse!)

May also be a good idea to talk to people from different fields (which I guess you are doing) – I’ve found it really beneficial to discuss issues with people in other disciplines, who face a lot of the same challenges with trying to change practice.

I’ve wondered quite a lot about statistical practices in lab research but haven’t looked into them in any more detail – so your comments are interesting to me. A lot of the clinicians I work with are also involved in lab stuff – so I wouldn’t expect a high level of statistical sophistication. And the results that come out of lab research are often part of the basis for clinical trials – so if they are unreliable, our decisions about which clinical trials to do may not be so good either. Maybe something I should think about more.

+1 to Simon Gates.

Thanks, Martha and Simon.

Martha, yes, I do use a wide variety of resources including people, SE, this blog, books including Andrew’s, etc. I think my statistical ignorance can be remedied given time, and I hope I haven’t given the impression I expect to be spoon-fed. The things you DON’T find in books, that you need other live people for, are not the technical skills, but advice on how to deal with tough scientific-social situations.

More than anything, I thought my tale of woe might be interesting and relevant to readers of this blog, because Andrew is all the time talking about bad analyses and ethical violations. Well, situations like mine are a recipe for those things. Although I do appreciate the encouragement.

Simon, your comment shows how my situation is both an advantage and disadvantage for my position in this field. There is a lot of demand for people like me and opportunities to publish on simple things like bringing post-1950s statistics to this field, and I am never short on datasets to analyze for collaborators. There are definitely people who see the value in it, although it is rare to find someone who understands what I am doing.

It is very rare in this field to even meet someone who understands a linear model interaction — if I do one I just get the proverbial “blank stare” followed by a request to just compare all the pairs of groups like normal people do. So the odds of getting people to understand a Bayesian analysis are low. And, as we all know, something a reviewer does not understand is something you’re going to get dinged on. I don’t think the problem is that they would be “upset” by a method, like Bayes, that they don’t understand, but rather that they would at best ignore the results, or at worst reject the paper on the partially reasonable basis that, if readers won’t understand it, it shouldn’t be published whether it is good or not. Another semi-rational basis would be that if a reviewer can’t understand it, they can’t do their job as a reviewer.

And yes indeed, sometimes this stuff ends up affecting clinical trials and actually affecting real people’s lives. So it is important to get it right.