The idea that you don’t believe that theta is greater than 0 because “it is consistent with noise” really seems to be the “failing to reject the null implies the null” fallacy.

Either you really doubt the prior (okay, that’s the point of this post…) BUT in a way that puts a spike at 0 since I assume you would say the same thing if we observed y = -1 (which I don’t think you like to do given your other posts) OR you really doubt the N(theta, 1) distribution of the data (valid doubt, but not about the prior!).

I get that it’s evidence that maybe a flat prior is saying something…except I would guess an informative prior in this problem would be something like N(0, 10), which would result in a nearly identical answer!
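The “nearly identical” claim can be checked with the standard conjugate normal update. This is only a sketch in R: the comment does not say whether N(0, 10) means a standard deviation of 10 or a variance of 10, so both readings are tried, and the variable names are made up for illustration.

```r
# Model: y ~ N(theta, 1), prior theta ~ N(0, tau^2), observed y = 1.
# Conjugate update: posterior mean = y * tau^2 / (1 + tau^2),
#                   posterior sd   = tau / sqrt(1 + tau^2),
# so P(theta > 0 | y) = pnorm(y * tau / sqrt(1 + tau^2)).
y <- 1

p_flat <- pnorm(y)                          # flat prior: theta | y ~ N(y, 1)

tau_sd10 <- 10                              # assumption: N(0, 10) means sd = 10
p_sd10 <- pnorm(y * tau_sd10 / sqrt(1 + tau_sd10^2))

tau_var10 <- sqrt(10)                       # assumption: N(0, 10) means variance = 10
p_var10 <- pnorm(y * tau_var10 / sqrt(1 + tau_var10^2))

print(round(c(flat = p_flat, sd10 = p_sd10, var10 = p_var10), 3))
```

Under either reading the posterior probability stays within about a percentage point of the flat-prior value, which is consistent with the comment: a prior that wide barely shrinks the estimate.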

]]>No, the above is not a post about p-values, nor is it a post about measurement errors. It is a post about Bayesian inference. I like Bayesian inference a lot—I wrote two books about it!—but certain natural-seeming models can yield posterior probabilities that don’t make sense (in some settings). That’s the subject of this post.

]]>Hmm… this seems to require me to perform a complicated piece of induction on the blog post.

Since you are talking about the statistical interpretation of performing a given measurement, and one could place that interpretation in a number of different (and unstated) contexts – and the interpretation would be correct and uncontroversial in *some* contexts – then, because you wouldn’t ever write about something that *was* correct and uncontroversial, I have to imagine another context under which an interpretation using an uninformative prior would be dangerous and/or controversial.

The sentence “Most of the published studies I’ve seen that have featured statistically significant p-values do not look like this” gives the game away: despite appearances, this isn’t a post about measurement errors, it’s another post in the ongoing series about p-values, or at least about mistakes made by people who habitually use p-values, or who would do so if they could get away with it any more. And the context is measuring alleged effects for which the default prior — one might almost say null hypothesis — is that their value is in fact very close to zero?

OK, so example 3 was of this sort; but example 4 doesn’t look like mutant-frequentism to me. On the face of it, it’s a type S problem.

]]>Annan, in case I was misinterpreting your comment, the high probability manifold is:

W_beta = {x | P(x) > beta}

choose beta so W_beta contains almost all the mass. Say 99% of it.

]]>Let’s suppose the true value is where the data says it is. Take theta_true = 1. Then consider the following statements:

A: “theta_true is in [-100,100]”

or

B: “theta_true is less than -5 or greater than 5”

Statement A is true while B is false. That’s what makes my prior better than yours.

]]>Why is your interpretation of the prior better than mine?

]]>It’s irrelevant what you think. If the true value is in [-100,100] the method and answer look great.

]]>I think the “high probability region” of that prior is the real line *excluding* the interval [-5,5].

Doesn’t look such a good method (or answer) now, does it?

]]>Stringph:

I agree that there are settings where an (approximately) uniform prior distribution makes sense. These settings are those in which the data are much stronger than the prior. Your example of a precise temperature measurement and very weak prior distribution (merely the statement that a measurement was performed in a particular country) is one such example. Most of the published studies I’ve seen that have featured statistically significant p-values do not look like this. To put it another way, if a result such as p<0.05 is considered newsworthy, this already implies (in some sense) a strong prior centered around zero, so that it is considered something of a surprise for the measurement to be far from zero. But, yes, in regard to your comment, there are definitely settings where inferences from the noninformative prior are reasonable. In my post, I was focusing (implicitly) on the more controversial settings.

]]>Oh, given some contexts for typical use of non-informative priors which … I don’t have?

Would you say that my example of measuring the temperature at a mystery location in Canada is one of these typical cases?

If not, what would be a typical case where the inference is wrong?

And isn’t it deeply ironic that the example can, apparently, only be understood if the reader already possesses a lot of contextual information which is not given in the post?

]]>The information to assess whether 5-to-1 is correct or not isn’t given, but given the context in which non-informative priors are used, it’s probably going to be wrong.

The problem comes back to this issue of mixing definitions of probability. The non-informative prior was chosen to reflect a state of knowledge, so we can’t suddenly change the interpretation of the posterior probability as a frequency. For the posterior interpretation to work as a frequency interpretation, the prior has to be calibrated to reflect the base rates of theta values.
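The calibration point can be illustrated with a small simulation. This is a sketch under a made-up assumption: the true effects are drawn from N(0, 0.5^2), a hypothetical base rate chosen only to represent a field where effects are usually small. It then asks how often theta is really positive among the cases where y came out close to 1.

```r
set.seed(1)
n   <- 1e6
tau <- 0.5                               # assumed spread of true effects (hypothetical)

theta <- rnorm(n, mean = 0, sd = tau)    # base rate of theta values
y     <- rnorm(n, mean = theta, sd = 1)  # one noisy observation per theta

near1 <- abs(y - 1) < 0.1                # cases where the observation was close to 1
freq_positive <- mean(theta[near1] > 0)  # how often theta really was positive

claimed <- pnorm(1)                      # flat-prior posterior P(theta > 0 | y = 1)
exact   <- pnorm(tau / sqrt(1 + tau^2))  # exact value under the assumed base rate

print(round(c(claimed = claimed, simulated = freq_positive, exact = exact), 3))
```

With this base rate the long-run frequency comes out at roughly two-thirds rather than 84%. The flat-prior number is only calibrated as a frequency if the thetas really are spread out far more widely than the noise.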

]]>Suppose you wanted to measure the temperature at some time and place and had no prior information except that the place was in Canada. Your observation is 1 deg above zero with 1 deg standard uncertainty; how certain are you that it’s really above zero? I would be fairly certain; the 5-to-1-ish odds ratio seems reasonable to me.

]]>Then the title of the post should be “The Hidden Dangers of Calibration”.

]]>@entsophy I suspect Andrew might be assuming that calibration is a desirable property of a Bayesian model. You might not agree with that goal, but I think that’s where the notion of a correct/incorrect prior is coming from.

]]>Bxg:

Of course it depends on the context. Depending on the scaling of the problem, an effect of 100 could make sense. I try to scale things so that effects are of order of magnitude 1. For example, in logistic regression you’re not going to see an effect of 100, similarly in econ you’re not going to see an elasticity of 100 if you’re working on the log-log scale.

With regard to your last point, I wouldn’t frame this as “second-guessing someone’s prior.” A better way to put it would be that people use conventional models that include much less information than is actually known. Such conventional models include linear regressions etc. as well as uniform prior distributions. If data are strong, you can often do just fine with conventional models. But if data are sparse, it can often make sense to go back and add some real information to your model, in order to better answer your scientific questions.

To put it another way, an analysis based on a conventional model can (sometimes) tell you what’s in the data. But scientific reports typically don’t just report information in data, they also make general claims about the world, and for that it can be a terrible mistake to ignore strong information that is already known.

]]>Konrad:

I’m referring to theta=0 as “pure noise” in the sense that, in this simple example, we can write the model as y = theta + epsilon, where epsilon is an independent error term. Here, theta is the signal and epsilon is the noise. If theta=0, that’s pure noise. I have no deeper meaning than that.

]]>Christos:

Yes, that’s my point. A conventional or purportedly noninformative model can be a useful starting point but we have to be ready to move on if it gives implausible inferences.

]]>The “problem” is of course that your “ignorant” prior assigns huge probability to theta being miles away from zero. Now if you’re talking about the obs being consistent with noise or not, you presumably thought there was a nontrivial probability of a zero (or at least v small) theta after all.

I’m not disagreeing with your example, of course. The issue (as I see it) is the assumption that a uniform (or indeed any other) prior can represent “ignorance”.

]]>> 4. Finally, the simplest example yet, and my new favorite: we assign a flat noninformative prior to a continuous parameter theta. We now observe data, y ~ N(theta,1), and the observation is y=1. This is of course completely consistent with being pure noise, but the posterior probability is 84% that theta>0. I don’t believe that 84%. I think (in general) that it is too high.

Suppose I, in your presence, make an independent uniform choice t from [-100, 100] – we agreed on this (it’s effectively fixed) and – then – observed y = t+1. Would you then feel that 84% posterior that theta > t is “too high”?

Because (and especially if you say no) it sounds as though you want to second guess someone’s prior on the basis of what subsequent questions they ask about the posterior. In practice, fair enough. In theory and philosophy, what a rathole.

]]>Oops, I meant “If the theta parameter could be anywhere from -100 to 100, and you draw from y ~ N(theta,1) and get a 1, it really is a very good bet that theta > 0.”

]]>Yup.

]]>Like a few other people, I’m confused by your example. The math seems clear enough, and we can check it by simulation (I did this in R):

thetasim = runif(n=100000,min=-100,max=100) # instead of an infinite distribution, I’ll use uniform [-100,100]

ysim = rnorm(n=100000,mean=thetasim,sd=1)

Now look at all of the theta for which ysim was near 1; what fraction of these are from theta > 0?

yes1 = round(ysim) == 1

sum(thetasim[yes1] > 0)/sum(yes1)

For a particular set of random draws (the first and only one I’ve done), I got 0.82. If the theta parameter could be anywhere from -100 to 100, and you draw from y ~ N(theta,1), it really is a very good bet that theta > 0.

You obviously know this, so…I guess I don’t get the point of that example, which you say is your new favorite! Perhaps you’re saying that in most real-world circumstances that people use infinite uninformative priors, if they actually see a number that is near zero — anything with an absolute value below 10, maybe below 100 or 1000 — then they should reconsider their prior, because if the parameter value really could be “anything at all” then why is it so small, there’s probably a reason that we could figure out if we tried. Or something like that?
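The simulation in this comment can be made reproducible and compared with the analytic value. The sketch below is a variation, not the commenter's exact code: it fixes a seed (the value 42 is arbitrary), uses more draws, and keeps only draws with y within 0.1 of 1 instead of using round(), so that it conditions more tightly on y = 1.

```r
set.seed(42)
n <- 1e6

thetasim <- runif(n, min = -100, max = 100)   # stand-in for the flat prior
ysim     <- rnorm(n, mean = thetasim, sd = 1) # one observation per theta

near1 <- abs(ysim - 1) < 0.1                  # keep draws with y close to 1
frac_positive <- mean(thetasim[near1] > 0)    # fraction of those with theta > 0

print(c(kept = sum(near1), simulated = frac_positive, analytic = pnorm(1)))
```

Only about a thousand draws survive the filter, so the simulated fraction is noisy, but it should land near pnorm(1), about 0.84. The original round(ysim) == 1 window averages over y between 0.5 and 1.5, which is expected to give roughly 0.83, so the reported 0.82 is in line with that.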

]]>Everyone look, if the diffuse prior leads to a posterior which says:

A: “theta’s in [-1,3]”

while a more informative prior says:

B: “theta’s in [-.1,.1]”

then if theta=0 both statements are correct. The latter is simply more informative. Since B implies A it’s not possible to say the former’s wrong while the latter is right. Why is this so hard for people to understand? I really don’t get it.

]]>Number 4 is a nice example, but quite subtle. If you look at the non-informative prior as the limit of a sequence of increasingly diffuse normal priors centred on zero, then it puts too much prior weight on theta being very far away from zero, to the extent that the slightest evidence of positivity is over-interpreted (likewise negativity).
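The limiting argument can be made concrete by computing P(theta > 0 | y = 1) for a sequence of zero-centred normal priors with growing standard deviation. This is a sketch: the function name `post_prob_positive` and the particular grid of prior standard deviations are chosen here for illustration only.

```r
# y ~ N(theta, 1), prior theta ~ N(0, tau^2).
# Posterior mean = y * tau^2 / (1 + tau^2), posterior sd = tau / sqrt(1 + tau^2),
# hence P(theta > 0 | y) = pnorm(y * tau / sqrt(1 + tau^2)).
post_prob_positive <- function(y, tau) {
  pnorm(y * tau / sqrt(1 + tau^2))
}

taus  <- c(0.25, 0.5, 1, 2, 10, 100)          # increasingly diffuse priors
probs <- post_prob_positive(y = 1, tau = taus)

print(data.frame(prior_sd = taus, p_theta_positive = round(probs, 3)))
print(round(pnorm(1), 3))                     # flat-prior limit
```

The probability rises steadily with the prior standard deviation and approaches pnorm(1) from below, so the flat prior is the most sign-confident member of this family.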

To be devil’s advocate, you might say that this prior is too informative because it assumes the variance is known, which is never true in practice. Any prior on the variance would alleviate this problem. If one used a Jeffreys prior on the mean and variance then the posterior would still be improper after one observation.

Or to be more even-handed, you could say that strong assumptions in one part of the model can bleed into so-called non-informative priors for other parameters, rendering them highly informative.

]]>> If someone misinterprets

I think that is the issue here, how does one _interpret_ posterior probabilities?

Obviously in the context of the appraised credibility of _the_ prior(s) and data model(s) used.

But Andrew seems to be pointing to the frailty of noisy data, even for thought experiment true models, perhaps in a Rubinesque repeated use relevant way? (1984)

(Perhaps something to work on over the weekend.)

]]>Andrew:

Thanks! I will read those.

PS. To clarify, I don’t think your models specifically are problematic; my concern was about using informative priors in subject areas where the priors are not strongly data-linked and hence where large flexibility & disagreement exists in the particular choice of priors.

]]>Bottom line: non-informativeness is in the eyes of the beholder. If there is a formulation of your problem that you are comfortable reasoning about, choose priors that best correspond to your state of knowledge (or ignorance) in that formulation/parameterization. But don’t expect these non-informative priors of yours to map to non-informative “folklore” priors in a different parameterization.

]]>Rahul:

I recommend that you (and others who think my models are “a can of worms”) read my recent AJPS paper with Yair and my forthcoming JRSS paper with Kenny (for details on two particular cases) and BDA (for more general principles).

]]>I love all these examples.

What I think is these are great cases where rich data exists to construct a good data-based prior. The applications of Bayesian reasoning that make me uncomfortable are ones in which researchers pull a fairly subjective prior out of a hat and multiple researchers do not even show much consensus as to what prior is the right prior. I feel that is a can of worms.

I suspect Andrew’s examples of voting, school vouchers, ethnicity etc. are in that category. I may be wrong. Perhaps there are obvious, uncontroversial priors there?

]]>I’m totally in agreement here. I think the crucial difference is the availability of abundant proxy data, enough to construct a credible prior.

]]>That makes sense. I just feel those are pretty niche situations.

e.g. If it were human babies, an n=3 study of this nature is probably silly.

]]>Well it’s perfectly fine if you interpret and use it correctly. It’s saying something like 84% of possible values compatible with the evidence are greater than zero. If someone misinterprets this and does something stupid with that info then that’s on them, not the prior.

]]>Andrew: I’m trying to follow your reasoning here so your appeal to personal incredulity is rather unhelpful. It seems to me that 5:1 odds is a fine description for a pattern that easily occurs by chance – any poker player who is willing to draw towards a straight will back me up on this.

I’m unclear on why you are labeling theta=0 as “pure noise”, but it suggests that you have some concrete examples in your head. Is it perhaps the case that they are of the type where you have strong reason to expect that theta is close to zero (e.g. theta represents some effect that you expect may well be negligible)? Would you make the same claim if theta were, say, a temperature reading?

]]>Frequentists usually ask questions with this pattern: “I don’t believe there is _any_ sense in which there is a distribution over theta; nevertheless, what can I say?” (E.g. confidence intervals, hypothesis tests, etc). Answers to such questions mix notoriously badly with Bayesian approaches.

But _your_ frequentist question has a case where there is a real (and thus acceptable-to-him) distribution over theta, and he asks (by your account) the Bayes-rule-driven question: what is the probability that theta is > 1 conditioned on his data? He’s probably feeling lucky, because that’s not his everyday case (and in his everyday case, he is NOT going to make up a distribution.) But given that he has this distribution, where do he and the Bayesian collide? His distribution might not match your [-100,100] prior, but then two Bayesians might disagree too. I just don’t see what nonsense one can generate vis-a-vis _this_ Frequentist Question.

Sorry for missing your point.

]]>Also typically the parameter estimated with a reference value of 0 has a qualitative difference when it’s below or above the breakpoint. It’s typically some kind of multiplier, so positive values imply one direction and negative values imply another. This sort of type S error can lead to wrong thinking about unknown but hypothesized mechanisms, which then leads to wrong thinking about the structure of the next more complicated model you fit. There’s a kind of risk aversion to interpreting the signs of multiplicative parameters too strongly in the absence of real data, because it can lead you down the wrong path with further models.

]]>I had a case where I had only a few direct observations of what I wanted to measure, but a lot of data of something that could be considered a proxy for what I wanted to measure. So I constructed an informative prior from the proxy data, and combined that with a likelihood from the sparse observational data.

This was an e-commerce application; unfortunately I can’t go into the details.

]]>Right. Whereas in examples typical of Joseph’s interests, there’s not necessarily such a relevant “break point”. Take, for example, the inner diameter of a certain pipe coming out of a machine. It’s a positive number; it should be around 0.500 inches because that’s the nominal specification, but it can vary a fair amount, maybe 0.01 inches, depending on the temperature of the machinery and the wear that it has undergone. In those kinds of situations, if you use N(theta,0.01) and the data is 0.510, you aren’t going to say “geez, there’s no way the mean diameter has an 85% chance of being bigger than 0.5” or something like that, because there isn’t hidden implicit prior data.

In other words, your complaint is relevant in a certain context where everyone already knows implicitly that there’s a special value 0 and your priors really should be taking this into account.

]]>Dan:

Yes. To put it another way, the very fact that zero is being used as a comparison point (with statements such as, “the estimate is only 1 standard error from 0”) typically implies a prior distribution in which zero plays a prominent role.

]]>I’m going to take a different approach to interpreting Andrew’s example #4. I think in the type of problem Andrew works with, first of all the Likelihood D ~ N(theta,1) is itself a very approximate idea of what we know about real data. I mean, where does that fixed 1 come from in the variance? Do we really know the variance is exactly 1 but have almost NO idea what theta is? Come on. In situations like that, you almost always will have some knowledge about theta. And furthermore, in the type of models Andrew usually works with, theta is generally an “effect” type parameter, measuring how much something generally changes when some other data value changes. Whether it’s a causal parameter or not, zero is an obvious place to put a lot of these effect type parameters, since it’s easy to imagine that there are lots of relationships that have zero real effect. For example, the effect size of brand of coffee you drank this morning before coming to perform a psychology experiment about ESP… it’s just often the case that we have a strong bias towards having 0 for effects.

So, Andrew’s more or less saying that in that context, we should often be using prior information to constrain things to be not so subject to noisy random small data sets.

]]>Joseph:

I just don’t believe that P(theta>0|y)=0.84. To put it another way, I don’t think that, if the study were repeated with a huge sample size, there’s an 84% chance the result would go in the same direction. 5:1 odds seem too strong to me, for a pattern that could easily have occurred by chance.
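For reference, the 0.84 and the 5:1 figure both follow directly from the flat-prior posterior, theta | y ~ N(y, 1). A minimal sketch in R, with variable names chosen here for illustration:

```r
y <- 1                                        # observed value

# Under a flat prior the posterior is theta | y ~ N(y, 1)
p_positive <- 1 - pnorm(0, mean = y, sd = 1)  # P(theta > 0 | y)
odds       <- p_positive / (1 - p_positive)   # posterior odds that theta > 0

print(round(c(p_positive = p_positive, odds = odds), 2))
```

This gives a probability of about 0.84 and odds of about 5.3 to 1, which is where the “5:1” shorthand in the thread comes from.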

]]>The 95% credibility interval estimate for theta you get from that posterior (with highly diffuse prior) is going to be something like [-1,3] which includes zero.

So you think the data’s consistent with theta=0 and the posterior thinks the evidence is consistent with theta=0. However, the evidence is also consistent with other values of theta, which the posterior also takes into consideration. It’s a mystery why you think that’s a bad thing.
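The interval quoted above can be reproduced from the flat-prior posterior N(1, 1). A short sketch, assuming the central (equal-tailed) 95% interval is the one meant:

```r
y <- 1                                          # observed value

# Central 95% interval of the posterior theta | y ~ N(y, 1)
ci <- qnorm(c(0.025, 0.975), mean = y, sd = 1)
print(round(ci, 2))

# The interval contains zero even though P(theta > 0 | y) is about 0.84
contains_zero <- ci[1] < 0 && ci[2] > 0
print(contains_zero)
```

The endpoints are about -0.96 and 2.96, so “something like [-1,3]” is a fair rounding, and zero does sit inside the interval.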

]]>http://xianblog.wordpress.com/2013/11/21/hidden-dangers-of-noninformative-priors/

]]>Examples are routine in my field, evolutionary anthropology. A common problem is radio carbon (and other types of) dating. Typically we get a posterior density for radio carbon date. But we also know a lot of other things, like for example that all the dates from the same stratigraphic layer must fall within the same layer. We use strong joint priors on the dates to update the posterior density of each date, with really nice inferential results. See for example Figure 1 in http://www.pnas.org/content/108/21/8611.full

Basically, we use strong priors to combine different types of data about the same thing. And we never have the data we wish to have, just the data that we happen to have. So even a little bit of information in a prior can help a lot. The thing to note about these examples is that the prior is “strong” only in particular regions. It mainly serves to jointly truncate the radio carbon estimates.

]]>If theta=0 (i.e., pure noise), there’s no surprise at all if the estimate is one standard error away from 0. Such a result is completely consistent with noise.

]]>