We interrupt our usual programming of mockery of buffoons to discuss a bit of statistical theory . . .

Continuing from yesterday’s quotation of my 2012 article in Epidemiology:

Like many Bayesians, I have often represented classical confidence intervals as posterior probability intervals and interpreted one-sided p-values as the posterior probability of a positive effect. These are valid conditional on the assumed noninformative prior but typically do not make sense as unconditional probability statements.

The general problem I have with noninformatively-derived Bayesian probabilities is that they tend to be too strong. At first this may sound paradoxical, that a noninformative or weakly informative prior yields posteriors that are too forceful—and let me deepen the paradox by stating that a stronger, more informative prior will tend to yield weaker, more plausible posterior statements.

How can it be that adding prior information weakens the posterior? It has to do with the sort of probability statements we are often interested in making. Here is an example from Gelman and Weakliem (2009). A sociologist examining a publicly available survey discovered a pattern relating attractiveness of parents to the sexes of their children. He found that 56% of the children of the most attractive parents were girls, compared to 48% of the children of the other parents, and the difference was statistically significant at p<0.02. The assessments of attractiveness had been performed many years before these people had children, so the researcher felt he had support for a claim of an underlying biological connection between attractiveness and sex ratio.

The original analysis by Kanazawa (2007) had multiple comparisons issues, and after performing a regression rather than selecting the most significant comparison, we get a p-value closer to 0.2 rather than the stated 0.02. For the purposes of our present discussion, though, in which we are evaluating the connection between p-values and posterior probabilities, it will not matter much which number we use. We shall go with p=0.2 because it seems like a more reasonable analysis given the data.

Let θ be the true (population) difference in sex ratios of attractive and less attractive parents. Then the data under discussion (with a two-sided p-value of 0.2), combined with a uniform prior on θ, yields a 90% posterior probability that θ is positive. Do I believe this? No. Do I even consider this a reasonable data summary? No again. We can derive these No responses in three different ways, first by looking directly at the evidence, second by considering the prior, and third by considering the implications for statistical practice if this sort of probability statement were computed routinely.

First off, a claimed 90% probability that θ>0 seems too strong.
Given that the p-value (adjusted for multiple comparisons) was only 0.2—that is, a result that strong would occur a full 20% of the time just by chance alone, even with no true difference—it seems absurd to assign a 90% belief to the conclusion. I am not prepared to offer 9 to 1 odds on the basis of a pattern someone happened to see that could plausibly have occurred by chance, nor for that matter would I offer 99 to 1 odds based on the original claim of the 2% significance level.
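The arithmetic behind that 90% figure is simple: a two-sided p-value of 0.2 puts the estimate about 1.28 standard errors from zero, and under a flat prior the posterior probability that θ>0 is just the normal CDF at that point. A quick sketch:

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A two-sided p-value of 0.2 means the estimate sits about 1.28 standard
# errors from zero: Phi(1.2816) = 1 - 0.2/2 = 0.90.
z = 1.2816

# Under a flat prior, the posterior for theta is centered at the estimate
# with sd equal to the standard error, so:
prob_positive = Phi(z)  # about 0.90
```

So the "90%" is nothing more than the one-sided p-value read off as a posterior probability.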

Second, the prior uniform distribution on θ seems much too weak. There is a large literature on sex ratios, with factors such as ethnicity, maternal age, and season of birth corresponding to difference in probability of girl birth of less than 0.5 percentage points. It is a priori implausible that sex-ratio differences corresponding to attractiveness are larger than for these other factors. Assigning an informative prior centered on zero shrinks the posterior toward zero, and the resulting posterior probability that θ>0 moves to a more plausible value in the range of 60%, corresponding to the idea that the result is suggestive but not close to convincing.
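Here is a sketch of the normal-normal update behind that shift from 90% to roughly 60%. The observed difference (8 percentage points) and its implied standard error are reconstructed from the post; the specific prior sd, a bit over one percentage point, is my assumption for illustration, not a value stated in the article:

```python
import math

# Reconstructed from the post: an observed difference of 8 percentage
# points whose two-sided p-value is 0.2, implying se = y / 1.2816.
y = 0.08
se = y / 1.2816

def prob_positive(prior_sd):
    """P(theta > 0) after a normal-normal update with prior N(0, prior_sd^2)."""
    post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
    post_mean = post_var * y / se**2
    z = post_mean / math.sqrt(post_var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

flat_ish = prob_positive(100.0)     # effectively flat prior: about 0.90
informative = prob_positive(0.012)  # prior sd ~1.2 percentage points: about 0.60
```

Adding the prior information pulls the posterior mean toward zero faster than it tightens the posterior sd, which is exactly how more information can yield the weaker, more plausible claim.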

Third, consider what would happen if we routinely interpreted one-sided p-values as posterior probabilities. In that case, an experimental result that is 1 standard error from zero—that is, exactly what one might expect from chance alone—would imply an 83% posterior probability that the true effect in the population has the same direction as the observed pattern in the data at hand. It does not make sense to me to claim 83% certainty—5 to 1 odds—based on data that not only could occur by chance but in fact represent an expected level of discrepancy. This system-level analysis accords with my criticism of the flat prior: as Greenland and Poole note in their article, the effects being studied in epidemiology typically range from -1 to 1 on the logit scale, hence analyses assuming broader priors will systematically overstate the probabilities of very large effects and will overstate the probability that an estimate from a small sample will agree in sign with the corresponding population quantity.
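For the record, an estimate exactly 1 standard error from zero gives Φ(1) ≈ 0.84 under a flat prior; the 83% in the text is the same quantity expressed as (rounded) 5-to-1 odds. A one-line check:

```python
import math

# Flat-prior posterior probability that the true effect shares the
# observed sign, when the estimate is exactly 1 se from zero: Phi(1).
phi_1 = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))  # about 0.84, roughly 5:1 odds
```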

Rather than relying on noninformative priors, I prefer the suggestion of Greenland and Poole to bound posterior probabilities using real prior information.

OK, I did discuss some buffoonish research here. But, look, no mockery! I was using the silly stuff as a lever to better understand some statistical principles. And that’s ok.

“Then the data under discussion (with a two-sided p-value of 0.2), combined with a uniform prior on θ, yields a 90% posterior probability that θ is positive. Do I believe this? No.”

What exactly would it mean to “believe” this? Are you referring to a “true unknown” posterior probability with which you compare the computed one? How would the “true” one be defined?

Later there’s this:

“I am not prepared to offer 9 to 1 odds on the basis of a pattern someone happened to see that could plausibly have occurred by chance, nor for that matter would I offer 99 to 1 odds based on the original claim of the 2% significance level.”

…which kind of suggests that “I don’t believe it” means “it doesn’t agree with my subjective probability” – but knowing you a bit I’m pretty sure that’s not what you meant before. But what is it then?

Christian:

I wouldn’t bet on it at 9:1 odds, and I don’t think this event would occur 90% of the time in the long run under repeated trials. I don’t think this probability is calibrated; I think that this claimed 90% statement would actually occur less than 90% of the time.

Andrew: But you’re discussing “a claimed 90% probability that θ>0”, so you interpret “occur less than 90% of the time” over experiments that generate a θ? What would these experiments be in this case? I’d assume you don’t mean, as Anonymous suggested, that this is just about counting admissible thetas?

Christian:

As I said, I can give two answers. First, I wouldn’t bet 9-to-1 on it. Second, I think such claims are uncalibrated, and in many instances of this sort of claim, the event would happen less than 90% of the time.
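The calibration claim can be checked by simulation under an assumed world where true effects are small relative to the sampling noise. The specific sds below are hypothetical, chosen to mimic the sex-ratio setting, not taken from the post: among flat-prior claims of "90% probability that θ>0", the sign comes out positive far less than 90% of the time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical world: true effects are tiny (sd 0.005) compared with
# the sampling noise (se 0.06), roughly the sex-ratio situation.
theta = rng.normal(0.0, 0.005, size=1_000_000)
se = 0.06
y = theta + rng.normal(0.0, se, size=theta.size)

# Flat-prior posterior P(theta > 0 | y) is Phi(y / se); claims near 90%
# correspond to y / se between 1.175 and 1.405 (Phi of 0.88 and 0.92).
z = y / se
near_90 = (z > 1.175) & (z < 1.405)

# How often is theta actually positive when the flat prior claims ~90%?
freq = (theta[near_90] > 0).mean()  # in this world, only slightly above 50%
```

In this assumed world the "90%" claims are badly uncalibrated, which is the second of the two answers above.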

See chapter 1 of BDA for more on how I view probabilities as theoretically defined and empirically measured.

That whole paragraph is a disaster. The calculation shows “certain information strongly points to theta being greater than zero”. This is absolutely believable because that information does indeed so point. According to Gelman, if theta=0 there’s a 20% chance the result could have happened accidentally. So what? There’s a 40% chance that if theta=0+epsilon then we’d see the result by chance. So if Gelman’s pseudo-frequentist logic supports the claim that theta=0 then it definitely supports the claim that theta is greater than zero.

The real logic driving this kind of inference works like this. Let’s examine all reasonable thetas that could reasonably have produced the data. How many of them are greater than 0? Well, if the only evidence you have about theta is this data, then about 90% of them are. There’s nothing to disbelieve here. That evidence supports that conclusion to that degree.

If you have additional evidence which says roughly “the only reasonable possibilities for theta are near zero”, then in that case the answer to the question “how many reasonable thetas could reasonably have produced the data” is about 60% of them.

It would have been far better to bypass the frequentist quagmire completely and say something like the following:

“my first calculation involving an uninformative prior doesn’t jibe with my intuition because I’m intuitively using additional information that was never put into the equations. If I go ahead and put that additional information in, my intuition and calculations coincide.”

> Let’s examine all reasonable thetas that could reasonably have produced the data. How many of them are greater than 0? Well, if the only evidence you have about theta is this data, then about 90% of them are.

Huh? Infinitely many thetas could have produced the data, reasonably or not. 90% of infinity is still infinity. I don’t think you’re writing what you mean to say.

You could rewrite that “90%” stuff in the language of measure theory if you really want to.

I could, but given the post I was responding to I don’t really want to.

Given that you could, why are you quibbling over technicalities?

Even better, nonstandard analysis makes it pretty much fine as is.

We’re speaking English here, so we should say things that make sense in English, not things that could in principle make sense when transformed by some vaguely invoked mathematics.

Maybe this makes my point simpler. Suppose the evidence shows the following:

(A) If theta = 0 there is a 20% chance of accidentally having seen this data.

(B) There are 1,000,000,000 previously plausible values of theta greater than zero, each of which gives a 30% chance of having seen this data accidentally.

And those are the only possibilities for theta. Would anyone in this scenario be willing to use Gelman’s frequentist logic to cast doubt on the claim that there’s strong evidence theta is greater than zero?
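Taking this discrete scenario at face value, Bayes’ rule with a uniform prior over the 1,000,000,001 candidate values makes the point directly:

```python
# The hypothetical scenario above: one candidate value theta = 0 that gives
# the data a 20% chance, and a billion positive candidates that each give
# it a 30% chance, all equally likely a priori.
n_positive = 1_000_000_000
p_data_given_zero = 0.2
p_data_given_positive = 0.3

numer = n_positive * p_data_given_positive
post_positive = numer / (numer + p_data_given_zero)  # indistinguishable from 1
```

Of course, whether those billion positive values deserve equal prior weight is exactly what the informative-prior argument disputes.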

Sorry, not any simpler – if anything it’s worse. Stating the number of plausible values of theta is still not helpful – there are infinitely many of them – and you’ve now introduced an idea of seeing the data “accidentally”.

I think Andrew’s article is fine – it makes the point clearly, in language I think his audience will understand. It’s very far from a “disaster”.

I explicitly said that one part, not the rest. To be more specific the reason Gelman gives here:

“Given that the p-value (adjusted for multiple comparisons) was only 0.2—that is, a result that strong would occur a full 20% of the time just by chance alone, even with no true difference—it seems absurd to assign a 90% belief to the conclusion.”

is serious nonsense of the kind that no Bayesian in 2015 (or 1915) should be muddying the intellectual waters with. So let me be clear: that 0.2 is NOT the reason there’s a problem. It could be .8 and there could still be strong evidence for theta greater than zero. It could be .001 and there could still be strong evidence that theta=0.

The actual source of the problem is that the non-informative prior includes so many reasonable possibilities for theta greater than zero, which we know from other evidence aren’t going to be true, that it swamps the “counts” in favor of positive theta. Although that’s perfectly valid as far as it goes, it doesn’t square with our intuitive judgment because we mentally include that additional prior info. As Gelman describes, if you do include that info in the prior, it brings the equations back in line with our intuition.

Gelman’s bigger point presumably is that putting more information into the analysis sometimes makes the answer less certain (.6 instead of .9), which is a bit counterintuitive at first, but upon reflection is exactly how it should be. Sometimes more evidence does make you less certain. The takeaway is that we should carefully use whatever relevant info we have in an analysis. I agree with all that 100%.

I find this explanation and argument much more understandable.

Andrew:

In the studies you describe the priors were reasonable and the particular data noisy or crappy in other ways.

Won’t there be other situations where a weak prior leads to more plausible results?

I guess the answer to this is correlated with how much is known in the literature about the parameter/effect/hypothesis? Or is there actually a situation where using prior information would be misleading (compared to a weak prior), even if there is extensive literature on the subject?

The reasoning embedded in

“There is a large literature on sex ratios, with factors such as ethnicity, maternal age, and season of birth corresponding to difference in probability of girl birth of less than 0.5 percentage points. It is a priori implausible that sex-ratio differences corresponding to attractiveness are larger than for these other factors.”

is kind of disconcerting to me at the same time that it seems plausible.

It seems to suggest that effect sizes in any given area should all be expected to lie pretty close to one another—but shouldn’t that in itself be treated as a scientific question in need of its own modeling and independent research to establish, rather than adopted as a default assumption?

‘It seems to suggest that effect sizes in any given area should all be expected to lie pretty close to one another…’

It only suggests that if the previously measured effect sizes in a given area all lie pretty close to one another, it should be expected that the next effect size measured will be close to the previously measured ones. The clustering of effect sizes is not assumed in advance; the ‘a priori’ refers to the next effect size, but the assertion is conditional on (i.e., posterior to) the current body of evidence.

That seems like a reasonable principle; if we’re going to assert that effect sizes are clustered in an area, presumably that should follow from evidence that the Nth effect size discovered is predictable from the first N-1 effect sizes discovered.

Are ‘informative priors’ of this kind ever/often arrived at by formally estimating the distribution relating past effect sizes to one another? That seems itself like a very complicated problem, and one to which you could get very different answers depending on your procedure for estimating the “effect size distribution” in the first place.

I guess my rambling boils down to: is there a set of agreed-upon formal principles for estimating what an informative prior should look like in a given field? The ‘clustering’ reasoning seems plausible but very informal.
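For what it’s worth, the simplest formal version of the “clustering” reasoning would be a random-effects estimate of the between-study variance from past effect sizes. A minimal sketch with made-up numbers (the effects, standard errors, and the crude moment estimator here are all illustrative; the estimator is a simplified cousin of the DerSimonian-Laird approach from meta-analysis):

```python
import statistics

# Hypothetical past effect estimates in some research area, with their
# standard errors, all on a common scale.
effects = [0.004, -0.002, 0.003, 0.001, -0.001]
ses     = [0.002,  0.002, 0.003, 0.002,  0.002]

# Crude moment estimator: between-study variance is the observed variance
# of the estimates minus the average sampling variance, floored at zero.
obs_var  = statistics.pvariance(effects)
mean_se2 = sum(s * s for s in ses) / len(ses)
tau2 = max(obs_var - mean_se2, 0.0)

# A data-derived prior sd for the next effect in this area.
prior_sd = tau2 ** 0.5
```

Whether that counts as a set of agreed-upon principles is another matter; the answer below suggests it does not.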

Formal principles, no. But there are guidelines on how to do it. See e.g. Kynn 1998, and the books by O’Hagan et al and Meyer and Booker.

Ah, yeah – I’m sort of distantly acquainted with the literature on combining expert judgments. That seems sufficient for any given application, carried out while working with a (usually small) group of experts, but it’s harder to see how it could be pragmatic for arriving at reasonable priors for an entire scientific field, unless large-scale field-wide elicitations (or samplings of elicitations, anyway) were to become a standard exercise.

Some of that stuff worries me. It reminds me of the oft mocked design by committee. If the priors chosen by individual experts really differ a lot, does computing an “average” prior really work out?

I instead viewed it as “Bigger effects are easier to find. So if several effects have been found already, it would be surprising to find a new one much bigger.” Finding a new but even smaller effect would be less surprising.

Andrew—

You are framing this as whether to rely on “uninformative priors” but isn’t this really the same issue you raised in your recent post on your objections to the ovulation-voting-impact study? Namely, how to “deal with” freakish, WTF-study research findings?

In any case, I feel irresistibly impelled to make the same response as I did to that post: instead of relying on priors to assess what weight to give to the evidence (that’s confirmation bias, pure & simple), let’s just give the evidence the weight it’s due, update our priors, and get on with our lives.

As you do in order to enable us to get at the statistics or social-science-practice issue, I am assuming (counterfactually, to be sure) that the Kanazawa study is valid (i.e., the observations are of the sort one would make to draw an inference on true size of θ, and were appropriately made and measured).

If so, then the proper, Bayesian question is *not*, “Is the observed effect too far out of line w/ my priors for me to give it consideration?,” but rather “What is the likelihood ratio of the observed effect with respect to my current estimate of θ and its most serious rival?” (There could be multiple serious “rivals”—but I’ll stick w/ one rival hypothesis to make this simpler).

Imagine that Dr. Rubenfield, the author of the numerous studies you are relying on, thinks the true θ for “beautiful parents -> daughters” is a 0.5% differential favoring female over male children. The notoriety-seeking Dr. Chewa, in contrast, believes the difference is 1% favoring female over male children.

Kanazawa’s observed 8% differential is way out of line w/ those two rival estimates of θ. But given the relatively large standard error—by my calculation 6%, based on your represented p-value of 0.2—a result like that isn’t so hard to conjure. As you say, we wouldn’t be shocked to see a result “by chance.”

But for precisely that reason—precisely b/c the measured effect is so imprecise—it adds virtually *no weight* to either side of the balance in our evaluation of the rival hypotheses. An observed effect of 0.08, given an SE = 0.06, is only 1.1 x more consistent with Dr. Chewa’s 0.01 hypothesis than with Dr. Rubenfield’s 0.005. Who cares what your prior probability was – whether it 25:1 in favor of Dr. Rubenfield’s hypothesis or 10^3:1—your revised estimate of θ isn’t going to move to any meaningful extent.
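That 1.1 likelihood ratio checks out under a normal model for the observed effect (the two hypothesized θ values, 0.01 and 0.005, and the observed 0.08 with SE 0.06, are from the comment above):

```python
import math

def likelihood(theta, y=0.08, se=0.06):
    """Normal density (up to a constant) of observing y given theta."""
    return math.exp(-0.5 * ((y - theta) / se) ** 2)

lr = likelihood(0.01) / likelihood(0.005)  # about 1.1
```

With the observation so far from both hypotheses relative to its SE, neither is favored by much, which is the commenter’s point.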

Case closed.

The practical result—there’s nothing to get excited about w/ Kanazawa’s result—is the same whether one uses this “likelihood ratio” assessment or your “too far out of line w/ my priors to take seriously” one.

But I think you understand your approach as counseling that we simply ignore evidence that is “too out of line with our priors.” I just don’t like that, b/c as I have said, it is tantamount to treating *confirmation bias* as a valid craft norm in social science.

In addition, “ignore the evidence” is the wrong remedy for the very real problem that motivates you here: the inferential illiteracy associated with treating any particular research finding as establishing the “true effect.”

An inferentially literate person recognizes that a valid empirical research finding never does anything other than give one more or less reason to credit one hypothesis relative to a rival than one had before. He or she also recognizes that *how much* more reason it supplies depends on how much more consistent the finding is with one hypothesis than with another—a matter that requires attending to the precision of the finding.

Your “ignore this awful WTF study” response to both the ovulation-impact finding and the “beautiful daughters” finding tries to head off the stupidity of treating those studies as if they established the true effect sizes of the phenomena they were assessing (by naively assuming 1:1 prior for the observed effect & by ignoring how measurement precision relates to the LR).

But so long as researchers are statistically illiterate, we’ll never be able to devise reliable schemes for preventing the production of stupid research and stupid reactions to the same.

The only reliable defense against a “WTF study” research culture is to insist that those doing studies be inferentially literate and not just statistical-software “button pushers”.

Dan:

I don’t think we’re in such disagreement. You attribute the following to me: “counseling that we simply ignore evidence that is ‘too out of line with our priors.’” But I don’t think I ever said that! What I wrote above is that,

if you are using a noninformative prior, you should be super careful about taking the resulting posterior probabilities literally. There is a naive view that certain statistical procedures (often involving p-values or noninformative priors) are “conservative” in the sense that, by using such procedures, practitioners are less likely to make wrong or silly claims. But that isn’t the case, at least not in general. The point of the above post is that seemingly-innocuous noninformative priors can result in posterior inferences that are strong and nonsensical. Just as the seemingly-cautious approach of relying on statistical significance can lead to all sorts of mistakes. Nowhere am I saying to ignore data; we just have to be more clueful in our interpretation of data analysis.

Dan:

Yes, I let my own priors on your motivations in addressing the issue misshape my understanding of your point. My bad.

But maybe I can still extract more enlightenment from you on how this ties in to bigger question of “how do data and experiments fit into a scientific research program? — since clearly that’s what this, like a great many of your posts, is focusing our attention on.

As you argue, it is for sure just a mistake to treat the observed effect in any sort of study as akin to a posterior probability. Observed effects (from valid studies) are random variables, or just points on a probability distribution for empirical observations of θ given our measuring capacity; accordingly, we *always* need to assess the relative consistency of any observed effect with the probability distributions of effects associated with competing hypotheses about the “true” θ in order to assess what information, if any, the study gives us about what θ truly is… Got it (I think/hope).

But then I think there still might be an issue we might not see eye to eye on in your selection of Kanazawa from this example.

*If* we can get people to stop making the mistake that you are focusing on, then do you agree we wouldn’t even need to bother to address the question of what sort of priors *anyone* should have in evaluating a study like Kanazawa’s?

If everyone knows that Kanazawa’s finding, like any other, is merely an observation waiting to be connected in the manner described to competing plausible hypotheses about θ, & *not* plausibly interpreted as an estimate of θ, who cares how consistent or inconsistent it is with our existing priors on θ?

If the study is valid, it conveys information *regardless* of what anyone’s priors are. Or differently stated, the information a valid study conveys is prior-neutral: *any reflective consumer of the information can supply his or her own prior and then decide how the information affects his or her assessment of θ.*

The only reason to have a position on whether the finding should be approached with “weakly informed” or “stronger” priors is to try to manage the conclusions that people might reach about what the finding signifies about the true value of θ.

There’s *no need* to manage that in a world in which producers & consumers of empirical evidence are inferentially literate– and indeed, any interest anyone takes in managing that will be giving (someone’s) priors a say in the weight to attach to evidence, a role priors just shouldn’t play in a Bayesian conception of empirical proof, which contemplates adjusting our priors based on weight of valid evidence, not using priors to determine how much weight valid evidence should be afforded.

From the point of view of *doing* science, good statistical hygiene is exhausted by attention to whether methods are valid & reliable and generate a likelihood ratio different from one (valid inference); whenever we try to make specification of priors part of the process for evaluating whether to “take evidence on board,” we risk contaminating the enterprise with confirmation bias.

So what’s wrong w/ this view?

The assumption that studies are valid (or not), and that you can know this. A valid study is unlikely to disagree wildly with prior information, but

Dan:

I appreciate that you keep plugging away at this. I hear ya, and I have some ideas, now I just need the time to think about them.

ANDREW: You say “The point of the above post is that seemingly-innocuous noninformative priors can result in posterior inferences that are strong and nonsensical.” I would certainly agree (p-values say something much weaker); it doesn’t follow that the p-values are non-sensible, of course. In fact, it’s interesting that you use the corrected p-value as a piece of info that makes the uncorrected posterior worrisome.

I was wondering about the event in your claim: “I wouldn’t bet on it at 9:1 odds, and I don’t think this event would occur 90% of the time in the long run under repeated trials.” Is the event the observed event (or inference), or an event in terms of the parameters? Or do you really want to say something rather different, such as: I wouldn’t equate this with the kind of assertion I regard as warranted to .9. Or maybe: to take evidence like this as fairly good warrant for the assertion would not be a practice that is rarely wrong.

Finally, if a new result is drastically at odds with one that is regarded as having been well tested, as when the OPERA researchers reported neutrinos (or whatever they were) traveling faster than the speed of light—an anomaly for the extremely well tested special theory of relativity—wouldn’t scientists wish to keep the new data separate for purposes of analysis (as they did in this case)? Almost everyone had grounds to believe there was an experimental error, but had they combined the new data with the vast evidence for the speed of light, the anomaly would have disappeared, and they wouldn’t have learned about engineering problems that can distort results in using this kind of equipment.