Robert Bell pointed me to this post by Brad De Long on Bayesian statistics, and then I also noticed this from Noah Smith, who wrote:

My impression is that although the Bayesian/Frequentist debate is interesting and intellectually fun, there’s really not much “there” there… despite being so-hip-right-now, Bayesian is not the Statistical Jesus.

I’m happy to see the discussion going in this direction. Twenty-five years ago or so, when I got into this biz, there were some serious anti-Bayesian attitudes floating around in mainstream statistics. Discussions in the journals sometimes devolved into debates of the form, “Bayesians: knaves or fools?”. You’d get all sorts of free-floating skepticism about any prior distribution at all, even while people were accepting without question (and doing theory on) logistic regressions, proportional hazards models, and all sorts of strong, strong models. (In the subfield of survey sampling, various prominent researchers would refuse to model the data *at all* while having no problem treating nominal sampling probabilities as if they were real, despite this sort of evidence to the contrary.) Meanwhile, many of the most prominent Bayesians seemed to spend more time talking about Bayesianism than actually doing statistics, while the more applied Bayesians often didn’t seem very Bayesian at all (see, for example, the book by Box and Tiao from the early 1970s which was still the standard work on the topic for many years after). Those were the dark days, when even to do Bayesian inference (outside of some small set of fenced-in topics such as genetics that had very clear prior distributions) made you suspect in some quarters.

Every once in a while you still see old-school anti-Bayesian rants but not so much anymore, given that they can be so easily refuted with applied examples (see, for example, here, here, and here).

So really, no joke, I think we’ve made a lot of progress as a field. Bayesian methods are not only accepted, they’re thriving to the extent that in many cases the strongest argument against Bayes is that it’s not universally wonderful (a point with which I agree; see yesterday’s discussion).

Another sign of our progress is the direction of much *non*-Bayesian work. As noted above, a lot of old-style Bayesian work didn’t look particularly Bayesian. Nowadays, it’s the opposite: non-Bayesian work in areas such as wavelets, lasso, etc., is full of regularization ideas that are central to Bayes as well. Or, consider work in multiple comparisons, a problem that Bayesians attack using hierarchical models. And non-Bayesians use the false discovery rate, which has many similarities to the Bayesian approach (as has been noted by Efron and others). This really is a change. Back in the old days, classical multiple comparisons was all about experimentwise error rates and complicated p-value accounting. The field really has moved forward, and indeed one reason why I don’t think Bayesian methods are always so necessary is that non-Bayesian methods use similar ideas. You could make similar statements about machine-learning problems such as speech recognition, or (to take an example closer to De Long and Smith’s field of economics) the study of varying treatment effects.
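To make the contrast between experimentwise error control and the false discovery rate a bit more concrete, here is a minimal Python sketch that compares a Bonferroni cutoff with the Benjamini-Hochberg step-up rule. The p-values are made up for illustration and the function names are hypothetical; none of this is taken from the work mentioned above.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the experimentwise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Made-up p-values for ten hypothetical comparisons.
pvals = [0.001, 0.004, 0.012, 0.019, 0.03, 0.2, 0.35, 0.5, 0.7, 0.9]
print(sum(bonferroni(pvals)), "rejections with Bonferroni")
print(sum(benjamini_hochberg(pvals)), "rejections with Benjamini-Hochberg")
```

On these made-up numbers the step-up rule rejects more hypotheses than the Bonferroni cutoff, because it bounds the expected share of false rejections rather than the chance of making any false rejection at all.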

**Take-home message for economists**

One thing I’d like economists to get out of this discussion is: statistical ideas matter. To use Smith’s terminology, there *is* a there there. P-values are not the foundation of all statistics (indeed analysis of p-values can lead people seriously astray). A statistically significant pattern doesn’t always map to the real world in the way that people claim.

Indeed, I’m down on the model of social science in which you try to “prove something” via statistical significance. I prefer the paradigm of exploration and understanding. (See here for an elaboration of this point in the context of a recent controversial example published in an econ journal.)

Here’s another example (also from economics) where the old-style paradigm of each-study-should-stand-on-its-own led to troubles.

A lot of the best statistical methods out there—however labeled—work by combining lots of information and modeling the resulting variation. And these methods are not standing still; there’s a lot of research going on right now on topics such as weakly informative priors and hierarchical models for deep interactions (and corresponding non-Bayesian approaches to regularization).

**The case of weak data**

Smith does get one thing wrong. He writes:

When you have a bit of data, but not much, Frequentist – at least, the classical type of hypothesis testing – basically just throws up its hands and says “We don’t know.” It provides no guidance one way or another as to how to proceed.

If only that were the case! Instead, hypothesis testing typically means that you do what’s necessary to get statistical significance, then you make a very strong claim that might make no sense at all. Statistically significant but stupid. Or, conversely, you slice the data up into little pieces so that no single piece is statistically significant, and then act as if the effect you’re studying is zero. The sad story of conventional hypothesis testing is that it is all too quick to run with a statistically significant result even if it’s coming from noise. In many problems, Bayes is about regularization: it’s about pulling unreasonable, noisy estimates down to something sensible.
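Here is a small simulation sketch of what “running with a statistically significant result even if it’s coming from noise” can look like. The true effect (0.1), the standard error (1.0), and the function name are all assumptions chosen for illustration; the point is only that estimates which pass a |z| > 1.96 filter are, by construction, much larger in magnitude than a small true effect.

```python
import random
import statistics


def significance_filter(true_effect=0.1, se=1.0, n_studies=100_000, seed=1):
    """Simulate noisy unbiased estimates and keep those with |estimate| > 1.96 * se."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    significant = [est for est in estimates if abs(est) > 1.96 * se]
    share_significant = len(significant) / n_studies
    # Average magnitude of the estimates that survive the filter.
    mean_magnitude = statistics.fmean(abs(est) for est in significant)
    # Share of "significant" estimates whose sign is opposite to the true effect.
    wrong_sign = sum(est < 0 for est in significant) / len(significant)
    return share_significant, mean_magnitude, wrong_sign


share, magnitude, wrong_sign = significance_filter()
print(f"share significant: {share:.3f}")
print(f"mean |estimate| among significant results: {magnitude:.2f}")
print(f"share of significant results with the wrong sign: {wrong_sign:.2f}")
```

Under these assumed numbers, every estimate that clears the threshold overstates the true effect by a factor of nearly twenty or more, which is the kind of unreasonable estimate that regularization is meant to pull back.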

Smith elaborates and makes another mistake, writing:

If I have a strong prior, and crappy data, in Bayesian I know exactly what to do; I stick with my priors. In Frequentist, nobody tells me what to do, but what I’ll probably do is weaken my prior based on the fact that I couldn’t find strong support for it.

This isn’t quite right, for three reasons. First, a Bayesian doesn’t need to stick with his or her priors, any more than any scientist needs to stick with his or her model. It’s fine—indeed, recommended—to abandon or alter a model that produces implications that don’t make sense (see my paper with Shalizi for a wordy discussion of this point). Second, the parallelism between “prior” and “data” isn’t quite appropriate. You need a model to link your data to your parameters of interest. It’s a common (and unfortunate) practice in statistics to forget about this model, but of course it could be wrong too. Economists know about this, they do lots of specification checks. Third, if you have weak data and your prior is informative, this does *not* imply that your prior should be weakened! If my prior reading of the literature suggests that a parameter theta should be between -0.3 and +0.3, and then I get some data that are consistent with theta being somewhere between -4 and +12, then, sure, this current dataset does not represent “strong support” for the prior—but that does not mean there’s a problem with the prior, it just means that the prior represents a lot more information than you have at hand.
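As a rough illustration of the third point, here is a minimal sketch of a conjugate normal update. It assumes the prior interval (-0.3, 0.3) corresponds to a normal prior with mean 0 and sd 0.15, and that the data interval (-4, 12) corresponds to an estimate of 4 with standard error 4; both readings, and the function name, are my assumptions rather than anything stated above.

```python
import math


def normal_posterior(prior_mean, prior_sd, est, se):
    """Conjugate normal-normal update with a known standard error."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = 1.0 / se**2
    post_prec = prior_prec + data_prec          # precisions add
    post_mean = (prior_prec * prior_mean + data_prec * est) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)


# Assumed reading: prior 95% interval (-0.3, 0.3) -> sd 0.15;
# data interval (-4, 12) -> estimate 4 with standard error 4.
post_mean, post_sd = normal_posterior(0.0, 0.15, 4.0, 4.0)
print(f"posterior mean {post_mean:.3f}, sd {post_sd:.3f}")
print(f"approx. 95% interval ({post_mean - 2 * post_sd:.2f}, {post_mean + 2 * post_sd:.2f})")
```

With these assumed inputs the posterior is nearly identical to the prior: the weak data neither contradict the prior nor give any reason to widen it.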

I very much respect the idea of data reduction and summarizing the information from any particular study without prejudice, but if I have to make a decision or a scientific inference, then I see no reason to rely on whatever small dataset happens to be in my viewer right now.

In that sense, I think it would be helpful to separate “the information in a dataset” from “one’s best inference after having seen the data.” If people want to give pure data summaries with no prior, that’s fine. But when they jump to making generalizable statements about the world, I don’t see it. That was the problem, for example, with that paper about the sexes of the children of beautiful and ugly parents. No matter how kosher the data summary was (and, actually, in that case the published analysis had problems even as a classical data summary), the punchline of the paper was a generalization about the population—an inference. And, there, yes, the scientific literature on sex ratios was indeed much more informative than one particular survey of 3000 people.

Similarly with Nate Silver’s analysis. Any given poll might be conducted impeccably. Still, there’s a lot more information in the mass of polls than in any single survey. So, to the extent that “Bayesian” is associated with using additional information rather than relying on a single dataset, I see why Nate is happy to associate himself with that label.
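A back-of-the-envelope sketch of why the mass of polls carries more information than any single survey: a precision-weighted average of several polls has a smaller standard error than any one of them. The poll numbers are hypothetical, the function name is mine, and the sketch treats each poll as a simple random sample with no house effects, which real polls are not. It is not a description of Nate Silver’s actual method.

```python
import math


def pool_polls(polls):
    """Precision-weighted average of (share, sample_size) pairs."""
    weights, shares = [], []
    for share, n in polls:
        variance = share * (1 - share) / n      # binomial sampling variance
        weights.append(1.0 / variance)
        shares.append(share)
    total = sum(weights)
    pooled = sum(w * s for w, s in zip(weights, shares)) / total
    return pooled, math.sqrt(1.0 / total)


# Hypothetical polls: (candidate's share, number of respondents).
polls = [(0.52, 800), (0.49, 1000), (0.51, 600), (0.53, 1200), (0.50, 900)]
pooled, pooled_se = pool_polls(polls)
single_se = math.sqrt(0.52 * 0.48 / 800)        # standard error of the first poll alone
print(f"pooled share {pooled:.3f} +/- {pooled_se:.3f}")
print(f"single-poll standard error {single_se:.3f}")
```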

To put it another way: to the non-Bayesian, a Bayesian is someone who pollutes clean data with a subjective prior distribution. But, to the Bayesian, a classical statistician is someone who arbitrarily partitions all the available information into something called “the data” which can be analyzed and something called “prior information” which is off limits. Again, I see that this can be a useful principle for creating data summaries (each polling organization can report its own numbers based on its own data) but it doesn’t make a lot of sense to me if the goal is decision making or scientific inference.

What’s a good example of a real-life study whose big-picture conclusion was drawn substantially differently after reapplying Bayesian methods as opposed to conventional ones? Any mainstream examples where that actually happened?

Even better, somewhere where maybe applying Bayesian analysis would have led to a substantially different (hopefully better!) policy prescription or decision?

Rahul:

Yes, there are many examples. See Bayesian Data Analysis to start with; we have lots of applied examples there.

Because Bayes’ Theorem is a perfectly respectable frequentist result, it has always seemed to me that an approach that emphasizes use of the theorem (i.e., a “Bayesian” approach) must ultimately be consistent with one that uses some other route (e.g., “frequentist”). If that doesn’t happen, there is something wrong with the implementation of one or both approaches.

You could legitimately argue, though, about the mathematical relationship if any between probabilities and beliefs.

If $latex \theta$ is an unobservable parameter and $latex y$ is observable data, then the version of Bayes’s theorem relating the posterior to the prior and likelihood function, $latex p(\theta|y) \propto p(y|\theta) p(\theta)$, is not acceptable to a frequentist because it involves probability densities over parameters, namely the posterior $latex p(\theta|y)$ and prior $latex p(\theta)$.
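For readers who want to see the proportionality in action, here is a minimal grid-approximation sketch: evaluate prior times likelihood on a grid of parameter values and normalize. The binomial data (7 successes in 10 trials), the flat prior, and the function name are hypothetical choices for illustration only.

```python
def grid_posterior(grid, prior, likelihood):
    """Normalize prior(theta) * likelihood(theta) over a grid of theta values."""
    unnorm = [prior(t) * likelihood(t) for t in grid]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical data y: 7 successes in 10 Bernoulli trials with success probability theta.
successes, trials = 7, 10
grid = [i / 200 for i in range(1, 200)]         # theta values strictly inside (0, 1)


def prior(theta):
    return 1.0                                   # flat prior p(theta)


def likelihood(theta):
    return theta**successes * (1 - theta) ** (trials - successes)   # p(y | theta), up to a constant


posterior = grid_posterior(grid, prior, likelihood)
post_mean = sum(t * p for t, p in zip(grid, posterior))
print(f"posterior mean of theta: {post_mean:.3f}")
```

The objects being normalized here are probability distributions over the parameter itself, which is exactly the step the comment above says a frequentist would not accept.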

“If my prior reading of the literature suggests that a parameter theta should be between -0.3 and +0.3, and then I get some data that are consistent with theta being somewhere between -4 and +12, then, sure, this current dataset does not represent “strong support” for the prior—but that does not mean there’s a problem with the prior, it just means that the prior represents a lot more information than you have at hand.”

I’m a little confused by this. If your posterior actually moves to between -4 and 12, doesn’t that weaken support for the prior? Put another way, shouldn’t prior+data –> posterior imply that the derived posterior is your prior in the next study? Shouldn’t really weak data simply mean that you don’t budge the prior very much? Put it another way: suppose you had a fairly strong prior based on a couple of small studies and some strong intuition, but you had an impeccable 2,000,000 observation data set which widened the posterior substantially. Why does this evidence that theta is not so neatly cabined as your prior would have led you to believe not grounds to widen the prior?

Yes, you do want to use your latest posterior as the prior for further data. If the prior’s in the right ballpark, the posterior will have lower variance than the prior. If the prior’s not very strong (high variance) and you have a lot of data, then the posterior will have lower variance than the prior.

Just as a thought experiment, imagine you have a normal(0, 0.15^2) prior (95% interval about (-0.3, 0.3)) and try to imagine a data set where the posterior has a 95% interval of (-4, 12). The required data will have to have a very wide scale to have that much uncertainty and thus be incompatible with the prior.

Jonathan:

When I wrote “some data that are consistent with theta being somewhere between -4 and +12,” I meant that to be a statement about the data, irrespective of the prior. I was not saying that my *posterior* interval was [-4,12].

Thanks. But I guess what’s left unstated is the underlying number of effective observations in the prior. I like to think of the prior as the equivalent of n old observations which are now augmented by m new observations. (This is one dimension of the information content of the prior.) The strength of the prior is not just the variance of theta in the prior, but the number of (effective) observations that underlie it, no? So, taking Bob Carpenter’s point, while you would certainly like the posterior’s variance to shrink, noisy data will always tend to widen it, albeit often insignificantly. It just won’t be much of a problem if there isn’t much of it in terms of the relative weights of m and n.
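The “n old observations plus m new observations” picture can be sketched for the simplest case, a normal model with known observation sd. The numbers (prior sd 0.15, observation sd 4, data mean 4) and the function names are hypothetical, and the equivalence is specific to this model rather than a general property of priors.

```python
def prior_as_pseudo_observations(prior_sd, obs_sd):
    """Number of observations with sd `obs_sd` carrying as much information
    as a normal prior with sd `prior_sd` (known-variance normal model)."""
    return (obs_sd / prior_sd) ** 2


def posterior_mean(prior_mean, n_prior, data_mean, m_new):
    """Posterior mean as a weighted average with weights n_prior and m_new."""
    return (n_prior * prior_mean + m_new * data_mean) / (n_prior + m_new)


# Hypothetical numbers: prior sd 0.15, individual observations with sd 4.
n_prior = prior_as_pseudo_observations(prior_sd=0.15, obs_sd=4.0)
print(f"prior is worth about {n_prior:.0f} observations")
for m_new in (1, 100, 10_000):
    print(m_new, round(posterior_mean(0.0, n_prior, 4.0, m_new), 3))
```

In this particular model the posterior sd is the observation sd divided by the square root of n + m, so extra observations can only narrow it; other models need not behave that way.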

Different Jonathan:

Sounds like you are thinking too narrowly. Consider, for example, a phylogenetics paper I read last week. The problem there was to use two very different kinds of data: genetic data from extant species, and information from dated fossils. The goal was to infer a phylogenetic tree. The method was to use the fossil data to put priors on some of the branching points of the tree. Your paradigm of “n old observations augmented by m new observations” doesn’t fit.

Another example (perhaps less esoteric): In trying to evaluate effectiveness of Tamiflu on children, no clinical trials were available, but there were data collected by a health maintenance system on which children had received Tamiflu and the outcomes, plus other information about the children. But only children that the physicians considered most likely to benefit were given Tamiflu. The researchers interviewed the physicians to try to understand how they (the physicians) decided which children to give Tamiflu. They then used this to construct a prior to use with the data, to give more plausible results than just using the observational data. Again, no new observations to augment old observations; instead, a prior was used to combine additional information (“soft” data, if you will) with the information from the observational data.

Oops — I didn’t mean to contribute anonymously; just forgot to put in the needed information before I clicked Submit.

Martha: The second one sounds like an application of Sander Greenland’s multiple bias analysis.

With just the data, bias parameters are not identified (or almost) – so you are absolutely right there really are no new observations that inform the biases.

Simply plotting posterior/prior is a really easy way to see this, easier than working out an effective sample size.

Thanks Martha. I realize that, but I meant the point metaphorically. Even in your examples, you can draw a rough equivalency between the effect of the prior and some set of hypothetical data that would have moved the frequentist confidence interval (and mean) towards the posterior. That’s all I meant. I find it helpful to actually do that to tell myself just how much effect the prior is having. And to K?’s point below, I don’t need to do it precisely… it just helps me think about what’s going on qualitatively.

@K?: I don’t think the second example is an application of Greenland’s multiple bias analysis — although the authors do reference a paper of Gustafson and Greenland that had exhibited the same phenomenon of getting a smaller credible interval from an informative prior than from a degenerate prior.

Also, I mis-stated the example — I was relying on memory from a talk. In fact, the goal was to evaluate the effectiveness of a flu vaccine; the problem was that physicians had discretion in whether or not to take a throat culture to verify that the presenting symptoms were indeed from flu. The prior modeled information elicited from the physicians on how they decided whether or not to take a culture. If you want to look at it: The paper is Scharfstein et al, On estimation of vaccine efficacy using validation samples with selection bias, Biostatistics (2006), 7, 4, pp. 615–629.

@(different) Jonathan: I’m not convinced that your metaphor is not more misleading than helpful. In both examples, the prior was incorporating real additional data (i.e., information), not “hypothetical data”, but the data were not of the same type as the “main” data. So it’s not old being augmented by new; the information/observations are of an entirely different type. In the second example it’s somewhat similar to incorporating a covariate, but in the first, it’s really just a way of combining two different types of data (genetic data on extant taxa and time data on extinct taxa).

Actually, in the phylogenetic example, the authors used two possible priors (normal and uniform) for the dating of the nodes corresponding to the fossils, and plotted the normal prior together with both induced posteriors of the dating of those nodes, to see the effect of the genetic data on the node dating. It’s an interesting paper; in case you want to take a look, it’s Wheat and Wahlberg, Phylogenetic Insights into the Cambrian Explosion … and the Evolution of Flight in [insects], Syst. Biol. 62(1):93-109, 2013.

For what it’s worth, here’s my own experience with frequentist vs Bayesian, to compare and contrast with Andrew’s:

I got involved in statistics about fifteen years ago, after about thirty years as a pure mathematician. My introduction to Bayesian analysis was by sitting in on a course taught by an astronomer who was an ardent Bayesian. I had already been teaching graduate as well as undergraduate courses in applied statistics. I adopted the view that frequentist and Bayesian perspectives are more complementary than opposed – each has some unsatisfying aspects, but what we are dealing with when using statistics is something that we can’t entirely get our hands on, so being able to approach problems from two different perspectives is better than being restricted to just one perspective. However, my impression at that time was that Bayesians were more likely to be anti-frequentists than frequentists were likely to be anti-Bayesian. This seems to have lessened in the intervening years, however.

Being a mathematician, I am of course against claims of “proving something” with statistics, whether it’s Bayesian or frequentist. I tried to teach my students that it is misleading to say we have “proved” something with statistics. But I can’t entirely go along with Andrew’s paradigm of “exploration and understanding” either. Exploration is good; but claims of understanding are to me pretty much like claims of having proved something. I think that maintaining uncertainty about understanding is important – we try to understand, we may (being human) think we understand, but there is always the possibility of new information that can call our presumed understanding into question. If I have a paradigm for statistics and science, it’s the paradigm of “seeing through a glass darkly.” I (optimistically) think that we can continue to shed light on the world (and that using statistics well is part of that process), but (realistically) don’t think we will ever see “face to face”.

In real world data analysis, having an intuitive grasp of what sounds plausible — good priors — is the most important part.

There are plenty of applications in which you can collect data, run regressions, and then collect more data out-of-sample and test your theory. And, guess what, knowing what to look for helps immensely. When you know what to look for and then use p values you are implicitly using a Bayesian approach. When you don’t know what to look for and find a senseless correlation you are Oded Galor.

anyone else catch the serious error in the last paragraph before “methods” in the archive pdf ?

They only looked at papers in 5 top journals

I don’t know how you assess the quality/prestige/difficulty of publishing in NEJM or Lancet compared to the “avg” journal, but it is a big number

that is, papers published in NEJM are not, by any stretch of the imagination, average; they are way higher quality

but, don’t believe me; go and ask any MD or PhD doing clinical research

the authors sort of wish washy deal with this in the last paragraph of the discussion

if the authors are good at perl and stuff like that, there was an objective test: it must be possible to search for later p values on the same subject….that is, they had 5,000 odd p values, from 2000 to 2010

so, you start with the p values from 2000, and you ask, going forward, how many are disproved..that puts some sort of bound on your data

Bayesianism implies that stereotypes and prejudices deserve respect.

Not quite. Bayesianism implies that were human reasoning not subject to anchoring bias, overconfidence, motivated reasoning, and other cognitive biases, then stereotypes and prejudices would be worthy of respect because they would only form in reaction to actual evidence.

Those links in Andrew’s post contain some classics of their kind. I love the one from Cosma Shalizi titled “Some Bayesian Finger-Puzzle Exercises, or: Often Wrong, Never In Doubt”. I take it that Brad Delong considers this some kind of state-of-the-art criticism of Bayesian Statistics.

see: http://vserver1.cscs.lsa.umich.edu/~crshalizi/weblog/606.html

Example 1 considers a prior consisting of two Gaussians one centered around +1 and one centered around -1. Then he simulates data from the prior and then gets from Bayes Theorem that the posterior estimate of the mean thus wanders from being close to +1 to being close to -1 and back erratically.

This is actually a big problem for Frequentists. Frequentists like Dr. Mayo are wont to point out they have no problem with Bayes Theorem in principle as long as the “prior” is a legitimate Frequentist prior. Well what could possibly be a more legitimate Frequentist prior than one that is actually used to simulate the Data! So Frequentists either need to reject the product rule of probability or explain why they don’t get the result they expect.

From a Bayesian point of view everything is fine. The high probability region of the prior has two branches one located near +1 and one near -1. So the prior is basically saying “x is either near +1 or it’s near -1”. As the data is collected, the posterior is favoring one or the other of those two possibilities depending on what the data indicates. It makes perfect sense.

At least it makes perfect sense if you know that probability distributions are not frequency distributions.

@Entsophy, Dr. Mayo often points out that, in her view, the expression of degrees of belief through the probability calculus (i.e. as priors and posteriors) is unnecessary and unhelpful. That’s not what you wrote.

Also, Cosma Shalizi’s neat little example concludes stating “that the formal [Bayesian] machinery becomes so certain while being so wrong”; you saying that “everything is fine” does not agree with this.

George: Mayo and most Frequentists do not deny the truth of Bayes Theorem. They just think it should only be applied when the priors have a good Frequentist justification. Pretty much all Frequentists think this.

The problem with Shalizi’s example is that from a Frequentist point of view, the prior does make sense. So from a Frequentist viewpoint it IS legitimate to use Bayes Theorem in this example. If it gives an outcome that Frequentists don’t like then that’s a problem for them and they have some explaining to do.

I claim that from a Bayesian understanding of what probability distributions are, the answer makes perfect sense. The Bayesian machinery is basically being asked the question:

“the answer is either +1 or -1, so make the best guess as to which it is based on a given set of data”.

Now that may be a dumb, uninteresting, or malformed question, but the Bayesian machinery doesn’t know that. It’s just trying to do the best it can to answer it. The solution isn’t wrong, it’s just answering a different question than a Frequentist thinks it’s answering.
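To see what the machinery does when asked that question, here is a minimal sketch. It assumes a 50/50 prior on the mean being +1 or -1 and a unit-variance Gaussian likelihood; the thread does not spell out the exact setup, so this is one reading of it and not necessarily what Shalizi implemented. The function name and the simulated data are hypothetical.

```python
import math
import random


def prob_mean_is_plus_one(xs):
    """Posterior P(mu = +1 | xs) for a 50/50 prior on mu in {-1, +1}
    and a unit-variance Gaussian likelihood."""
    # The likelihood ratio N(x | +1, 1) / N(x | -1, 1) equals exp(2 * x),
    # so the posterior log-odds are 2 * sum(xs).
    log_odds = 2.0 * sum(xs)
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)


rng = random.Random(0)
near_plus = [rng.gauss(1.0, 1.0) for _ in range(50)]    # data centred on +1
centred = [rng.gauss(0.0, 1.0) for _ in range(50)]      # data centred on 0
print("data centred on +1:", prob_mean_is_plus_one(near_plus))
for n in (5, 10, 20, 50):
    print("data centred on 0, n =", n, round(prob_mean_is_plus_one(centred[:n]), 3))
```

Because the log-odds are just twice the running sum of the data, data centred on zero turn the posterior into a random walk between the two candidate answers, which is consistent with the erratic back-and-forth described earlier in the thread.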

@Entsophy; Bayes theorem is a simple result in conditional probability. No-one doubts that it’s true.

Where and how it should be applied is a different matter; Dr Mayo and others reject the idea of using Bayes theorem to provide a summary, in the form of a probability distribution, of what’s known about a parameter. They do this even when (somehow) the prior matches the data-generating mechanism. You’re overlooking this.

Shalizi’s examples are explicitly, in his own words, “simple yet pointed examples where Bayesian inference goes wrong”. Your claim that they are somehow a “big problem for Frequentists” doesn’t follow.

Finally, your assumption that Frequentists don’t know what question Bayesian analyses are answering is really unhelpful. There are many good reasons to doubt or question formal Bayesian inference; just accusing people of ignorance overlooks this.

This is pretty simple: Prior + Product_Rule_of_Probability = Posterior

Frequentists insist this is fine if the Prior is a frequency distribution. In this problem the prior is exactly the correct frequency distribution. Therefore a Frequentist should have no problem with the Posterior.

But Shalizi (a Frequentist) has a huge problem with the Posterior and claims it’s giving complete nonsense.

@Entsophy “Frequentists insist this is fine”. No they don’t; see above. Posterior statements of belief are anathema for them. Read Mayo on this stuff; you might not agree with her but she’s unequivocal.

Please also read Shalizi; he is also a good deal more nuanced in his criticism than you give him credit for. If it’s too confusing, go implement his example and see what you get.

Mayo has stated half a dozen times that Bayes Theorem does apply if you are using a legitimate Frequentist distribution for the prior rather than a Bayesian “degree of belief” distribution for the prior.

You’d be very hard pressed to find a Frequentist who thinks different.

George, let me try to clarify again:

I never mentioned anything about “posterior states of belief” and my point had nothing to do with any such thing.

What I’m saying is that if the prior is a legitimate Frequentist distribution, then Bayes theorem should give a posterior that is a good Frequentist distribution as well. Shalizi found that it wasn’t.

I agree, I’m confused by Shalizi’s example. Depending on the data generating model he assumes, he either concludes that the overall average is exactly 0 (true) or that the mixture of two gaussians are centered at -1 and +1 with 50% mixture… Did I miss something?

I take it Shalizi’s problem is that the posterior mean is not the same as the data mean. Of course this happens all the time for Bayesians. If I take some physical measurements of a coin flip, I might say the Prob(H) = .9 for the next flip while the frequency of heads in the data is Freq(H)=.5. The fact that Prob != Freq is no problem for me whatsoever. I consider it a feature of Bayesian Statistics rather than a bug.

But since the prior in Shalizi’s example couldn’t have a more solid Frequentist justification (the data actually is simulated from the prior), a Frequentist has to either:

(A) Reject the Frequentist interpretation of probabilities (accept that it’s perfectly ok if Prob != Freq)

(B) Reject the product rule of probability, or

(C) Claim the result actually does make sense from a Frequentist viewpoint.

I’d love to know which one they choose.

@Entsophy: I don’t think that Shalizi’s example is entirely well explained. I didn’t put work into figuring out exactly what he meant there but certainly if you generate data from a 50/50 mixture of two Gaussians with means -1 and 1, this isn’t the same as assuming that the data came from a *single* Gaussian where the prior says that the *parameter* comes from a Gaussian mixture such as the one that generated the data. If this is the assumption on which Shalizi’s example is based (he isn’t terribly explicit about it), no wonder that he encounters weird results, and the interpretation has to be that this is an example where *wrong Bayesian inference goes wrong*, which should bother neither the Bayesian nor the frequentist much.

Even if I got the setup wrong, the frequentist could still accept the result of Bayes’s theorem without having to claim that the posterior mode (or whatever Shalizi uses there) is a good estimator.

No it’s not well explained (or at least to my mind the Likelihood used would require some significant explanation), but I don’t actually think the posterior is as bogus as everyone seems to assume.

The prior is saying X is either near +1 or -1 and the posterior is trying to pick between those two alternatives. Or to put it another way, the posterior is doing something very different than trying to estimate the frequency distribution of the data.

Christian:

I was thinking along the same lines. For it to be a statistical problem it needs to be about something that happened or could happen in this world. So the generative model needs to be: _a_ unknown was generated from the prior and, given that unknown, data was repeatedly generated (or could be generated) from _a_ data model with that unknown. The priors and data model can be hierarchical and complex but there needs to be _repeatable_ sampling from something constant. (And not like Peter McCullagh’s examples where for n even, x is Normally distributed, n odd x is Cauchy distributed with n random.)

Shalizi’s example (as we seem to both casually see it) may be an interesting math puzzle, but I don’t think it is of any concern for the practice of statistics. This likely applies to a lot of other weird examples theoretical statisticians like to talk about.

Christian: To hold on to your shirt sleeves a bit longer ;-)

> accept the result of Bayes’s theorem

Everyone has to accept Bayes’s theorem provides the posterior (given the prior, data model and data [believed to be relevant]) but what exactly to do with that posterior is still very controversial.

Also folks should not feel they can’t question the prior, data model and what data was believed to be relevant.

nesting means I have to reply here…

K?: I don’t think it has to be that *a* unknown was generated from the prior and then that unknown is used to generate repeated data from *a* data model. But the Bayesian inference problem is not very well stated (he’s skipped a lot of details on what he did, what likelihood is he using? I can’t tell).

Are we trying to determine what the average value generated by this generation mechanism is? That seems to be the case in “part 2”. If so, we converge on it as mean = 0 with near certainty. Shalizi seems to think this is weird. I think it’s because he’s confusing inference for the mean with inference for the distribution of data values.

The Bayesian doesn’t become “dogmatically certain that the data are distributed according to a standard Gaussian with mean 0 and variance 1” he becomes “dogmatically certain that the mean of the data generating distribution is exactly 0” which it is… so yeah I don’t get it.

If we are trying to infer the distribution of data values (as opposed to the distribution of the parameter mu), then we need to have a bayesian model *for the distribution*. For example, we might have a likelihood defined by two gaussians with unknown mean and unknown variance and unknown mixture (5 parameters) then we’d put a prior on those 5 parameters, maybe we’d get lucky and they’d be almost exactly correct:

mu1 = normal(-1,sd=.1), var1 = chisq(1), mu2 = normal(1,sd=.1), var2=chisq(1), p1=1/2 (implies p2=1/2).

I think if we did this, we’d find that sure enough this model converges on reality really fast.

The point is, if you’re trying to infer the mean, you need a model for the mean; if you’re trying to infer the distribution, you need a model for the distribution! The fact that a model for the mean comes to the certain conclusion that the mean = 0 is apparently confusing to Shalizi because he says that the data doesn’t have a distribution that is a delta function about 0. But the mean does, and that’s what he’s inferred.
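To make the mean-versus-distribution distinction concrete, here is a minimal sketch in Python. It is not Shalizi's setup, whose details I can't tell from the post; I'm assuming the data come from an equal mixture of N(-1, 1) and N(+1, 1), and that the "model for the mean" is a conjugate normal model with known sd 1 and a wide N(0, 10^2) prior. All function names are mine.

```python
import math
import random


def simulate_mixture(n, rng, sd=1.0):
    """Draw n points from an equal mixture of N(-1, sd^2) and N(+1, sd^2) (assumed truth)."""
    return [rng.choice((-1.0, 1.0)) + rng.gauss(0.0, sd) for _ in range(n)]


def posterior_for_mean(x, prior_mean=0.0, prior_sd=10.0, noise_sd=1.0):
    """Conjugate normal update for mu under the model x_i ~ N(mu, noise_sd^2)."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = len(x) / noise_sd ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    sample_mean = sum(x) / len(x)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * sample_mean)
    return post_mean, math.sqrt(post_var)


def normal_pdf(v, mean, sd):
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


rng = random.Random(0)
x = simulate_mixture(20_000, rng)
post_mean, post_sd = posterior_for_mean(x)

# Sample variance of the data: the single-Gaussian model fixes this at 1.
sample_mean = sum(x) / len(x)
sample_var = sum((v - sample_mean) ** 2 for v in x) / len(x)

# Average log-likelihood per point under two descriptions of the *data*:
# one Gaussian centred at the posterior mean, versus the two-Gaussian mixture.
ll_single = sum(math.log(normal_pdf(v, post_mean, 1.0)) for v in x) / len(x)
ll_mixture = sum(
    math.log(0.5 * normal_pdf(v, -1.0, 1.0) + 0.5 * normal_pdf(v, 1.0, 1.0))
    for v in x
) / len(x)

print(f"posterior for mu: mean={post_mean:.3f}, sd={post_sd:.4f}")
print(f"sample variance of data: {sample_var:.3f}")
print(f"avg log-lik, single Gaussian: {ll_single:.3f}")
print(f"avg log-lik, two-Gaussian mixture: {ll_mixture:.3f}")
```

Under these assumptions the posterior for mu tightens around 0, which is the right answer for the mean, while the single Gaussian remains a worse description of the data values than the mixture. That is the sense in which the two inference problems need different models.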

I think Shalizi is smarter than this, so I am confused by what his point is. Perhaps he just didn’t quite think it out in this case.

http://delong.typepad.com/sdj/2009/03/cosma-shalizi-takes-me-to-probability-school-or-is-it-philosophy-school.html

That article, inspired by Shalizi’s example, shows better I think what is wrong. The Guildenstern AI has a totally unacceptable likelihood. It’s got a delta function at p = 1/3 and p=2/3. In a situation where it can NEVER be right it is always wrong… uh. great thanks. If instead of delta functions the Guildenstern AI had chosen beta distributions with incredibly tight distributions around 1/3 and 2/3, say 50% mixture of beta(2/3 * N,1/3*N) and beta(1/3 * N, 2/3 *N) with N = 1 million, after a potentially long while Guildenstern would be correct.
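A rough sketch of the beta-mixture version, in Python. The true heads rate of 0.4 and the flip counts are made up for illustration (they are not from De Long's article), and I use expected counts rather than simulated flips so the result is deterministic. A mixture of betas is conjugate to coin-flip data, so the posterior is again a mixture of betas with reweighted components.

```python
import math


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_mixture_posterior(heads, flips, components):
    """Update a mixture-of-betas prior on p after observing `heads` in `flips`.

    components: list of (prior_weight, a, b).
    Returns a list of (posterior_weight, posterior_component_mean).
    """
    tails = flips - heads
    # Log marginal likelihood of each component (binomial coefficient cancels).
    log_w = [
        math.log(w) + log_beta(a + heads, b + tails) - log_beta(a, b)
        for w, a, b in components
    ]
    top = max(log_w)
    norm = sum(math.exp(lw - top) for lw in log_w)
    return [
        (math.exp(lw - top) / norm, (a + heads) / (a + b + flips))
        for lw, (_, a, b) in zip(log_w, components)
    ]


def posterior_mean(heads, flips, components):
    return sum(w * m for w, m in beta_mixture_posterior(heads, flips, components))


N = 1_000_000
prior = [
    (0.5, 2 / 3 * N, 1 / 3 * N),  # tight around p = 2/3
    (0.5, 1 / 3 * N, 2 / 3 * N),  # tight around p = 1/3
]

# Hypothetical true heads rate of 0.4, using expected counts.
early_mean = posterior_mean(400, 1_000, prior)
late_mean = posterior_mean(400_000_000, 1_000_000_000, prior)

print(f"posterior mean after 1e3 flips: {early_mean:.4f}")
print(f"posterior mean after 1e9 flips: {late_mean:.4f}")
```

With few flips the posterior just picks the nearer of the two spikes, much as the delta-function version would. Only once the number of flips is large compared with N does the posterior leave 1/3 and approach the observed rate, which is the "potentially long while".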

Have I learned anything? Bad Bayesian inference is bad?

Daniel:

> you might still be interested in a small set of possible groups (ie. a mixture model).

In my original comment, I included “The priors and data model can be hierarchical [mixture model] and complex” but the prior and data model need to be coupled (not just any prior times any data model).

> DEAD SURE that a Bayesian approach will be very fruitful

My _claim_ is: all that can go wrong is that the _joint_ model (either prior or data model) may not represent reality well enough or the posterior is _mis-processed_ (which can be very subtle).

But the nesting and limitations of blogging are likely causing a fair amount of confusion.

Daniel:

Not that it “has to be that *a* unknown was generated from the prior and then that unknown is used to generate repeated data from *a* data model.”

But that is the only thing I would be interested in (e.g. use as a representation to address an empirical question).

I know very smart people are interested in more general representations than that (e.g. countable and uncountable infinities) but I am choosing not to be interested in them other than as convenient approximate representations.

I dunno, even if you’re not interested in countable and uncountable infinities (which for the most part I’m not) you might still be interested in a small set of possible groups (i.e. a mixture model). To give a simple example:

There are 4 suppliers of concrete that are used in foobar county. A forensic engineer is interested in the stability of foundations in buildings he is inspecting. There are no records available of which foundations were poured by which concrete suppliers. After inspecting 300 buildings there are 900 concrete cylinder crush tests from cylinders cored out of foundations. And there are approximate dates available for the pour dates of these foundations.

Going to the suppliers, the forensic engineer subpoenas the history of concrete cylinder crush tests performed by the suppliers through time. The engineer can now estimate the mean crush strength of concrete poured by each supplier as a function of time.

The goal now is to find out what we can about which foundations of which buildings were poured by which suppliers and what were the average crush strengths of those foundations.

It’s a more complex example of some data generating mechanism that is qualitatively similar to the one Shalizi’s example spits out. But I am pretty much DEAD SURE that a Bayesian approach will be very fruitful in this example.
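As a toy version of one step in that analysis, here is a sketch in Python of the posterior probability that a single cored cylinder came from each of the 4 suppliers. Every number below is hypothetical, and I'm ignoring the time dependence entirely: assume the supplier means are for the relevant pour date, the suppliers share one known sd, and the prior over suppliers is uniform.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def supplier_posterior(strength, supplier_means, supplier_sd, prior=None):
    """Bayes' rule over a finite set of suppliers for one crush test.

    strength: measured crush strength of the cored cylinder.
    supplier_means: mean crush strength per supplier at the pour date (assumed known).
    supplier_sd: common test-to-test sd (assumed known and shared).
    prior: prior probability per supplier; uniform if omitted.
    """
    n = len(supplier_means)
    if prior is None:
        prior = [1.0 / n] * n
    joint = [p * normal_pdf(strength, m, supplier_sd)
             for p, m in zip(prior, supplier_means)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical mean strengths (psi) for the 4 suppliers and a hypothetical test result.
means = [3000.0, 3500.0, 4000.0, 4500.0]
post = supplier_posterior(3900.0, means, supplier_sd=300.0)

for i, p in enumerate(post, start=1):
    print(f"supplier {i}: {p:.3f}")
```

A real analysis would treat the supplier means as uncertain functions of time and pool the several cylinders from each building, but the structure is the same: a small discrete set of groups with a likelihood for each.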

I think there is great convenience in the Bayesian approach in problems where the model can only be partially identified. For instance, an arbitrary change in how time effects are modelled in a panel data model can flip the coefficient sign. Thinking of the assumptions ruling how time dependence is modelled as each being plausible scenarios with a given probability and the models themselves as dimensions of a big joint posterior is an attractive way to fight off ill-posed statistical inference problems.
