My impression is that although the Bayesian/Frequentist debate is interesting and intellectually fun, there’s really not much “there” there… despite being so-hip-right-now, Bayesian is not the Statistical Jesus.
I’m happy to see the discussion going in this direction. Twenty-five years ago or so, when I got into this biz, there were some serious anti-Bayesian attitudes floating around in mainstream statistics. Discussions in the journals sometimes devolved into debates of the form, “Bayesians: knaves or fools?” You’d get all sorts of free-floating skepticism about any prior distribution at all, even while people were accepting without question (and doing theory on) logistic regressions, proportional hazards models, and all sorts of strong strong models. (In the subfield of survey sampling, various prominent researchers would refuse to model the data at all while having no problem treating nominal sampling probabilities as if they were real, despite this sort of evidence to the contrary.) Meanwhile, many of the most prominent Bayesians seemed to spend more time talking about Bayesianism than actually doing statistics, while the more applied Bayesians often didn’t seem very Bayesian at all (see, for example, the book by Box and Tiao from the early 1970s, which was still the standard work on the topic for many years after). Those were the dark days, when even to do Bayesian inference (outside of some small set of fenced-in topics such as genetics that had very clear prior distributions) made you suspect in some quarters.
So really, no joke, I think we’ve made a lot of progress as a field. Bayesian methods are not only accepted, they’re thriving to the extent that in many cases the strongest argument against Bayes is that it’s not universally wonderful (a point with which I agree; see yesterday’s discussion).
Another sign of our progress is the direction of much non-Bayesian work. As noted above, a lot of old-style Bayesian work didn’t look particularly Bayesian. Nowadays, it’s the opposite: non-Bayesian work in areas such as wavelets, lasso, etc., is full of regularization ideas that are central to Bayes as well. Or, consider work in multiple comparisons, a problem that Bayesians attack using hierarchical models. And non-Bayesians use the false discovery rate, which has many similarities to the Bayesian approach (as has been noted by Efron and others). This really is a change. Back in the old days, classical multiple comparisons was all about experimentwise error rates and complicated p-value accounting. The field really has moved forward, and indeed one reason why I don’t think Bayesian methods are always so necessary is that non-Bayesian methods use similar ideas. You could make similar statements about machine-learning problems such as speech recognition, or (to take an example closer to De Long and Smith’s field of economics) the study of varying treatment effects.
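To make the multiple-comparisons contrast concrete, here is a minimal sketch comparing a Bonferroni correction (which controls the experimentwise error rate) with the Benjamini-Hochberg step-up procedure (which controls the false discovery rate). The p-values are made up purely for illustration, and the function names are my own, not from any particular library.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject p-values at or below alpha / m (experimentwise error control)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (false discovery rate control)."""
    m = len(pvals)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis whose rank is at most k_max.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

# Hypothetical p-values from eight comparisons (illustrative only).
pvals = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]

n_bonf = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(f"Bonferroni rejections: {n_bonf}, Benjamini-Hochberg rejections: {n_bh}")
```

With these made-up numbers the false discovery rate procedure rejects more hypotheses than Bonferroni does, because its threshold adapts to how many small p-values there are. That adaptivity is the family resemblance to a hierarchical model, where the amount of shrinkage depends on the full set of estimates.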
Take-home message for economists
One thing I’d like economists to get out of this discussion is: statistical ideas matter. To use Smith’s terminology, there is a there there. P-values are not the foundation of all statistics (indeed analysis of p-values can lead people seriously astray). A statistically significant pattern doesn’t always map to the real world in the way that people claim.
Indeed, I’m down on the model of social science in which you try to “prove something” via statistical significance. I prefer the paradigm of exploration and understanding. (See here for an elaboration of this point in the context of a recent controversial example published in an econ journal.)
Here’s another example (also from economics) where the old-style paradigm of each-study-should-stand-on-its-own led to troubles.
A lot of the best statistical methods out there—however labeled—work by combining lots of information and modeling the resulting variation. And these methods are not standing still; there’s a lot of research going on right now on topics such as weakly informative priors and hierarchical models for deep interactions (and corresponding non-Bayesian approaches to regularization).
The case of weak data
Smith does get one thing wrong. He writes:
When you have a bit of data, but not much, Frequentist – at least, the classical type of hypothesis testing – basically just throws up its hands and says “We don’t know.” It provides no guidance one way or another as to how to proceed.
If only that were the case! Instead, hypothesis testing typically means that you do what’s necessary to get statistical significance, then you make a very strong claim that might make no sense at all. Statistically significant but stupid. Or, conversely, you slice the data up into little pieces so that no single piece is statistically significant, and then act as if the effect you’re studying is zero. The sad story of conventional hypothesis testing is that it is all too quick to run with a statistically significant result even if it’s coming from noise. In many problems, Bayes is about regularization: it’s about pulling unreasonable, noisy estimates down to something sensible.
Smith elaborates and makes another mistake, writing:
If I have a strong prior, and crappy data, in Bayesian I know exactly what to do; I stick with my priors. In Frequentist, nobody tells me what to do, but what I’ll probably do is weaken my prior based on the fact that I couldn’t find strong support for it.
This isn’t quite right, for three reasons. First, a Bayesian doesn’t need to stick with his or her priors, any more than any scientist needs to stick with his or her model. It’s fine (indeed, recommended) to abandon or alter a model that produces implications that don’t make sense (see my paper with Shalizi for a wordy discussion of this point). Second, the parallelism between “prior” and “data” isn’t quite appropriate. You need a model to link your data to your parameters of interest. It’s a common (and unfortunate) practice in statistics to forget about this model, but of course it could be wrong too. Economists know about this; they do lots of specification checks. Third, if you have weak data and your prior is informative, this does not imply that your prior should be weakened! If my prior reading of the literature suggests that a parameter theta should be between -0.3 and +0.3, and then I get some data that are consistent with theta being somewhere between -4 and +12, then, sure, this current dataset does not represent “strong support” for the prior. But that does not mean there’s a problem with the prior; it just means that the prior carries a lot more information than the data at hand.
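As a rough illustration of that third point, here is a minimal sketch under assumptions that are not part of the discussion above: I read both intervals as approximate 95% normal intervals (mean plus or minus two standard deviations), so the prior is centered at 0 with standard deviation 0.15 and the data estimate is 4 with standard error 4. The function name is hypothetical, and the normal-normal model is just the simplest convenient choice.

```python
import math

def normal_posterior(prior_mean, prior_sd, est, se):
    """Conjugate normal-normal update: a precision-weighted average."""
    prior_prec = 1.0 / prior_sd ** 2   # precision = 1 / variance
    data_prec = 1.0 / se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * est)
    return post_mean, math.sqrt(post_var)

# Assumption: each interval is read as mean +/- 2 standard deviations.
prior_mean, prior_sd = 0.0, 0.15   # prior interval (-0.3, +0.3)
est, se = 4.0, 4.0                 # data interval (-4, +12)

post_mean, post_sd = normal_posterior(prior_mean, prior_sd, est, se)
print(f"posterior mean = {post_mean:.4f}, posterior sd = {post_sd:.4f}")
```

Under these assumptions the posterior mean comes out around 0.006 with a standard deviation just under 0.15, which is nearly identical to the prior. The noisy dataset barely moves the inference, and nothing in the calculation says the prior was wrong.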
I very much respect the idea of data reduction and summarizing the information from any particular study without prejudice, but if I have to make a decision or a scientific inference, then I see no reason to rely on whatever small dataset happens to be in my viewer right now.
In that sense, I think it would be helpful to separate “the information in a dataset” from “one’s best inference after having seen the data.” If people want to give pure data summaries with no prior, that’s fine. But when they jump to making generalizable statements about the world, I don’t see it. That was the problem, for example, with that paper about the sexes of the children of beautiful and ugly parents. No matter how kosher the data summary was (and, actually, in that case the published analysis had problems even as a classical data summary), the punchline of the paper was a generalization about the population—an inference. And, there, yes, the scientific literature on sex ratios was indeed much more informative than one particular survey of 3000 people.
Similarly with Nate Silver’s analysis. Any given poll might be conducted impeccably. Still, there’s a lot more information in the mass of polls than in any single survey. So, to the extent that “Bayesian” is associated with using additional information rather than relying on a single dataset, I see why Nate is happy to associate himself with that label.
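The arithmetic behind “more information in the mass of polls” can be sketched with a simple inverse-variance weighted average. This is not Nate Silver’s method, which I am not reproducing here. The poll numbers are made up for illustration, and the sketch assumes independent simple random samples measuring the same quantity, ignoring house effects, nonsampling error, and trends over time.

```python
import math

# Hypothetical polls (made-up numbers): (candidate's share, sample size).
polls = [(0.52, 800), (0.49, 1000), (0.51, 600), (0.50, 1200)]

def poll_se(p, n):
    """Standard error of a proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

def pool(polls):
    """Inverse-variance weighted average of the poll estimates."""
    weights = [1.0 / poll_se(p, n) ** 2 for p, n in polls]
    total = sum(weights)
    mean = sum(w * p for w, (p, _) in zip(weights, polls)) / total
    # The pooled precision is the sum of the individual precisions.
    return mean, math.sqrt(1.0 / total)

pooled_mean, pooled_se = pool(polls)
single_ses = [poll_se(p, n) for p, n in polls]

print(f"smallest single-poll s.e.: {min(single_ses):.4f}")
print(f"pooled estimate: {pooled_mean:.4f} (s.e. {pooled_se:.4f})")
```

The pooled standard error is necessarily smaller than that of any single poll, since the precisions add. A more realistic analysis would model the variation between polls rather than assume it away, which is where hierarchical models come in.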
To put it another way: to the non-Bayesian, a Bayesian is someone who pollutes clean data with a subjective prior distribution. But, to the Bayesian, a classical statistician is someone who arbitrarily partitions all the available information into something called “the data” which can be analyzed and something called “prior information” which is off limits. Again, I see that this can be a useful principle for creating data summaries (each polling organization can report its own numbers based on its own data) but it doesn’t make a lot of sense to me if the goal is decision making or scientific inference.