Bayesian Statistics Then and Now

I happened to recently reread this article of mine from 2010, and I absolutely love it. I don’t think it’s been read by many people—it was published as one of three discussions of an article by Brad Efron in Statistical Science—so I wanted to share it with you again here.

This is the article where I introduce three meta-principles of statistics:

The information principle: the key to a good statistical method is not its underlying philosophy or mathematical reasoning, but rather what information the method allows us to use. Good methods make use of more information.

The methodological attribution problem: the many useful contributions of a good statistical consultant, or collaborator, will often be attributed to the statistician’s methods or philosophy rather than to the artful efforts of the statistician himself or herself. I give as examples Rubin, Efron, and Pearl.

Different applications demand different philosophies, demonstrated with a discussion of the assumptions underlying the so-called false discovery rate.

Also this:

I also think that Efron is doing parametric Bayesian inference a disservice by focusing on a fun little baseball example that he and Morris worked on 35 years ago. If he would look at what is being done now, he would see all the good statistical practice that he naively (I think) attributes to “frequentism.”

As I get older, I too rely on increasingly outdated examples. So I can see how this can happen, but we should be aware of it.

Also this:

I also completely disagree with Efron’s claim that frequentism (whatever that is) is “fundamentally conservative.” One thing that “frequentism” absolutely encourages is for people to use horrible, noisy estimates out of a fear of “bias.”

That’s a point I’ve been banging on a lot recently, the idea that people make all kinds of sacrifices and put their data into all sorts of contortions in order to feel that they are using a rigorous, unbiased procedure.

And:

as discussed by Gelman and Jakulin (2007) [here], Bayesian inference is conservative in that it goes with what is already known, unless the new data force a change. In contrast, unbiased estimates and other unregularized classical procedures are noisy and get jerked around by whatever data happen to come by—not really a conservative thing at all.

Back in 2010 I didn’t know about the garden of forking paths, but since then I’ve seen, over and over, how classical inference based on p-values and confidence intervals (or, equivalently, noninformative priors) leads to all sorts of jumping to conclusions. Power pose, embodied cognition, himmicanes: these are all poster boys for the anticonservatism of classical statistics and the potential of Bayesian inference to bring us to some sort of sanity.
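
To make that contrast concrete, here is a toy sketch of my own (not from the article): a small, noisy study analyzed two ways, once with the raw unbiased estimate and once with a conjugate normal-prior posterior mean. The setup, including the prior, is invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    true_effect = 0.1                # small true effect, arbitrary units
    prior_mean, prior_sd = 0.0, 0.2  # prior: effects near zero, modest spread
    sigma, n = 1.0, 20               # known sampling sd; small, noisy study

    for rep in range(5):
        y = rng.normal(true_effect, sigma, size=n)

        # Classical unbiased estimate: the sample mean.
        y_bar = y.mean()

        # Conjugate normal-normal posterior mean: a precision-weighted
        # compromise between the prior and the data (partial pooling).
        post_prec = 1 / prior_sd**2 + n / sigma**2
        post_mean = (prior_mean / prior_sd**2 + n * y_bar / sigma**2) / post_prec

        print(f"rep {rep}: raw estimate {y_bar:+.3f}, posterior mean {post_mean:+.3f}")

Run this a few times and the raw estimates jump around from replication to replication, while the posterior means stay close to the prior unless the data pull hard; that is the sense in which the regularized estimate, not the unbiased one, is the conservative choice.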

And this comment on “frequentism” more generally:

Of course, frequentism is a big tent and can be interpreted to include all sorts of estimates, up to and including whatever Bayesian thing I happen to be doing this week—to make any estimate “frequentist,” one just needs to do whatever combination of theory and simulation is necessary to get a sense of my method’s performance under repeated sampling. So maybe Efron and I are in agreement in practice, that any method is worth considering if it works, but it might take some work to see if something really does indeed work.

“It might take some work to see if something really does indeed work”: I like that. Clever and also true, and it leaves an opening for statistical theory and, for that matter, frequentism: we want to see if something really does indeed work. Or, to put it more precisely, we want to understand the domains in which a method will work, and also where and how it will fail.
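
As a small illustration of what that work can look like, here is a simulation sketch (again my own toy setup, nothing from the article) that checks the repeated-sampling performance of the raw estimate and the shrinkage estimate from the sketch above, comparing their root mean squared errors:

    import numpy as np

    rng = np.random.default_rng(1)
    true_effect, prior_mean, prior_sd, sigma, n = 0.1, 0.0, 0.2, 1.0, 20
    reps = 10_000

    sq_err_raw, sq_err_shrunk = [], []
    for _ in range(reps):
        # Sampling distribution of the sample mean.
        y_bar = rng.normal(true_effect, sigma / np.sqrt(n))
        post_prec = 1 / prior_sd**2 + n / sigma**2
        post_mean = (prior_mean / prior_sd**2 + n * y_bar / sigma**2) / post_prec
        sq_err_raw.append((y_bar - true_effect) ** 2)
        sq_err_shrunk.append((post_mean - true_effect) ** 2)

    print("RMSE, raw unbiased estimate:", round(float(np.sqrt(np.mean(sq_err_raw))), 4))
    print("RMSE, shrinkage estimate:   ", round(float(np.sqrt(np.mean(sq_err_shrunk))), 4))

When true effects are small relative to the noise, as in this setup, the “biased” shrinkage estimate wins on these purely frequentist grounds; make the true effect large relative to the prior scale and it loses. That is the kind of domain-of-applicability check I have in mind.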

Lots more good stuff in this 4-page article. As the saying goes, read the whole thing.

11 thoughts on “Bayesian Statistics Then and Now”

  1. > Good methods make use of more information.

    How should one practice regression discontinuity designs and other local average treatment effect-estimating designs in light of this ethos?

    • Lauren:

      My tip here would be to control for additional pre-treatment variables in this sort of observational study rather than to discard that information based on the naive belief that the discontinuity design solves all your identification problems. This article I wrote with Zelizer discusses a notorious example where researchers did a really bad RD analysis, in part, I think, because they thought that the discontinuity design gave them the freedom not to worry about all the important concerns in observational studies.
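
      Here is a toy simulation of that point (my own invented setup, not the example from the article with Zelizer): a local linear RD fit with and without a pre-treatment covariate. Both recover the treatment effect at the cutoff, but adjusting for the covariate soaks up outcome variance and gives a less noisy estimate.

        import numpy as np

        rng = np.random.default_rng(2)
        n = 2000
        x = rng.uniform(-1, 1, n)       # running variable, cutoff at 0
        z = rng.normal(size=n)          # pre-treatment covariate (e.g., a lagged outcome)
        treat = (x >= 0).astype(float)
        tau = 0.5                       # true effect at the cutoff
        y = 1.0 * x + 2.0 * z + tau * treat + rng.normal(size=n)

        # Local linear RD: restrict to a window around the cutoff.
        window = np.abs(x) < 0.25
        X_rd = np.column_stack([np.ones(n), x, treat])[window]
        X_adj = np.column_stack([np.ones(n), x, treat, z])[window]
        yw = y[window]

        b_rd, *_ = np.linalg.lstsq(X_rd, yw, rcond=None)
        b_adj, *_ = np.linalg.lstsq(X_adj, yw, rcond=None)

        print("RD estimate, running variable only:       ", round(b_rd[2], 3))
        print("RD estimate, plus pre-treatment covariate:", round(b_adj[2], 3))

      Across repeated simulations both estimates are centered near tau, but the covariate-adjusted one has a much smaller standard error; throwing z away buys no extra credibility.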

  2. “As I get older, I too rely on increasingly outdated examples.”

    The problem with an older example is not with the example itself (which, if you are still using it years later, is almost certainly a good one).

    The problem is that it gives the impression that, in the years since, you have not found another good instance in which the model/technique/principle applies. This suggests that the model/technique/principle has only rare applicability.

  3. Thanks for the reminder, Andrew. I had forgotten where your methodological attribution problem statement appeared in print.

    On the second page, you write “… where everything is connected to everything else….” That’s a handy catchphrase, but I wonder at its utility. Unless you’re into astrology, the location of Pluto is not really connected to whether I should drink tea or coffee this morning, or to whether a particular species is at serious risk of extinction (since you mentioned ecology). I like the late Dartmouth professor Barry Richmond’s exhortation to “challenge the clouds,” where the clouds represent the boundaries of the model you’re working with: how will your conclusions change if you expand–or contract–the boundaries of your model significantly? That gives actionable advice, and, as you often do, it moves away from a focus on “truth” (“Is X connected to Y?”) and towards a focus on useful learning (“What do we learn if we assume that X is connected to Y?”) (not that truth is a bad thing).

    Your last paragraph starts with what seems like a run-on sentence: “On one hand, I am impressed by modern machine-learning
    methods that process huge datasets with I agree with Kass’s concluding remarks….” If so, do you recall what you meant to follow “huge datasets with”?

    Finally, any comments about the nice panel of maps seen in the light of the paper you and Phil wrote about maps?

    • Bill:

      Here’s what I wrote:

      . . . in my own work, I never study zero effects. The effects I study are sometimes small but it would be silly, for example, to suppose that the difference in voting patterns of men and women (after controlling for some other variables) could be exactly zero. My problems with the “false discovery” formulation are partly a matter of taste, I’m sure, but I believe they also arise from the difference between problems in genetics (in which some genes really have essentially zero effects on some traits, so that the classical hypothesis-testing model is plausible) and in social science and environmental health (where essentially everything is connected to everything else, and effect sizes follow a continuous distribution rather than a mix of large effects and near-exact zeroes).

      1. I don’t study astrology. I agree that zero effects exist, but these are not the sort of thing that I and other serious social scientists spend time studying. When Gertler et al. wrote that paper estimating the effects of early childhood stimulation in Jamaica, I expressed skepticism at their effect size estimates, but I have no doubt that the effects are nonzero. Indeed, the concern is that the effects are highly variable and so they’re hard to study. But that’s not the same as thinking they’re zero. We’re not studying astrology here.

      2. In that sentence about Kass, the copy-editors introduced an error in the production. That “with” should be “and”.

      3. I have been meaning for a long time to write a paper with Yair about all these graphs.

  4. >>>Good methods make use of more information.<<<

    What's your definition of "good"? Isn't it quite common in, say, predictive contests for the winning algorithm to consciously discard a part of the information available?

    Does the best algorithm always have to use all the "features"?

    Why should one rate a method based on its voracious appetite for inputs?

    • I think the key distinction might be that “information” is not the same as data or features, implying a relevance/context and an efficiency. Sort of an information theory kind of thing.

      In that sense, a “bad” feature may contain no information that’s relevant to the task at hand (for example, a column of random numbers), while a “good” engineered feature can be more useful than the raw features from which it was engineered — at least in the context of the particular task and the particular prediction algorithm being used.

      As in the old Data, Information, Insight spectrum.
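
      A toy example along those lines (the setup here is invented just for illustration): with a linear model, a pure-noise column adds essentially nothing, while an engineered feature that carries the relevant information (here, an interaction the model cannot otherwise represent) helps a lot.

        import numpy as np

        rng = np.random.default_rng(3)
        n = 1000
        x1, x2 = rng.normal(size=(2, n))
        y = x1 * x2 + rng.normal(scale=0.5, size=n)   # the signal lives in the interaction

        def r_squared(X, y):
            """In-sample R^2 of an ordinary least-squares fit."""
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ beta
            return 1 - resid.var() / y.var()

        ones = np.ones(n)
        noise = rng.normal(size=n)                    # a "bad" feature: pure noise
        raw = np.column_stack([ones, x1, x2])
        raw_plus_noise = np.column_stack([ones, x1, x2, noise])
        engineered = np.column_stack([ones, x1, x2, x1 * x2])

        print("raw features:           R^2 =", round(r_squared(raw, y), 3))
        print("raw + noise column:     R^2 =", round(r_squared(raw_plus_noise, y), 3))
        print("raw + engineered x1*x2: R^2 =", round(r_squared(engineered, y), 3))

      The random column has information in some raw sense but none that is relevant to predicting y, which I take to be the distinction being drawn here.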
