(This is not a paper we wrote by mistake.) (This is also not Andrew) (This is also really a blog about an aspect of the paper, which mostly focusses on issues around visualisation and how visualisation can improve workflow. So you should read it.) Recently Australians have been living through a predictably ugly debate around […]

**"data twice"**

## Discussion with Sander Greenland on posterior predictive checks

Sander Greenland is a leading epidemiologist and educator who’s strongly influenced my thinking on hierarchical models by pointing out that often the data do not supply much information for estimating the group-level variance, a problem that can be particularly severe when the number of groups is low. (And, in some sense, the number of groups […]

## Everything is Obvious (once you know the answer)

Duncan Watts gave his new book the above title, reflecting his irritation with those annoying people who, upon hearing of the latest social science research, reply with: Duh-I-knew-that. (I don’t know how to say Duh in Australian; maybe someone can translate that for me?) I, like Duncan, am easily irritated, and I looked forward to reading the book. I enjoyed it a lot, even though it has only one graph, and that graph has a problem with its y-axis. (OK, the book also has two diagrams and a graph of fake data, but that doesn’t count.)

Before going on, let me say that I agree wholeheartedly with Duncan’s central point: social science research findings are often surprising, but the best results cause us to rethink our world in such a way that they seem completely obvious, in retrospect. (Don Rubin used to tell us that there’s no such thing as a “paradox”: once you fully understand a phenomenon, it should not seem paradoxical any more. When learning science, we sometimes speak of training our intuitions.) I’ve jumped to enough wrong conclusions in my applied research to realize that lots of things can seem obvious but be completely wrong. In his book, Duncan does a great job at describing several areas of research with which he’s been involved, explaining why this research is important for the world (not just a set of intellectual amusements) and why it’s not as obvious as one might think at first.

## Silly baseball example illustrates a couple of key ideas they don’t usually teach you in statistics class

From a commenter on the web, 21 May 2010: Tampa Bay: Playing .732 ball in the toughest division in baseball, wiped their feet on NY twice. If they sweep Houston, which seems pretty likely, they will be at .750, which I [the commenter] have never heard of. At the time of that posting, the Rays […]

## The old, old story: Effective graphics for conveying information vs. effective graphics for grabbing your attention

One thing that I remember from reading Bill James every year in the mid-80’s was that certain topics came up over and over, issues that would never really be resolved but appeared in all sorts of different situations. (For Bill James, these topics included the so-called Pesky/Stuart comparison of players who had different areas of strength, the eternal question (associated with Whitey Herzog) of the value of foot speed on offense and defense, and the mystery of exactly what it is that good managers do.)

Similarly, on this blog–or, more generally, in my experiences as a statistician–certain unresolvable issues come up now and again. I’m not thinking here of things that I know and enjoy explaining to others (the secret weapon, Mister P, graphs instead of tables, and the like) or even points of persistent confusion that I keep feeling the need to clean up (No, Bayesian model checking does not “use the data twice”; No, Bayesian data analysis is not particularly “subjective”; Yes, statistical graphics can be particularly effective when done in the context of a fitted model; etc.). Rather, I’m thinking about certain tradeoffs that may well be inevitable and inherent in the statistical enterprise.

Which brings me to this week’s example.

## Confusion about Bayesian model checking

As regular readers of this space should be aware, Bayesian model checking is very important to me:

1. Bayesian inference can make strong claims, and, without the safety valve of model checking, many of these claims will be ridiculous. To put it another way, particular Bayesian inferences are often clearly wrong, and I want a mechanism for identifying and dealing with these problems. **I certainly don’t want to return to the circa-1990 status quo in Bayesian statistics, in which it was considered virtually illegal to check your model’s fit to data.**

2. Looking at it from the other direction, model checking can become much more effective in the context of complex Bayesian models (see here and here, two papers that I just love, even though, at least as measured by citations, they haven’t influenced many others).

On occasion, direct Bayesian model checking has been criticized from a misguided “don’t use the data twice” perspective (which I won’t discuss here beyond referring to this blog entry and this article of mine arguing the point).

Here I want to talk about something different: a particular attempted refutation of Bayesian model checking that I’ve come across now and then, most recently in a blog comment by Ajg:

The example [of the proportion of heads in a number of “fair tosses”] is the most deeply damning example for any straightforward proposal that probability assertions are falsifiable.

The probabilistic claim “T” that “p(heads) = 1/2, tosses are independent” is very special in that it, in itself, gives no grounds for preferring any one sequence of N predictions over another: HHHHHH…, HTHTHT…, etc: all have identical probability .5^N and indeed this equality-of-all-possibilities is the very content of “T”. There is simply nothing inherent in theory “T” that could justify saying that HHHHHH… ‘falsifies’ T in some way that some other observed sequence HTHTHT… doesn’t, because T gives no (and in fact, explicitly denies that it could give any) basis for differentiating them.
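One way to see what goes wrong with this argument: T does assign every individual sequence the same probability, but a test statistic summarizing the sequence has a highly non-uniform distribution under T, and a predictive check compares the observed statistic to that distribution. A minimal sketch in Python (the choice of statistic, the length N, and the simulation setup are mine, not the commenter’s):

```python
import random

random.seed(1)
N = 20

def longest_run(seq):
    """Length of the longest run of consecutive equal outcomes."""
    best = cur = 1
    for a, b in zip(seq, seq[1:]):
        cur = cur + 1 if a == b else 1
        best = max(best, cur)
    return best

# Under T, every individual sequence has the same probability 0.5**N,
# but the distribution of the test statistic is far from uniform:
# a longest run of 20 is astronomically rare, a run of 3-6 is typical.
sims = [[random.randint(0, 1) for _ in range(N)] for _ in range(10000)]

all_heads = [1] * N
p_value = sum(longest_run(s) >= longest_run(all_heads) for s in sims) / len(sims)
# longest_run(all_heads) == N, so this tail probability is essentially zero:
# HHHH... gets flagged, while a typical-looking sequence would not be.
```

The equality of sequence probabilities is consistent with inequality of statistic probabilities; the check falsifies T via the latter, not the former.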

## On citation practices, strategic and otherwise

John Sides links to an (unintentionally, I assume) hilarious peer-reviewed article by C. K. Rowley, which begins:

## Why I don’t like Bayesian statistics

Clarification: Somebody pointed out that, when people come here from a web search, they won’t realize that it’s an April Fool’s joke. See here for my article in Bayesian analysis that expands on the blog entry below, along with discussion by four statisticians and a rejoinder by myself that responds to the criticisms that I […]

## Those people who go around telling you not to do posterior predictive checks

I started to post this item on posterior predictive checks and then realized I had already posted it several months ago! Memories (including my own) are short, though, so here it is again:

A researcher writes,

I have made use of the material in Ch. 6 of your Bayesian Data Analysis book to help select among candidate models for inference in risk analysis. In doing so, I have received some criticism from an anonymous reviewer that I don’t quite understand, and was wondering if you have perhaps run into this criticism. Here’s the setting. I have observable events occurring in time, and I need to choose between a homogeneous Poisson process and a nonhomogeneous Poisson process in which the rate is a function of time (e.g., a loglinear model for the rate, which I’ll call lambda).

I could use DIC to select between a model with constant lambda and one where the log of lambda is a linear function of time. However, I decided to try to come up with an approach that would appeal to my frequentist friends, who are more familiar with a chi-square test against the null hypothesis of constant lambda. So, following your approach in Ch. 6, I had WinBUGS compute two posterior distributions. The first, which I call the observed chi-square, subtracts the posterior mean (mu[i] = lambda[i]*t[i]) from each observed value, squares this, and divides by the mean. I then add all of these values up, getting a distribution for the total. I then do the same thing, but with draws from the posterior predictive distribution of X. I call this the replicated chi-square statistic.

If my putative model has good predictive validity, it seems that the observed and replicated distributions should have substantial overlap. I called this overlap (calculated with the step function in WinBUGS) a “Bayesian p-value.” The model with the larger p-value is a better fit, just like my frequentist friends are used to.

Now to the criticism. An anonymous reviewer suggests this approach is weakened by “using the observed data twice.” Well, yes, I do use the observed data to estimate the posterior distribution of mu, and then I use it again to calculate a statistic. However, I don’t see how this is a problem in the same sense that empirical Bayes is problematic to some, because it uses the data first to estimate a prior distribution and then again to update that prior. I am also not interested in “degrees of freedom” in the usual sense associated with MLEs.

I am tempted to just write this off as a confused reviewer, but I am not an expert in this area, so I thought I would see if I am missing something. I appreciate any light you can shed on this problem.
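For concreteness, the kind of check the letter describes can be sketched as follows. This is a minimal sketch with made-up data, a conjugate gamma prior standing in for the letter’s WinBUGS model, and the standard Ch. 6 tail-area p-value P(T(x_rep) ≥ T(x) | x) rather than the letter’s overlap measure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: event counts x[i] observed over exposure times t[i].
# The counts rise over time, so a constant rate should fit poorly.
t = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
x = np.array([2, 1, 4, 9, 14])

# Constant-rate model with a conjugate Gamma(a, b) prior on lambda:
# the posterior is Gamma(a + sum(x), rate = b + sum(t)).
a, b = 0.5, 0.0001
ndraws = 5000
lam = rng.gamma(a + x.sum(), 1.0 / (b + t.sum()), size=ndraws)

mu = lam[:, None] * t[None, :]                   # draws of mu[i] = lambda*t[i]
chi2_obs = ((x - mu) ** 2 / mu).sum(axis=1)      # discrepancy for observed data
x_rep = rng.poisson(mu)                          # replicated data sets
chi2_rep = ((x_rep - mu) ** 2 / mu).sum(axis=1)  # discrepancy for replications

p_value = (chi2_rep >= chi2_obs).mean()
# An extreme p-value (near 0 or 1) signals misfit; with these fabricated
# rising counts, the constant-rate model should produce a small p-value.
```

Fitting the time-varying-rate model and repeating the check would be expected to pull the p-value back toward the middle of (0, 1); the comparison of observed to replicated discrepancies, draw by draw, is what the posterior predictive check formalizes.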

My thoughts:

## Controversies over posterior predictive checks

This is a long one, but it’s good stuff (if you like this sort of thing). Dana Kelly writes,

We’ve corresponded on this issue in the past, and I mentioned that I had been taken to task by a referee who claimed that model checks that rely on the posterior predictive distribution are invalid because they use the data twice.

## Bayesian model selection

A researcher writes,

I have made use of the material in Ch. 6 of your Bayesian Data Analysis book to help select among candidate models for inference in risk analysis. In doing so, I have received some criticism from an anonymous reviewer that I don’t quite understand, and was wondering if you have perhaps run into this criticism. Here’s the setting. I have observable events occurring in time, and I need to choose between a homogeneous Poisson process and a nonhomogeneous Poisson process in which the rate is a function of time (e.g., a loglinear model for the rate, which I’ll call lambda).

I could use DIC to select between a model with constant lambda and one where the log of lambda is a linear function of time. However, I decided to try to come up with an approach that would appeal to my frequentist friends, who are more familiar with a chi-square test against the null hypothesis of constant lambda. So, following your approach in Ch. 6, I had WinBUGS compute two posterior distributions. The first, which I call the observed chi-square, subtracts the posterior mean (mu[i] = lambda[i]*t[i]) from each observed value, squares this, and divides by the mean. I then add all of these values up, getting a distribution for the total. I then do the same thing, but with draws from the posterior predictive distribution of X. I call this the replicated chi-square statistic.

If my putative model has good predictive validity, it seems that the observed and replicated distributions should have substantial overlap. I called this overlap (calculated with the step function in WinBUGS) a “Bayesian p-value.” The model with the larger p-value is a better fit, just like my frequentist friends are used to.

Now to the criticism. An anonymous reviewer suggests this approach is weakened by “using the observed data twice.” Well, yes, I do use the observed data to estimate the posterior distribution of mu, and then I use it again to calculate a statistic. However, I don’t see how this is a problem in the same sense that empirical Bayes is problematic to some, because it uses the data first to estimate a prior distribution and then again to update that prior. I am also not interested in “degrees of freedom” in the usual sense associated with MLEs.

I am tempted to just write this off as a confused reviewer, but I am not an expert in this area, so I thought I would see if I am missing something. I appreciate any light you can shed on this problem.

My thoughts: