Nooooooo, just make it stop, please!

Dan Kahan wrote:

You should do a blog on this.

I replied: I don’t like this article but I don’t really see the point in blogging on it. Why bother?


BECAUSE YOU REALLY NEVER HAVE EXPLAINED WHY. Gelman-Rubin critique of BIC is *not* responsive; you have something in mind—tell us what, pls! Inquiring minds want to know.

Me: Wait, are you saying it’s not clear to you why I should hate that paper??



Certainly what you say about “model selection” aspects of BIC in Gelman-Rubin doesn’t apply.

Me: OK, OK. . . . The paper is called, Bayesian Benefits for the Pragmatic Researcher, and it’s by some authors whom I like and respect, but I don’t like what they’re doing. Here’s their abstract:

The practical advantages of Bayesian inference are demonstrated here through two concrete examples. In the first example, we wish to learn about a criminal’s IQ: a problem of parameter estimation. In the second example, we wish to quantify and track support in favor of the null hypothesis that Adam Sandler movies are profitable regardless of their quality: a problem of hypothesis testing. The Bayesian approach unifies both problems within a coherent predictive framework, in which parameters and models that predict the data successfully receive a boost in plausibility, whereas parameters and models that predict poorly suffer a decline. Our examples demonstrate how Bayesian analyses can be more informative, more elegant, and more flexible than the orthodox methodology that remains dominant within the field of psychology.

And here’s what I don’t like:

Their first example is fine, it’s straightforward Bayesian inference with a linear model, it’s almost ok except that they include a bizarre uniform distribution as part of their prior. But here’s the part I really don’t like. After listing seven properties of the Bayesian posterior distribution, they write, “none of the statements above—not a single one—can be arrived at within the framework of orthodox methods.” That’s just wrong. In classical statistics, this sort of Bayesian inference falls into the category of “prediction.” We discuss this briefly in a footnote somewhere in BDA. Classical “predictive inference” is Bayesian inference conditional on hyperparameters, which is what’s being done in that example. A classical predictive interval is not the same thing as a classical confidence interval, and a classical unbiased prediction is not the same thing as a classical unbiased estimate. The key difference: when a classical statistician talks about “prediction,” this means that the true value of the unknown quantity (the “prediction”) is not being conditioned on. Don’t get me wrong, I think Bayesian inference is great; I just think it’s silly to say that these methods don’t exist with orthodox methods.
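To make “Bayesian inference conditional on hyperparameters” concrete, here is a minimal sketch of a normal-normal calculation for a person’s true test score. The numbers (population mean 100, population sd 15, measurement sd 5, and the three scores) are hypothetical values chosen for illustration and are not taken from the paper; the function name is also made up.

```python
import math
from statistics import NormalDist


def predict_true_score(scores, pop_mean, pop_sd, meas_sd):
    """Normal-normal update with the hyperparameters treated as known.

    Assumed model (illustrative only):
        true_score ~ N(pop_mean, pop_sd^2)
        each score ~ N(true_score, meas_sd^2)
    Returns the mean and sd of true_score given the observed scores.
    """
    n = len(scores)
    ybar = sum(scores) / n
    prior_prec = 1.0 / pop_sd**2   # precision of the population distribution
    data_prec = n / meas_sd**2     # precision contributed by the n scores
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * pop_mean + data_prec * ybar)
    return post_mean, math.sqrt(post_var)


scores = [73, 67, 79]  # hypothetical test scores
mean, sd = predict_true_score(scores, pop_mean=100.0, pop_sd=15.0, meas_sd=5.0)
z = NormalDist().inv_cdf(0.975)
interval = (mean - z * sd, mean + z * sd)
print(f"predicted true score: {mean:.1f}, "
      f"95% interval: ({interval[0]:.1f}, {interval[1]:.1f})")
```

Read classically, the interval is a prediction interval for the unknown true score given known population parameters; read as Bayes, it is a posterior interval. The arithmetic is the same either way.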

Their second example, I hate. It’s that horrible hypothesis testing thing. They write, “The South Park hypothesis (H0) posits that there is no correlation (ρ) between box-office success and “fresh” ratings—H0: ρ = 0.” OK, it’s a joke, I get that. But, within the context of the example, no. No. Nononononono. It makes no sense. The correlation is not zero. None of this makes any sense. It’s a misguided attempt to cram a problem into an inappropriate hypothesis testing framework.

I have a lot of respect for the authors of this paper. They’re smart and thoughtful people. In this case, though, I think they’re in a hopeless position.

I do agree with Kahan that the problem of adjudicating between scientific hypotheses is important. I just don’t think this is the right way to do it. If you want to adjudicate between scientific hypotheses, I prefer the approach of continuous model expansion: building a larger model that includes the separate models as separate cases. Forget Wald, Savage, etc., and start from scratch.

36 thoughts on “Nooooooo, just make it stop, please!”

  1. Regarding their first example, I am also puzzled by their uniform distribution for the standard deviation prior. Why is it necessary to assume a prior distribution for the standard deviation at all? Couldn’t they just assume that the previous literature (which they cite as showing a standard deviation of IQ scores = 12) is the prior?

    As a non-Bayesian (by training, not belief), I have been working on simple and clear examples illustrating the difference between classical statistical reasoning and Bayesian analysis. Their first example comes closer than most of what I have seen. I’ve constructed my own version of their first example but I don’t see any reason to assume that the standard deviation of IQ scores follows a distribution rather than just using the standard deviation estimate based on prior studies. Can someone explain whether it is necessary to assume that the standard deviation follows a non-degenerate distribution, and if so, why that is necessary for illustrating the difference in techniques?

    • Because you are not absolutely certain that the standard deviation of IQ scores equals 12 (additional question: does previous research indicate a standard deviation of 12 for tests overall, or the particular tests used on that hypothetical prisoner, e.g. WAIS IV?), under the general principle of building a model to fit your understanding of the world as closely as possible, it makes sense to express your uncertainty in its actual value with a probability distribution.
      Pragmatically, this should lead to more honest, less over-confident inference. With an informative prior, though, the difference would be much smaller than (although in the same spirit as) the difference between using a t and a normal distribution for a sample average.

      When fitting models in Stan, this almost isn’t any extra work over using a point estimate in the first place.

    • The quantity they are representing with the uniform distribution is the measurement error inherent in the IQ testing procedure. So, for example if you made up 10 separate IQ tests and gave them to a single person over a period of a month or so, you’d see some variation. That’s not the same variation as if you give a single test to 10 different people, which is the variability given as 12.

      In practice for this problem, most likely the uniform distribution isn’t that bad. The measurement error can’t be much bigger than the variability across people, and it logically can’t be smaller than zero. So uniform(0,15) covers pretty much everything. Shortening that interval a little on the bottom side makes sense.

      In practice for positive scale parameters I usually use a gamma distribution gamma(n,n/m) where n is an “effective sample size” and m is an expected value for the parameter… gamma(3,3/5) seems like it’d be pretty good for an IQ test where we’ve observed sd=12 across multiple people in the past.
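As a rough check on the gamma(n, n/m) suggestion above, the sketch below summarizes gamma(3, 3/5). It assumes the shape/rate parameterization (the one under which the mean equals m), and it uses the closed-form CDF that is available only when the shape is a positive integer. The helper names are made up for this illustration.

```python
import math


def gamma_summary(n, m):
    """Mean and sd of gamma(shape=n, rate=n/m); the mean works out to m."""
    rate = n / m
    return n / rate, math.sqrt(n) / rate


def erlang_cdf(x, shape, rate):
    """CDF of a gamma distribution with positive integer shape (Erlang)."""
    lam_x = rate * x
    partial = sum(lam_x**k / math.factorial(k) for k in range(shape))
    return 1.0 - math.exp(-lam_x) * partial


mean, sd = gamma_summary(3, 5)
mass_below_15 = erlang_cdf(15.0, shape=3, rate=3 / 5)
print(f"mean={mean:.2f}, sd={sd:.2f}, P(sd parameter < 15)={mass_below_15:.4f}")
```

Under these assumptions the prior is centered at 5 and puts nearly all of its mass below 15, so it covers roughly the same range as uniform(0,15) while favoring smaller measurement error.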

      • Thank you for these helpful comments. As a pedagogical matter, I think it is better to start with a degenerate value for the standard deviation (and, to Daniel’s point, I think it is good to start with 12 with the understanding that the variability for an individual is not likely to be greater than the variability across individuals). Then, as a next step, I would introduce uncertainty in the standard deviation. Jumping to a more general solution does not help develop an appreciation for how the analysis differs from a classical statistical analysis (in my opinion). It also helps me understand how prior information can be included. I was actually somewhat confused about what standard deviation to use since the literature is based on variability across individuals, not individual variability over tests (there may well be research on both that can be brought to bear). But it is precisely the richness of these decisions that I find appealing about Bayesian analysis.

  2. >If you want to adjudicate between scientific hypotheses, I prefer the approach of continuous model expansion: building a larger model that includes the separate models as separate cases.

    Can you recommend an applied paper that does scientific hypothesis adjudication with this approach?

  3. It’s the apparent strategic message that bothers me, which I perceive as: just adopt Bayesian methods (having learned just some of the basics) and it’s a bed of roses to get sensible and credible analyses (that the scientific community will profit from).

    A less wrong message would be that learning from data is risky but persistence in grasping the context of the data and how it came about may give rise to models and techniques that sufficiently reduce that risk to make it worthwhile. With that, and adequate expertise, Bayesian and orthodox methods both can likely be made to work well.

    Also reminds me of Parizeau’s lobster strategy – get folks to choose your side then like lobsters thrown into boiling water they can’t get out.

    As for the predictive interpretation I recalled this from 10 years ago

  4. “The correlation is not zero.”

    I don’t think they’re saying it is; they’re saying that this is one of the hypotheses under consideration, and they evaluate the evidence for it. Are you just forcefully asserting your prior that the South Park hypothesis is false?

    • Yes, we can be absolutely sure the correlation is not 0.0000000000000000000…..

      Conceivably it could be very low, but it cannot literally be 0. There is no point testing a hypothesis that you know to be false.

      • But that is true for every point hypothesis. That invites two more questions:
        1) Is the issue indeed with point hypotheses and would everyone be fine with this if the “rho = 0” were replaced by “rho ~ U(-e,e)” with e small?
        2) Isn’t every model M a point hypothesis with respect to some higher-order model M’ (i.e., where M’ expands M by including an additional continuous parameter whose value can be fixed to some value in order to obtain M)? How many turtles deep is the right turtle depth?

        • Joachim:

          Regarding point 1, I don’t see why there’s interest in whether rho is between -.01 and .01, or whatever.

          Regarding point 2, yes, I think the bigger model should be better but there is a cost in expanding the model, hence I do model checking to decide where the model should be expanded. There are lots of open questions here and room for more research; I just don’t really like the methods proposed in the paper discussed above.

        • I’m just curious about the source of your dislike if it’s about the methods — I hear an argument about some hypotheses not being interesting to you and that’s fine, but what’s methodologically wrong with assigning probabilities to qualitatively different models? How would you formally check if the zero-correlation model needs to be expanded to include a non-zero correlation?

        • Joachim,

          The problem with these Bayesian model averaging methods is that the marginal likelihood of a model typically depends very strongly on aspects of the model that are set arbitrarily. For example, change your prior on a parameter from normal(0,100) to normal(0,1000) and your marginal likelihood changes by a factor of 10. I discuss this further in chapter 7 of BDA3 (or chapter 6 of the earlier editions) and in my 1995 paper with Rubin.
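The factor-of-10 claim can be checked in the simplest conjugate setting. The sketch below assumes a model that is not in the thread: n observations y_i ~ N(theta, sigma^2) with sigma known and prior theta ~ N(0, tau^2), where normal(0,100) is read as sd 100. It works with the sample mean, whose marginal density differs from the full-data marginal likelihood only by a factor that does not involve tau. The data summary (ybar = 2, n = 50) is hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def marginal_density_of_mean(ybar, n, sigma, tau):
    """ybar | theta ~ N(theta, sigma^2/n) and theta ~ N(0, tau^2)
    imply ybar ~ N(0, sigma^2/n + tau^2) after integrating out theta."""
    return normal_pdf(ybar, 0.0, sigma**2 / n + tau**2)


def posterior_mean(ybar, n, sigma, tau):
    """Posterior mean of theta: the sample mean shrunk toward the prior mean 0."""
    shrink = tau**2 / (tau**2 + sigma**2 / n)
    return shrink * ybar


ybar, n, sigma = 2.0, 50, 1.0  # hypothetical data summary
ratio = (marginal_density_of_mean(ybar, n, sigma, tau=100.0)
         / marginal_density_of_mean(ybar, n, sigma, tau=1000.0))
shift = abs(posterior_mean(ybar, n, sigma, 100.0)
            - posterior_mean(ybar, n, sigma, 1000.0))
print(f"marginal likelihood ratio (sd 100 vs sd 1000): {ratio:.3f}")
print(f"change in posterior mean: {shift:.2e}")
```

In this toy setting the marginal likelihood moves by about a factor of 10 while the posterior mean barely changes, which is the asymmetry described above.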

        • If the prior and marginal likelihood are arbitrary, then so are the predictions made by the model. It seems to me that if the predictions made by a model are arbitrary, then the model is poorly defined and you have bigger fish to fry. I don’t think the dependence on prior information is a weakness of method — it is simply a property of inference.

          I don’t see how one can avoid the dependence on assumptions in the formal model check that is needed to determine whether a model needs expanding (since the expanded model under consideration would have priors also), so I’d be curious to learn how you do model checking.

          That said, I understand that this is a matter of taste. I personally think “robustness against prior information” more often indicates a pathological model than a virtuous one.

        • Joachim:

          No, changing the prior in this way will have essentially zero impact on inferences and predictions conditional on the model. As I discuss in chapter 7 of BDA, I think the right thing to do in these settings is to build a larger model that includes the individual models as special cases, that is, continuous model expansion.

        • We may be talking about different scenarios. To riff on the example in the paper: if I’m measuring someone’s IQ, then “A: IQ ~ N(100,15)” and “B: IQ ~ N(100,100)” encode very different expectations and the prior predictive of their IQ score will be very different (suppose I want to take some action if I’m confident that the IQ is in a certain range). Importantly, one of these assumptions makes a lot more sense than the other!

          I don’t have BDA handy right now, but what procedure would you recommend to evaluate if a continuous model expansion was worth it?

        • Joachim,

          I don’t really have it in me to explain it all again here, but very briefly: Think about two regression models, one is y = X*beta + error, the other is y = W*gamma + error, where X and W are two different (possibly overlapping) sets of predictors. I’m assuming things are roughly on unit scale so all the coefficients are well below 10 in absolute value, I’m also assuming no collinearity and a reasonable amount of data so that either regression can be estimated with no problem. Various priors for beta or for gamma will be essentially equivalent, for example independent priors on the components of beta with mean 0 and sd 10 or sd 100 or sd 1000 or whatever, and the same sort of thing for gamma. Changing these prior sd’s from 100 to 1000 won’t matter except for either model but it will have a huge effect on the marginal likelihood. So, to start with, if you want to do Bayesian model averaging you have to be really really serious about the priors, even in a large N, small error setting where otherwise the priors won’t matter.

          For continuous model expansion I’d want to fit a model including all the X’s and W’s, adding prior info to the coefs as necessary.

        • I’m sorry, this sentence seems critical and I don’t understand what you mean:

          “Changing these prior sd’s from 100 to 1000 won’t matter except for either model but it will have a huge effect on the marginal likelihood.”

          You’re saying it does matter for the marginal likelihood, and for either model, but it won’t matter for… what else? Each model’s predictions are made by the marginal likelihood, which is affected by the prior. Therefore, a model’s predictions are affected by the prior. Therefore, for a model to be well-specified, the prior must be specified.

          Taking your example (sigmas are variances):
          H1: y ~ N(beta*X, sigma)
          H0: y ~ N(gamma*W, sigma)

          We can simplify for the sake of discussion, and say that W = 0, X = 1, y = 0, and sigma = 1. Gamma no longer matters, and let’s give beta a prior: beta ~ N(0, tau).

          The predictive distribution for y under H0 is now (y|H0) ~ N(0, sigma) and under H1 it is y ~ N(0, sigma+tau). The latter includes the prior. In order to evaluate whether the added parameter was worth adding, the suggested approach simply moves along and compares the support at y under H0 to the support under H1:
          N(0, sigma+tau)/N(0, sigma) = 1/sqrt(1+tau)

          … still a function of the prior. I understand that you have to be serious about priors, but I have no idea how one could make inferences in which the model assumptions don’t feature. That just seems wrong.

          Perhaps I am not understanding the difference between the actions “doing model selection” and “deciding if model expansion is warranted”?

          I’ll check out BDA3 Ch.7 today.
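The density ratio in the simplified example above (W = 0, X = 1, y = 0, sigma = 1, with sigma and tau read as variances) can be checked numerically. For two zero-mean normal densities evaluated at y = 0 the ratio reduces to sqrt(sigma/(sigma+tau)), which is 1/sqrt(1+tau) when sigma = 1. This is only a sketch; the function names are made up for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x (second argument is a variance)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def density_ratio_h1_over_h0(y, sigma, tau):
    """Predictive density under H1, N(0, sigma+tau), over that under H0, N(0, sigma)."""
    return normal_pdf(y, 0.0, sigma + tau) / normal_pdf(y, 0.0, sigma)


ratios = {}
for tau in (1.0, 100.0, 10000.0):
    ratios[tau] = density_ratio_h1_over_h0(0.0, 1.0, tau)
    print(f"tau={tau:>8}: ratio={ratios[tau]:.5f}, "
          f"1/sqrt(1+tau)={1.0 / math.sqrt(1.0 + tau):.5f}")
```

The ratio keeps shrinking as tau grows, so the comparison between the two models depends on the prior variance even when the data are held fixed.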

        • Joachim:

          I garbled that sentence where I wrote, “Changing these prior sd’s from 100 to 1000 won’t matter except for either model but it will have a huge effect on the marginal likelihood.”

          Here it is, ungarbled (and slightly lengthened for clarity): “Changing these prior sd’s from 100 to 1000 will have essentially no effect on the inferences conditional on either model but it will have a huge effect on the marginal likelihoods.”

        • It’s not true that every point hypothesis is false. The rest mass of the positron could be exactly equal to that of the electron. The speed of light in a vacuum could be exactly equal for every observer. There aren’t many examples like this, but there are some. There are nearly none (maybe literally none) in the social sciences, or in the field of human behavior in general. I think hypothesis testing nearly never makes sense in those contexts.

          I think your questions 1 and 2 are good questions, and worthy of exploration. With regard to the first of them: Yes, in almost every context it makes sense to ask whether such-and-such an effect is big enough to be important. (Indeed, I think it’s very unfortunate that statistics uses the word ‘significant’ to mean something totally different from ‘important’, and I think a lot of misunderstanding is engendered when a statistician says that an effect is ‘significant’ and a lay listener interprets this to mean it is big enough to be important). So, yeah, there’s no point testing whether such-and-such an effect is literally zero, but it often makes sense to try to estimate the size of an effect and see whether it’s big enough to be important. If you conclude that a bad Adam Sandler makes just about as much money as a good Adam Sandler movie [sic] then don’t bother hiring a script doctor or whatever.

  5. Eric has to stop hanging out with Richard. I had a lot of respect for Eric before that.

    I agree Andrew, I respect some of these authors quite a bit but these tortured and inappropriate efforts to sell Bayesian methods have got to stop or they (one of them at least) are going to shoot their own reputation down.

    • Psyoskeptic:

      I don’t think any reputations are getting shot down. I consider the sort of work in the paper under discussion to be speculative. I think it is not well founded, but various not-well-founded ideas can work ok. Hell, even p-values have solved a lot of problems. A couple years ago in this blog I discussed how I’d underestimated the importance of the lasso idea, just because the justifications given for the method didn’t make sense. Lots of people have wrongly dismissed Bayesian methods because they felt that the idea of the prior distribution was unscientific. So, sure, I don’t like the method described in this paper and I made no bones about it. I think their method won’t be useful. But, who knows, maybe I’m wrong. It’s fine for them to put it out there, and if these authors do other good work, that’s fine.

    • Yeah that Richard. He is something else. Always ruining reputations. Did you know he blindfolds and beats his coauthors until they agree with his Bayesian demagoguery? You should remain behind your anonymous mask, psyoskeptic, so that he can’t get you with his evil powers.

    • “Eric has to stop hanging out with Richard. I had a lot of respect for Eric before that.”

      This is super presumptuous on several different levels, and the implication is also just wrong. EJ’s been selling Bayesian methods since I was in grad school (for example, this was written before I even knew EJ, and before I got my PhD: Like them or dislike them, these examples are EJ’s and they come from his talks. Don’t give me credit for the work of my collaborator who is my senior.

      Since the implication also seems to be that I’m somehow responsible for EJ’s advocacy of Bayes factors, let me state my opinion about BFs here. I’ve said all this elsewhere on social media and in courses I teach, but here it is again.

      Bayes factors are an inescapable consequence of Bayes’ theorem. However, this does not mean that Bayes factor point-null hypothesis testing is appropriate in all situations. Or most situations. When I consult, well more than half the time I steer researchers away from them for exactly the reason Gelman stated in the post. They’re just not right for many research questions. When I teach Bayes, I do not spend much time on BFs; I focus on building/checking models (which can be done in roughly the same way as a classical statistician would do) and interpreting posteriors. When I review, I suggest removing Bayes factor point-null tests where they are not appropriate or helpful, in spite of this reducing the visibility of my own work. When I analyse data, I most often rely on posterior estimation of complex models with model checking, in a manner similar to Gelman’s “continuous model expansion”.

      • +1 to Richard Morey’s comments.

        To repeat myself, I feel that people should aspire to acquiring the understanding that people like Richard have, so that they can make their own decision on how to proceed in each individual case. People should not just blindly copy this or that technique as the *the* solution that can be applied again and again in a McDonald’s I’ll-have-this-to-go approach.

        The response I get when I say this to people starting out in academia is that they don’t have the time to acquire the knowledge, they have to get tenure and therefore publish enough first. Of course, the same people would not hesitate to take the time to acquire the relevant knowledge in their own specialization. It’s only statistics that is considered to be something to be outsourced to experts, something that needs less attention. Somehow statistics needs to become embedded as part of the core curriculum of every scientific field. When one does a PhD in ling, one has to have relatively good mastery in all the areas, phonetics, phonology, morphology, syntax, semantics, sociolinguistics, pragmatics, psycholinguistics, computational linguistics, etc. The same must go in psychology and other areas. Statistics is an afterthought in these areas, and that’s where the problems begin. If the Morey and Wagenmakers paper being criticized here were to appear in the kind of ideal environment where statistics were already embedded as part of the core knowledge, there would be less room for misuse. But in that ideal case, even frequentist methods would be OK for many situations. Editors wouldn’t hanker after low p-values as more evidence for the specific alternative, they wouldn’t complain about “useless” replications, they would know that p-values tell you nothing about replicability of the result, that model assumptions matter, and transparency matters. In such an environment, if someone were to disagree with the approaches Morey and Wagenmakers propose, they can go ahead and reanalyze the data themselves and show why their conclusions were wrong (if they were).

        • >”Statistics is an afterthought in these areas, and that’s where the problems begin.”

          I agree those classes are an afterthought (at least that is a very apt description of my biomed experience), but wouldn’t be so sure that isn’t because at some level the thing taught as “statistics” is recognized to be a way to just legitimatize BS.

          I was pretty much told that grad school stats teachers get fired if they don’t help to produce papers, by whatever means. To avoid Godwin’s law, basically my impression is that teaching NHST can be primarily attributed to learned helplessness.

      • I should also emphasise that it is important to understand that substantive claims are made at a different level than statistical claims. The critique that “the correlation is not precisely zero” can be answered by pointing out that *the correlation* — in reality — isn’t anything. The correlation is a number that lives in an abstract, mathematical universe. “Populations” aren’t “normal” – in fact, “populations” don’t really exist in the statistical sense. But the statistical statement that, for these data, the (relative) evidence for the correlation being 0 is X versus a particular reasonable alternative may be useful in making a substantive point: for instance, that there is (much) more to Sandler making money than critical ratings. Ideally, a substantive claim should be supported with multiple lines of evidence, each thoroughly checked; not just a single number.

        There is a tension between statistical pedagogy and practice, in the sense that most good data analysts learn to not take their statistics too seriously. It is difficult to train this nuance.

  6. Hi Andrew,

    Alas, I’m afraid that my morbid fascination with null hypothesis testing may not stop any time soon…
    Personally I am persuaded by the Wrinch and Jeffreys (1921) argument for assigning mass to a point null. See the historical overview by Alexander Etz, in press for Statistical Science (preprint at But I also like to think that if you were operating in my field those nulls would suddenly look a lot more plausible (and vice versa, in your field I agree the null is often not of interest).

    Anyway, I am not sure I follow your claim about the predictive intervals. Bayesian dogma (Pratt, Lindley) has it that classical inference cannot provide, in the IQ example: (1) the probability of Bob’s true IQ falling between x and y; (2) the relative plausibility that Bob’s true IQ is z versus y. Are you saying that you can answer these questions with predictive intervals? It seems unlikely to me, because questions 1 and 2 require a prior, and their answer therefore also varies with the prior.


  7. Hi Andrew,

    If I understand, you consider the model rho=0 objectionable. Let’s agree that models are neither true nor false, data do not follow normals, etc. The question of course is whether rho=0 is a theoretically useful statement. I think in many contexts point mass is a useful, theoretically interesting statement of constraint and invariance. Certainly, much of science has proceeded with the notion that point masses, equalities, are theoretically important (see, say, physics). Do you always object to points or do you object in this example? My tendency is to give deference to substantive researchers on whether points are theoretically useful or not. It’s not a statistical issue to me.


  8. I have a related but different worry with proposals along these lines.

    I started fitting linear mixed models after it was suggested to me by the Ohio State Statistical Consulting service that that is what I should be doing. I installed the nlme package and cut along the dotted line as instructed by the statisticians at OSU. This was all fine because they signed off on my analysis; I myself had no idea about the underlying machinery (this was around 2000). There was an in-retrospect hilarious moment when the statistics grad student helping me asked me what the variance covariance matrix in my analysis looked like and I had really no idea what that was and how to even find out the answer. She had to send me the relevant command for me to print out the vcov matrix.

    Then lme4 came along and I started using that just like every other psycholinguist does, gradually starting to understand more and more about the underlying theory. It was only after I did the MSc in Stats at Sheffield that I finally got something closer to the full picture. In retrospect, the canned analyses I did did not serve me well. What I should have done in 2000 was take a course or two in statistics *in the statistics department*. There is usually no point taking a course in a linguistics or psych department because you will invariably get a distorted and very incomplete picture (ironic because I teach statistics in a linguistics dept).

    The central problem is that canned software encourages distance from the details. This kind of convenience is a very dangerous thing. Bayes will probably go the same way as the abuse of frequentist methods for this reason. rstanarm and JASP are great when you are an expert like Gelman, Morey, Rouder, or Wagenmakers, but they are a deadly tool in a novice’s hands, especially if the novice has the idea—encouraged actively in the psych* world at least—that knowing the details of how the underlying moving parts work is optional. People provide canned one-and-done “recommendations”, and it’s downhill after that.

      • Shravan:

        Thanks for relating your experiences here as that makes underlying pragmatic concerns for scientific communities very salient.

        (My experiences here involved _deprogramming_ clinical experts who had bought into lay introductions to Bayesian methods as being essentially infallible – e.g. believed a prior (that was just a software default) fully captured all past clinical knowledge.)

        If you are not already aware you might find this an interesting discussion of similar concerns

    • That’s what got me to like JAGS so much, after reading Andrew and Jennifer’s book. Actually writing out the model explicitly, in order to fit it, does wonders for understanding.

      • Me, too, though it was BUGS through R2WinBUGS that got me. I understood how BUGS worked and what a generative model was and how the model related to a density, but Andrew and Jennifer’s book was the only thing I could understand from the stats world. Everything else (including BDA) seemed to be speaking a foreign language.

  9. As a pragmatic researcher myself, I would have really liked the cited paper to give more of a detailed exposition, showing precisely how the two problems are solved! The resulting graphics are interesting, but there wasn’t a clear enough path for me (other than the source code), explaining the thought processes.
