
But when you call me Bayesian, I know I’m not the only one

Textbooks on statistics emphasize care and precision, via concepts such as reliability and validity in measurement, random sampling and treatment assignment in data collection, and causal identification and bias in estimation. But how do researchers decide what to believe and what to trust when choosing which statistical methods to use? How do they decide the credibility of methods? Statisticians and statistical practitioners seem to rely on a sense of anecdotal evidence based on personal experience and on the attitudes of trusted colleagues. Authorship, reputation, and past experience are thus central to decisions about statistical procedures.

The above paragraph is the abstract for the article, Convincing Evidence, by Keith O’Rourke and me, which appeared in the just-published volume, “Roles, Trust, and Reputation in Social Media Knowledge Markets,” edited by Sorin Matei and Elisa Bertino.

Here’s how we begin:

The rules of evidence as presented in statistics textbooks are not the same as the informal criteria that statisticians and practitioners use in deciding what methods to use.

According to the official rules, statistical decisions should be based on careful design of data collection, reliable and valid measurement, and something approximating unbiased or calibrated estimation. The first allows some choice of assumptions and an opportunity to increase their credibility, the second tries to avoid avoidable noise and error, and the third restricts attention to methods that are seemingly fair. This may be fine for evaluating psychological experiments, or medical treatments, or economic policies, but we as statisticians do not generally follow these rules when considering improvements in our teaching, nor when deciding what statistical methods to use.

Did Fisher decide to use maximum likelihood because he evaluated its performance and the method had a high likelihood? Did Neyman decide to accept a hypothesis testing framework for statistics because it was not rejected at a 5% level? Did Jeffreys use probability calculations to determine there were high posterior odds of Bayesian inference being correct? Did Tukey perform a multiple comparisons analysis to evaluate the effectiveness of his multiple comparisons procedure? Did Rubin use matching and regression to analyze the efficacy of the potential-outcome framework for causal inference? Did Efron perform a bootstrap of existing statistical analyses to demonstrate the empirical effectiveness of resampling? Do the authors of textbooks on experimental design use their principles to decide what to put in their books? No, no, no, no, no, no, and no. . . .

We continue:

How, then, do we gain our knowledge about how to analyze data? This is a question that arises over and over as we encounter new sources of data that are larger and more structured than ever before. . . .

I don’t have all the answers, but I think these are important questions.


  1. zbicyclist says:

    A nice, quick read with interesting examples.

    With younger researchers (in an industry context) I seemed to see a back-of-the-book bias: an assumption that methods toward the back of the book (or as far as they got in the book) were better.

    And, in some sense they were. Clients were likely to be slightly familiar with the buzzwords, but not understand the methods at all (and somewhat afraid to ask questions about them, because nobody likes to ask a dumb question).

    • zbicyclist says:

      I think I have to follow some of the references on the issue of when cross-validation works better and when it does not work so well.

      As I get older, I get increasingly interested in cross-validation / holdout samples / researcher replication issues.

      Woody Hayes used to say he didn’t use the forward pass much because three things could happen and two of them were bad [*]. Similarly, for researchers seeking to validate their work, three things can happen and two of them are bad.

      (a) you validate / replicate [complete the pass]
      (b) you fail to replicate [interception]
      (c) you sort of validate, but the effect looks different, typically smaller and perhaps with the indicators in a different order [incomplete pass]. This may lead to yet further work to sort this out.

      Now you remind me of the fact that different forms of validation work better for different methods, a topic I’ve tried not to think about. ;)
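The three outcomes above can be sketched as a toy check on a holdout estimate. To be clear, the function name, the effect values, and the 50%-attenuation threshold below are all hypothetical illustrations, not anything from the paper or the comment:

```python
def classify_replication(original_effect, holdout_effect, tol=0.5):
    """Toy classifier for the three outcomes above (thresholds are arbitrary).

    Returns one of:
      'replicates' - holdout effect close to the original   [complete pass]
      'fails'      - holdout effect near zero or reversed   [interception]
      'partial'    - same sign but noticeably attenuated    [incomplete pass]
    """
    if original_effect == 0:
        raise ValueError("original effect must be nonzero")
    ratio = holdout_effect / original_effect
    if ratio >= 1 - tol:
        return "replicates"
    if ratio <= 0:
        return "fails"
    return "partial"

print(classify_replication(2.0, 1.9))   # replicates (complete pass)
print(classify_replication(2.0, -0.3))  # fails (interception)
print(classify_replication(2.0, 0.6))   # partial (incomplete pass)
```

The "partial" branch is the interesting one in practice: a same-sign but attenuated effect is exactly the case that, as the comment says, leads to further work.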

      • Keith O'Rourke says:


        For a self-funded (wealthy) scientist (in the true sense of the word) only one is bad and that’s a!

        (a) you validate / replicate [doubt is/stays resolved and inquiry (learning) ceases]
        (b) you fail to replicate [doubt results and inquiry resumes]
        (c) you sort of validate, but the effect looks different, typically smaller and perhaps with the indicators in a different order [some new doubts result and inquiry into those commences]. This may lead to yet further (productive/insightful) work to sort this out.

        Now, (a) might allow one to address doubts somewhere else for a while, and those may be more important, but a validation/replication should never (for empirical questions) be taken as beyond being somehow mistaken [(a) will almost always later be found to be mistaken, and no one will ever be sure when an exception to this has occurred].

  2. Dale Lehman says:

    In other words, statisticians are people too?

    Yes, and that is part of the difficulty in teaching and learning statistics. Our experience of the world is as individuals and the subject treats individuals as just one observation. As a result, various psychological barriers stand between what statistical theories tell us and the way we approach problems, view the data, decide whether our analysis is “correct” or not, decide when we have done sufficient analysis, etc. Our theoretical knowledge of the proper importance to attach to individual observations and stories is simply no match for the biological and psychological forces that we are all subject to.

    It seems to me that there are a variety of mechanisms that can be used to overcome (to some extent) these problems – joint work should be preferred to single-authored work, data should be openly provided, more open reviewing of work submitted for publication should be used in the review process, etc. All of these work against the traditional incentives in academia and research.

  3. Rahul says:

    Are you asking why people do not run a controlled experiment for every decision they take? I didn’t get the core point.

  4. Anonymous says:

    Such a great title. Thanks for a nice Monday morning laugh.

  5. Philippe says:

    Factual knowledge will never provide support for a scientific method because there is an infinite regress problem: you always need to assume some method to obtain any factual knowledge. So epistemology precedes (in an a priori sense) whatever knowledge one assumes to have inferred from the data. Which is why you need epistemologists and philosophers of science.

    • Elrod says:

      I think this is an optimization problem, where we at worst may inch down error gradients to find local minima rather than suffer infinite regresses.

      We have a bottom line of things we ultimately care about: after all, why are we doing anything at all? Our interests are attached to reality, whether we are trying to predict human behavior, the impacts of a policy, or how a new drug will benefit an Alzheimer’s patient.
      To begin with, we may speculate that an experimental method’s effectiveness can be judged by how well it shifts and concentrates probability mass onto the predictions that were later found to actually take place and be supported. That is, how well it allowed us to make predictions, to control what we anticipate and what we do not.
      The methods we use can be improved based on this initial assumption. The criterion itself may then be judged and refined based on how well its use serves the bottom line. This may result in other evaluations being revised, and through an iterative process we may converge on a local optimum for goal-meeting efficacy.

      Higher levels of recursion are avoided, because tying ourselves to the reality we care about makes each repetition a reiteration of the same question: “How well does method Q lead to the real-world results we care about?”
      Asking instead about the effectiveness of the preceding update procedure (which begs the question of the effectiveness of whatever answer is then given) would lead to an infinite regress, drifting ever further from the real world we actually care about in both actuality and theory.

      We need to keep our eyes on the goal. The goals, of course, should be something more along the lines of “let’s find out just how Alzheimer’s patients respond to this drug,” and less “let’s try to publish something exciting in Psychological Science.”
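The iterative idea above can be sketched as a toy comparison of candidate "methods" scored by how close their predictions land to what later actually happened. The data, the two candidate estimators, and the scoring rule are all made-up illustrations of the loop the comment describes:

```python
# Toy sketch: score candidate "methods" (here, two simple estimators)
# by how well they predict a held-out real-world outcome, then keep
# the best. The numbers are invented for illustration only.
import statistics

observed = [2.1, 1.9, 2.4, 2.0, 5.0]  # past data, with one outlier
later_outcome = 2.05                  # what "actually took place"

methods = {
    "mean": statistics.mean,
    "median": statistics.median,
}

def score(predict, data, truth):
    # Lower score = prediction concentrated closer to reality.
    return abs(predict(data) - truth)

best = min(methods, key=lambda name: score(methods[name], observed, later_outcome))
print(best)  # median: it is less fooled by the outlier
```

One could then refine `score` itself by the same criterion (how well its recommendations serve the bottom line), which is the single extra level of recursion the comment argues is enough.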

  6. Bill Jefferys says:

    Andrew, in your paper you have a sentence:

    “So I want back and improved the model (with the collaboration of my student Yair Ghitza).”

    I think that you meant ‘went’ instead of ‘want’. Probably an automatic spelling-corrector error that wasn’t noticed because when it makes the automatic correction it doesn’t tell us (like by underlining it in red).

    The plagues of modern life!

  7. Anonymous says:

    one way to formalize this is to say the analysis model itself is the hyper-hyperprior.

  8. Forget statistics, you can’t even do mathematics this way. Frege gave it a shot in the Begriffsschrift, Russell found a hole in the comprehension principle (with the elusive man who shaves everyone who doesn’t shave himself), then Gödel put the nail in the coffin of purely “formal” mathematics by proving that if your system is strong enough to do (Peano) arithmetic (i.e., induction), then it’s too strong to reason about its own truths. You also can’t write a computer program that will inspect another program and an input and determine whether that program will halt on that input. Though arguably it all goes back to Cantor’s diagonalization proof that the set of all sets of integers is bigger than the set of integers (in the sense of there being no one-to-one mapping between the sets).

    • Keith O'Rourke says:


      I don’t think either of us was thinking of deduction about deduction but rather induction about induction – given our own and others’ (of different backgrounds) experiences in analyzing studies – what are good methods, when, and for whom?

      In a given study, did the analysis seem to resolve our doubts about what could be learned from the study (given where we were when it was analyzed) while leaving open (maximizing actually) avenues for future doubts to arise about this?

      Across different studies and analysts, in principle, this could be quantitative induction, where the same data sets are analysed in different ways by different people (randomized, even), but I think mostly it will be qualitative (as for many studies no one will know the _correct conclusions_, i.e., what would be learned from the study given fully adequate analysis).

      From the paper: “Theory and simulations are only as good as their assumptions; results from toy problems and benchmarks don’t necessarily generalize to applications of interest; cross-validation and external validation can work for some sorts of predictions but not others; and subject-matter experts and paying customers can be fooled.”

  9. Rahul says:

    Just because one proposes a new method does not mean one thinks it is the right method for every problem. Ergo I don’t see why Tukey ought to use his new method to vet his new method.

    Pulling oneself over a fence with one’s own bootstraps doesn’t seem like a common process.

    • george says:

      Pulling oneself over a fence with one’s own bootstraps doesn’t seem like a common process.

      Even pulling oneself out of quicksand this way is a bit questionable.

    • Andrew says:


      Nobody is claiming that statistical design, measurement, and inference are appropriate for every problem. But I do find it striking that statisticians (including myself) present statistics as a general way of thinking about problems, but we don’t directly use statistics to make many of our most important decisions and inferences; instead we rely on the sort of anecdotal reasoning that we often criticize when it is done by others.

      I’m not saying we’re necessarily doing things wrong; still, it’s an interesting contrast that I think is worth looking into, a good Why question, one might say.

      • Rahul says:

        Of all the decisions (even important decisions) a person makes throughout his life, a very very small fraction go through the formal statistical process.

        Most of our decision making happens via intuition & heuristics.

        The obvious explanation seems high overheads.

        • Andrew says:


          Yes, and, as I said, I’m not saying we should use Bayesian methods (say) to decide whether to use Bayesian inference to analyze our data.

          My point is that statisticians don’t seem to have a good way of talking about the use of intuition and heuristics to make decisions. For example, when we write about education research we often talk about all the flaws of anecdotal inference and decision making, and we make a big deal about random assignment, random sampling, and (if we are sophisticated) accurate measurement of treatments, background variables, and outcomes. But then when we decide how to teach our own classes, all that gets thrown away.

          Similarly, if you read statisticians writing about what method to use, they tend to just assert that their approach is best, without giving any systematic evidence. Computer scientists are better in that they do all those bake-offs (which have their own problems, but that’s another story). Again, there’s a disjunction between what we do, and our ideology for how we are supposed to decide what to do.

          • Martha says:


            Can you give any evidence supporting your statement, “Similarly, if you read statisticians writing about what method to use, they tend to just assert that their approach is best, without giving any systematic evidence. “?

            (Or is the phrase “they tend to” so vague that your statement is open to a wide variety of interpretations?)

            • Andrew says:


              It’s just not something I ever see in this setting. Yes, a statistics paper will offer hard evidence in the context of a single example or class of models, showing that method A outperforms method B, but I think of that more as local reasoning than as supporting larger decisions of how to proceed when doing statistics. Again, that’s fine, I just think it’s worth recognizing, as we tend to disparage anecdotal or theory-based reasoning in other empirical fields.

  10. jrc says:

    Did Rubin use matching and regression to analyze the efficacy of the potential-outcome framework for causal inference?

    – LaLonde, Robert J. “Evaluating the econometric evaluations of training programs with experimental data.” The American Economic Review (1986): 604-620.

    Did Efron perform a bootstrap of existing statistical analyses to demonstrate the empirical effectiveness of resampling?

    – Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. “How Much Should We Trust Differences-In-Differences Estimates?” The Quarterly Journal of Economics 119.1 (2004): 249-275.
    – Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller. “Bootstrap-based improvements for inference with clustered errors.” Review of Economics and Statistics 90.3 (2008): 414-427.

    … my point is just that, even if it has taken a while (I would guess at least partly due to lack of computing power), “experimental statistics” is coming into its own as a sub-field. On experimental statistics:

    “Experimental statistics carries out reproducible computational experiments, numerical or symbolic, that test falsifiable predictions from theoretical statistics about the performance on data of specified statistical procedures.”
