Ways of knowing

In this discussion from last month, computer science student and Judea Pearl collaborator Elias Barenboim expressed an attitude that hierarchical Bayesian methods might be fine in practice but that they lack theory, that Bayesians can’t succeed in toy problems. I posted a P.S. there which might not have been noticed so I will put it here:

I now realize that there is some disagreement about what constitutes a “guarantee.” In one of his comments, Barenboim writes, “the assurance we have that the result must hold as long as the assumptions in the model are correct should be regarded as a guarantee.” In that sense, yes, we have guarantees! It is fundamental to Bayesian inference that the result must hold if the assumptions in the model are correct. We have lots of that in Bayesian Data Analysis (particularly in the first four chapters but implicitly elsewhere as well), and this is also covered in the classic books by Lindley, Jaynes, and others. This sort of guarantee is indeed pleasant, and there is a long history of Bayesians studying it in theory and in toy problems. Arguably, many of the examples in Bayesian Data Analysis (for example, the 8 schools example in chapter 5) can be seen as toy problems. As I wrote earlier, I don’t think theoretical proofs or toy problems are useless, I just find applied examples to be more convincing. Theory and toys can be helpful in giving us a clearer understanding of our methods.

Ways of knowing

Why do I go on and on about this? I am interested in how we “know,” in this case how we decide to believe in the effectiveness of a statistical method. Here are a few potential sources of evidence in favor of a method:

– Mathematical theory (for example, coherence of inference or asymptotic convergence);

– Computer simulations (for example, demonstrating approximate coverage of interval estimates under some range of deviations from an assumed model);

– Solutions to toy problems (for example, comparing the partial pooling estimate for the 8 schools to the no pooling or complete pooling estimates);

– Improved performance on benchmark problems (for example, getting better predictions for the Boston Housing Data);

– Cross-validation and external validation of predictions;

– Success as recognized in a field of application (for example, our estimates of the incumbency advantage in congressional elections);

– Success in the marketplace (under the theory that if people are willing to pay for something, it is likely to have something to offer).

None of these is enough on its own. Theory and simulations are only as good as their assumptions; results from toy problems and benchmarks don’t necessarily generalize to applications of interest; cross-validation and external validation can work for some sorts of predictions but not others; and subject-matter experts and paying customers can be fooled.

The very imperfections of each of these sorts of evidence give a clue as to why it makes sense to care about all of them. We can’t know for sure, so it makes sense to have many ways of knowing.
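
To make the computer-simulation item in the list above concrete, here is a minimal sketch of a coverage check. It is only an illustration under assumptions of my choosing: the data are drawn from an Exponential(1) distribution (so the normal model is wrong), the interval is the usual normal-theory 95% interval for a mean, and the sample sizes, number of simulations, and seed are arbitrary.

```python
import random
import statistics


def coverage(n, n_sims=2000, seed=0):
    """Fraction of simulated datasets whose normal-theory 95% interval
    covers the true mean, when the data are Exponential(1), not normal."""
    rng = random.Random(seed)
    true_mean = 1.0  # mean of an Exponential distribution with rate 1
    hits = 0
    for _ in range(n_sims):
        y = [rng.expovariate(1.0) for _ in range(n)]
        m = statistics.fmean(y)
        se = statistics.stdev(y) / n ** 0.5
        if m - 1.96 * se <= true_mean <= m + 1.96 * se:
            hits += 1
    return hits / n_sims


# Assumed sample sizes: one small, one moderately large.
cov_small = coverage(10)
cov_large = coverage(200)
print("coverage at n=10: ", cov_small)
print("coverage at n=200:", cov_large)
```

A check like this says something only about the deviations that were actually simulated, which is the sense in which simulations are only as good as their assumptions.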

Progress! Bayesian methods have moved from plaything to practical tool

Go back in time 50 years or so and read the discussions of Bayesian inference back then. At that time, there were some applied successes (for example, I. J. Good repeatedly referred to his successes using Bayesian methods to break codes in the second world war) but most of the arguments in favor of Bayes were theoretical. To start with, it was (and remains) trivially (but not unimportantly) true that, conditional on the model, Bayesian inference gives the right answer. The whole discussion then shifts to whether the model is true, or, better, how the methods perform under the (essentially certain) condition that the model’s assumptions are violated, which leads into the tangle of various theorems about robustness or lack thereof.

50 years ago one of Bayesianism’s major assets was its theoretical coherence, with various theorems demonstrating that, under the right assumptions, Bayesian inference is optimal. Bayesians also spent a lot of time writing about toy problems (for example, Basu’s example of the weights of elephants). From the other direction, classical statisticians felt that Bayesians were idealistic and detached from reality.

How things have changed! To me, the key turning points occurred around 1970-1980, when statisticians such as Lindley, Novick, Smith, Dempster, and Rubin applied hierarchical Bayesian modeling to solve problems in education research that could not be easily attacked otherwise. Meanwhile Box did similar work in industrial experimentation and Efron and Morris connected these approaches to non-Bayesian theoretical ideas. The key in any case was to use partial pooling to learn about groups for which there was only a small amount of local data.
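
For readers who have not seen partial pooling in action, here is a rough sketch using the 8 schools estimates and standard errors. It is deliberately simplified and should not be read as the full analysis: the group-level scale `TAU` is fixed at an assumed value of 5 (a full hierarchical analysis would average over uncertainty in it), and the overall mean is plugged in as a precision-weighted average. The function names are hypothetical.

```python
# Estimated treatment effects and standard errors for the 8 schools.
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]


def pooled_mean(y, sigma, tau):
    """Precision-weighted mean of the group estimates, given group-level sd tau."""
    w = [1.0 / (s ** 2 + tau ** 2) for s in sigma]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def partial_pool(y, sigma, tau):
    """Conditional posterior means of the group effects for a fixed tau > 0,
    plugging in the pooled mean for the overall mean."""
    mu = pooled_mean(y, sigma, tau)
    return [
        (yj / sj ** 2 + mu / tau ** 2) / (1.0 / sj ** 2 + 1.0 / tau ** 2)
        for yj, sj in zip(y, sigma)
    ]


TAU = 5.0  # assumed value, for illustration only

no_pooling = list(y)                           # each school on its own
complete_pooling = pooled_mean(y, sigma, 0.0)  # one common effect for all
mu_tau = pooled_mean(y, sigma, TAU)
partial = partial_pool(y, sigma, TAU)

print("complete pooling:", round(complete_pooling, 1))
for raw, est in zip(no_pooling, partial):
    print(f"no pooling {raw:>3}  ->  partial pooling {est:5.1f}")
```

Each partially pooled estimate lands between the school’s own estimate and the overall mean, and schools with larger standard errors are pulled further toward the mean. That is the sense in which groups with little local data borrow strength from the others.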

Lindley, Novick, and the others came at this problem in several ways. First, there was Bayesian theory. They realized that, rather than seeing certain aspects of Bayes (for example, the need to choose priors) as limitations, they could see them as opportunities (priors can be estimated from data!) with the next step folding this approach back into the Bayesian formalism via hierarchical modeling. We (the Bayesian community) are still doing research on these ideas; see, for example, this recent paper by Polson and Scott on prior distributions for hierarchical scale parameters.

The second way that Lindley, Novick, etc. succeeded was by applying their methods on realistic problems. This is a pattern that has happened with just about every successful statistical method I can think of: an interplay between theory and practice. Theory suggests an approach which is modified in application, or practical decisions suggest a new method which is then studied mathematically, and this process goes back and forth.

To continue with the timeline: the modern success of Bayesian methods is often attributed to our ability, using methods such as the Gibbs sampler and Metropolis algorithm, to fit an essentially unlimited variety of models: practitioners can use programs such as Bugs to fit their own models, and researchers can implement new models at the expense of some programming but without the need to continually develop new approximations and new theory for each model. I think that’s right (Markov chain simulation methods indeed allow us to get out of the pick-your-model-from-the-cookbook trap), but I think the hierarchical models of the 1970s (which were fit using various approximations, no MCMC) showed the way.

To get back to the discussion from last month: Of course Bayesian inference has “theoretical guarantees” of the sort that our correspondent Barenboim was looking for. Back 50 years ago, this theoretical guarantee was almost all that Bayesian statisticians had to offer. But now that we have decades of applied successes, that is naturally what we point to. From the perspective of Bayesians such as myself, theory is valuable (our Bayesian Data Analysis book is full of mathematical derivations, each of which can be viewed if you’d like as a theoretical guarantee that various procedures give correct inferences conditional on assumed models) but applications are particularly convincing.

Over the years I have become pluralistic in my attitudes toward statistical methods. Partly this comes from my understanding of the history. Bayesian inference seemed like a theoretical toy and was considered by many leading statisticians as somewhere between a joke and a menace, but the hardcore Bayesians persisted and got some useful methods out of it. Bootstrapping is an idea that in some way is obviously wrong (as it assigns zero probability to data that did not occur, which would seem to violate the most basic ideas of statistical sampling) yet has become useful to many and has since been supported in many cases by theory. Etc etc etc.
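
The remark about the bootstrap can be seen in a few lines of code. The sketch below uses the plain nonparametric bootstrap on a made-up sample (the numbers, the number of resamples, and the seed are all assumptions for illustration): every resample is drawn with replacement from the observed values, so no value outside the data ever appears, and yet the spread of the resampled means gives a usable standard error.

```python
import random
import statistics

# Hypothetical sample, made up for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]


def bootstrap_means(data, n_boot=2000, seed=1):
    """Resample the data with replacement and return the resampled means,
    along with the set of all values that ever appeared in a resample."""
    rng = random.Random(seed)
    n = len(data)
    means, seen = [], set()
    for _ in range(n_boot):
        resample = rng.choices(data, k=n)
        seen.update(resample)
        means.append(statistics.fmean(resample))
    return means, seen


means, seen = bootstrap_means(data)
boot_se = statistics.stdev(means)

print("only observed values were drawn:", seen <= set(data))
print("bootstrap standard error of the mean:", round(boot_se, 3))
```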


  1. K? O'Rourke says:

    Seems a lot like Bradford Hill’s list of some things to consider that might mislead you when entertaining causality in non-randomized studies.

    I don’t think there is any way to get by this “face validity” (makes sense) in answers to empirical questions.

  2. “Bootstrapping is an idea that in some way is obviously wrong (as it assigns zero probability to data that did not occur, which would seem to violate the most basic ideas of statistical sampling)”

    Huh. I’m not an expert in bootstrapping and as a pretty hardcore bayesian I am inclined to ignore it. But the sentence you’ve written here (assigning zero probability to data that did not occur) is exactly how Bayesian inference works, and IMO is why it’s obviously right. I must be misinterpreting your language.

  3. great post! What do you make of the Cox theorems and the like? How important are or were they in your decision to pursue Bayesian methods? They have been influential for me when deciding between what I think of as “ad-hoc” methods (choosing point estimates) and Bayesian methods for making predictions. But I have to say I don’t really understand how relevant they are to real-world data analysis problems, so possibly the Cox theorems are really a red herring?

    • konrad says:

      To me, the role of the Cox theorems is as a cornerstone of a (compelling but not necessarily unique) justification for Bayesian methods – not a red herring, but not something that has direct impact on methodology. Which means not everyone has to be interested in them (Andrew seems not to be).

  4. Tom Moertel says:

    Reading the past discussions, I didn’t get the impression that Elias Barenboim was questioning the theoretical support for Bayesian statistical methods. Rather, I believe he was questioning whether there was similar theoretical support for the use of those methods in causal applications and, in particular, whether that support led to straightforward guarantees comparable to those that follow from causal directed-graph approaches.

    • Andrew says:


      The (long) series of posts and comments began with a discussion I posted on the use of hierarchical models to extrapolate to new settings. I don’t think of hierarchical modeling and causal directed graphs as being competitors; rather, if someone wants to use causal directed graphs, I recommend they use hierarchical modeling to do partial pooling, rather than trying to decide or estimate a stark choice between no pooling and complete pooling.

  5. K? O'Rourke says:

    Missed the interesting history stuff:

    WG Cochran did partial pooling in his 1937 paper on the analysis of repeated agricultural experiments and Godambe told us once that his initial Phd thesis proposal (which would have been with Fisher, I believe) was based on the realization that “priors can be estimated from data!” but he abandoned it because it was already done by some early empirical Bayes author. But I think you are right about the provision of the formal Bayes hierarchical model and some clear examples providing the impetus.

    Not sure bootstrapping implies assigning zero probability to data that did not occur (even the non-smooth naive one) but rather it is a choice of an approximation of the unknown probability generating model distribution by (restricting to) a discrete distribution with exactly n points of increase (via method of moments estimation using 2*n -1 moments aka Von Mises Step Function approximation) from which the distribution of any function of possible sample paths on the sample space can be approximated by sampling with replacement. (When Rob Tibshirani first presented the bootstrap in a graduate course, when he was a post-doc, I asked if it wasn’t just a method of moments. Years later, in an email discussion with Peter Hall, he agreed that it was, but said he did not find it useful to think of it that way.)

    My other sense is that Rubin’s 1984 “Bayesianly justifiable and relevant” paper did not have anywhere near the impact it “should have”…

  6. Andrew,

    Tom hit it right on  target.
    I was asking whether there is similar theoretical support for the use of Hierarchical Bayes methods in CAUSAL applications, e.g.,  transporting experimental results across populations.

    You  took my question as critical of the entire enterprise of Bayesian inference, which would be the last thing on my mind, given the indisputable achievements of Bayesian methods in all aspects of AI (with which I am more familiar).

    Going back to our original discussion, we came very close to understanding when you assured me that “bias matters, it is a good idea to analyze”.

    I then asked at what point of the analysis bias should be analyzed, given that bias is the main concern in transportability problems and that we have not seen any bias analysis conducted in hierarchical modeling. Can we continue with this question?

    Here is my last posting:

    I am not sure however if you agree with Manski on the logical priority of bias analysis: He says “The study of identification logically comes first. It makes no sense to try to use a sample of finite size to infer something that could not be learned even if a sample of infinite size were available”.

    It seems to me that, if you agree with this priority then, in every causal inference task, HM researchers should be waiting for the results of identification analysis before starting the estimation phase. And, in such a case, they would be curious to find out what bias-analysis says about generalizability before applying any multi-level estimation.

    On the other hand, if you don’t agree with Manski’s priority, the question arises: WHEN, in your opinion, should bias be analyzed? Should it be after the estimation phase? Perhaps before estimation, but after glancing at the data? And, if it is analyzed, how? With simplifying assumptions or without? And which assumptions would you permit? Which would you forbid?

    Finally, I will add an argument in favor of “identification first — estimation second” which also covers the questions raised by Konrad [at that time]. Identification analysis not only answers the question “is an unbiased estimate possible”, but also provides us with an “estimand”, which should serve as the target in the estimation procedure.  Without the proper estimand, statistical estimation can be chasing after the wrong parameter without ever answering the research question at hand  (in our example we need the causal effect of X on Y in NYC).

    We heard many complaints here about  assumptions making the problem “easier”, “essentially solved”, “invalid”, “toy-like”, etc etc etc. Now, suppose our “easy” analysis yields a negative answer, i.e., non-transportable. Would including more realistic assumptions (e.g., ”measurement problems, people coming in and out of the sample, gaming the system, etc etc etc.”) make estimation feasible?

    A negative result from any of our examples means that no statistical method, no matter how sophisticated or revered, can estimate what needs to  be estimated. A negative result actually guarantees us that whatever you are trying to estimate is the wrong thing to estimate — should a researcher engage in estimation before checking the possibility that such negative verdict would be issued by the analysis of bias?

    Thus, the priority “identification first — estimation second” is  not a matter of convenience or personal preferences; it is a matter of technical necessity. I wonder whether you see this priority reflected in the practice of HM or, if it’s not happening now, whether you think it should be encouraged in the future.

    I am really curious about your take on these issues. Still, if you feel these questions would require too much of your time, I would be quite satisfied with a quick, yes-no answer on whether you agree with Manski’s priority.

    Thank you for all your patience,


    • Andrew says:


      Regarding transportability, you earlier wrote, “The only way investigators can decide whether ‘hierarchical modeling is the way to go’ is for someone to demonstrate the method on a toy example.”

      I disagree. Perhaps the only way that you can decide is by looking at a demonstration on a toy example, but other investigators use other criteria for deciding whether a method is “the way to go.” That is the point of my above post. I mentioned seven different ways of knowing. Toy examples are great—believe it or not, I’m still learning from the 8 schools example, nearly thirty years after I first heard about it—but, for many investigators who aren’t you, there are other ways of knowing.

      • Andrew,

        I have much to say on “toy examples” and “other ways of knowing” , but I would rather not divert attention from the more promising discussion we had, on the role of “bias analysis” in hierarchical methods.

        Any chance of addressing this question, especially Manski’s priority of identification over estimation?


  7. Tom Moertel says:

    If a method cannot be shown to be reliable for small problems, what reason has any person to trust that it will be reliable for larger problems?

    • Andrew says:


      I don’t know. Yet over 30,000 people have bought our books on Bayesian statistics and hierarchical modeling. This is not to say we’re correct—Malcolm Gladwell and Thomas Friedman are much bigger sellers!—but just to say that there are many ways of knowing. I think a lot of people are interested in our methods because they give convincing results in new and interesting examples. That’s no reason for you to read my book. But you might consider that there are different reasons to believe in the effectiveness of a statistical method. Make of that what you will.

      P.S. The classical normal-theory confidence interval typically works well for large problems (for example, a survey with n=1500) but is not so reliable for small problems (it completely falls apart when n=1, and just a few months ago we had a problem with n=75 where y=0 and so we used the Agresti and Coull method instead). Also, a lot of machine learning methods are said to work well with huge datasets but not so well on small problems. I like to think about small problems too, but I don’t see them as a dominant way of knowing.
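
      The n=75, y=0 case in the P.S. above can be worked through numerically. This is a minimal sketch of the standard formulas, not the code used in that project; the function names are hypothetical, z = 1.96 is assumed for a 95% interval, and clipping the adjusted interval to [0, 1] is an added convenience.

```python
import math


def wald_interval(y, n, z=1.96):
    """Classical normal-theory interval for a proportion: p_hat +/- z * se."""
    p = y / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)


def agresti_coull_interval(y, n, z=1.96):
    """Agresti-Coull interval: add z^2/2 successes and z^2/2 failures,
    then apply the normal-theory formula to the adjusted counts."""
    n_adj = n + z ** 2
    p_adj = (y + z ** 2 / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return (max(0.0, p_adj - half), min(1.0, p_adj + half))


wald = wald_interval(0, 75)
ac = agresti_coull_interval(0, 75)

print("normal-theory interval:", wald)
print("Agresti-Coull interval:", tuple(round(v, 3) for v in ac))
```

      With y=0 the estimated standard error is zero, so the normal-theory interval collapses to the single point 0, while the adjusted interval runs from 0 up to roughly 0.06.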

  8. konrad says:

    Is there agreement on what is meant by “toy problem”? I was under the impression a toy problem is one for which the “correct” model is known a priori – i.e. the problem is crisply stated and we know exactly which model assumptions to make. But comments above seem to imply that toy problems are ones that are smaller in some sense? (In terms of number of parameters? Or data set size? If anything, doesn’t a smaller data set make a problem _harder_ and therefore _less_ like a toy problem?)

    • Konrad,

      Model-correctness is only one ingredient of a  “toy problem”, the other is “knowing the correct answer in advance”.

      Example 1.
      Assume we have an exact, yet complex model of the US economy, with 100,000 variables and all relationships (i.e., arrows) correctly specified. This still does not allow us to check if a certain identification-estimation routine does a better job than another (in terms of advice to decision makers) unless we compare actual predictions.

      Example 2.
      Assume someone discovers a strong correlation between children’s reading ability and their thumb size. We know the answer in advance: taking reading classes will NOT hasten your thumb growth.
      Now imagine two competing methods of analysis, each saying: You give me data on X (reading), Y (thumb size) and Z (Age), a few assumptions about their relationships, and I will estimate for you the effect of X on Y. If method 1 comes up with ZERO effect, and method 2 with NON-ZERO effect, I would prefer method 1. Because I know the answer in advance.

      This is the kind of toy problem on which I had hoped to learn how hierarchical methods work. I have assumed that practitioners of any inference method would be interested in demonstrating to curious visitors how their method gets the right result on each toy problem presented.

      • konrad says:

        Example 1: When one has an exact model one can use simulation to measure the performance (e.g. any frequency properties you like) of any estimation routine.

        Example 2: I agree with the idea of doing such sanity checks, but they are typically also present in real world problems. It’s not what makes it a toy problem.

    • Andrew says:


      I’m not exactly sure how to define a toy problem. I think the 8 schools problem was originally not a toy, but it’s been studied so much that it’s become somehow toylike.


      Yes, I recognize that you (and some others) assume “that practitioners of any inference method would be interested in demonstrating to curious visitors how their method gets the right result on each toy problem presented.” To others, it’s more important to see how a method works on real problems. There are different ways of knowing. This may be something I’ve learned after nearly thirty years of doing statistics, or perhaps it has something to do with my social science background. I would never, for example, imagine “an exact, yet complex model of the US economy, with 100,000 variables and all relationships (i.e., arrows) correctly specified.” That sort of thing just means nothing to me. I don’t think the economy works that way.

      You are free to use whatever considerations you would like in choosing a statistical method, but it might be worth recognizing that many other people have other ways of knowing.

  9. konrad says:

    The proper term for someone claiming to have an exact, yet complex model of the US economy, with 100,000 variables and all relationships (i.e., arrows) correctly specified is “crank”:

  10. I just had to say, again, “Great Post!” I admire you so much for blogging this stuff. Engaging a grad student also deserves a lot of respect.

    • Andrew says:


      Thanks. Back when I was a grad student and then a young professor, I would become extremely extremely frustrated when various senior scholars would essentially refuse to let me engage with them. I’d go up after a talk and ask questions and they’d just smile amiably and dodge my questions. It made me want to scream. So I don’t want to be like that.

  11. Dear Andrew, Konrad, et al,

    I see that my asking to “imagine a correct yet complex model” has fallen on angry ears, it has even earned me the title “crank” — I accept. Likewise, it seems that I am the only guy on the block who enjoys learning by first solving toy examples and then scaling things up. 

    Therefore, I don’t think I can offer much for the remainder of this discussion. Still, if interested, I will be happy to demonstrate (privately) how “toy problems” like the one with children’s thumb-size, even scaled up to 100,000 variables, can be solved for fun (and perhaps for practical and real purposes).

    There remains only my humble question on whether Andrew agrees with Manski’s priority: identification first – estimation second.
    I promise not to rock the boat after this one.

    Appreciate all your patience and attention, 

    • Jared says:

      The idea behind the “crank” comment was that your example 1 is silly since we aren’t anywhere near a reality where we could tackle that problem (Konrad’s point as I read it). Incidentally, he wasn’t calling you a crank unless you are actually claiming to have such a model in hand. I don’t think anyone has “angry ears”, they just aren’t seeing the point.

      In my view a good toy problem is a *simplification* of the reality we find ourselves in (see Andrew’s 8 schools example, or causal graphs with a half dozen variables). The really good ones give some correct intuition as to how methods might perform on more realistic examples. Your example with 100,000 variables isn’t a good toy problem, it’s mathematical masturbation: Fun for some, and an important step in the development of math, mathematicians and the scientists – including statisticians – who rely on them. But you can’t really expect applied statisticians to engage with it or accept that causal inference methods must perform well when we have correctly specified a graph on 100,000 variables in a system as complex as the US economy. Similarly, the applied statisticians shouldn’t expect mathematicians/theoretical computer scientists/etc to focus on methods which work on problems with real-world scale/complexity/whatever. I think this was implicit in Andrew’s post.

      Finally, neither of your examples seem like a candidate for hierarchical modeling which suggests to me that you haven’t taken Andrew up on his suggestion to familiarize yourself with that (substantial) literature and instead continue to demand new worked examples. I’d double down on his suggestion if you really want to make progress in understanding the methodology, instead of just having arguments in blog comments.

      • Dear Jared,

        You wrote: “neither of your examples seem like a candidate for hierarchical modeling”. This is the most informative answer I got so far from this 150-msg long discussion.

        I had thought all along that these simple examples would be easy for hierarchical modeling to solve, and I was hoping to incorporate powerful “partial pooling” methods into my transportability problems. False hopes.

        You summarized it crisply and I now understand that there are certain problems that hierarchical modeling either cannot tackle, or is not ready yet to tackle. This may explain why the problems were dismissed by so many readers  as “toy”, “unrealistic”, “too simplified”, “theoretical”, “mathematical”, and more.

        There were only one or two voices on this blog saying: Hey, if we can’t solve such easy problems, how can we hope to solve “real life” problems of the same nature (namely, transferring experimental results across populations)? To those voices I say, let’s continue by email.

        Back to the main track, there remains only my humble question on whether Andrew agrees with Manski’s priority: identification first – estimation second.


        • Andrew says:


          This is just rude. You introduced the term “toy problem” into the discussion. I do not believe that I or anyone else used that term except as a reply to you.

          On one hand, your being combative has some benefits, such as that it irritates me enough to provoke multiple responses. On the other hand, I find it irritating for you to first introduce the concept of the toy problem, then say how important you believe toy problems are, and then to claim that others are dismissing your problems as “toy.” (And, “unrealistic” is not a dismissal either, it’s a description of, e.g., a claim that you could have “an exact, yet complex model of the US economy, with 100,000 variables and all relationships (i.e., arrows) correctly specified.”)

          If you do not want to use hierarchical models or partial pooling, that is your decision. The leading statisticians of the twentieth century were Neyman, Pearson, and Fisher. None of them used partial pooling or hierarchical models (well, maybe occasionally, but not much), and they did just fine. Meanwhile, other statisticians will use hierarchical models to partially pool as a compromise between complete pooling and no pooling. It is a big world, big enough for Fisher to have success with his methods, Rubin to have success with his, Efron to have success with his, and so forth.

          If you do not choose to use a particular method, whether it be hierarchical modeling or bootstrapping or anything else, it would be polite for you to simply say so, to say that you do not wish to use a method, that it makes you uncomfortable, whatever. This makes more sense than saying that a method that you do not choose to use “either cannot tackle, or is not ready to tackle” some problems.

          Just in case you have missed this: hierarchical modeling describes certain classes of probability models. Such models can be used in various contexts to combine information across different groups. This can be done in the context of any statistical method that uses probability models. Or, if you are not using probability models (as in some machine-learning procedures that do not specify a data-generating process), there can be partial-pooling procedures that approximate the outcomes of hierarchical modeling without actually specifying a model.

          If you are interested in my priorities in identification and estimation, feel free to read my two books and infer what you will.

          • Dear Andrew, 

            I am sorry that my last message got you “irritated” and I wish to assure you that my intention was not to be disrespectful. I will try to explain as much as possible my point and potential sources of misunderstanding in our communication. 

            1. Toy problems 

            You are right.
            It was me who introduced the term “toy problem” into the discussion as a useful device to see how methods work. In computer science “toy problem” means a simple instance depicting interesting behavior, in which we can understand every aspect of the problem being analyzed; only then do we scale things up to more complex situations. This was how I first introduced the term.

            Somehow, it later received a negative connotation, together with labels such as “unrealistic”, and worse. One needs to be careful – agree. (BTW, the 100,000 variables US economy was an example of a NON-toy-problem, because one does not know what the answer is in advance.)

            2. HM for transporting experimental information 

            The toy example that I was most concerned about was a three-variable example in which we conduct a randomized trial in Los Angeles (LA) and estimate the causal effect of treatment X on outcome Y for every age group Z = z. We now wish to generalize the results to the population of New York City (NYC), but we find that the age distribution in NYC is significantly different than that in LA. The goal is to estimate the causal effect of X on Y in NYC. (No randomized experiment can be conducted in NYC.) 
            I am trying to understand whether HM is geared to handle such cases because the problem before us requires that we infer “treatment effect” in a new and different population, on which no experiments can be conducted (meaning, only non-experimental samples are available there). 

            My current understanding, translated to the HM terminology, is that “no pooling” / “partial pooling” are not applicable here since we do not have samples from experiments in NYC; “complete pooling” is also not an alternative since the populations are different. 

            So, we are dealing here with a mixture of two problems: (1) combining information from two populations (usually without many samples); (2) inferring causal effects from non-experimental study. 

            As you suggested before, it seems that the methods are not competitors. This may in fact be a fertile ground for a symbiosis, since HM excels on (1) and DAGs can handle (2). This is just my current intuition. 

            I hope to have dissolved some misunderstandings in our communication, 

            Appreciating the opportunity,

  12. konrad says:


    Wow, I disappear for a few days and _this_ happens. I wasn’t calling you a crank – as explained by Jared: my point was that it’s not very useful to imagine an example which couldn’t exist (because only a crank could claim to actually have such a model). I am not under the impression that you would ever make such a claim. I’m sorry if my comment offended you.

    Actually I _am_ in favour of starting with toy models, provided one doesn’t lose sight of the real problems to which they will eventually be applied. There is a danger (or rather, a Grand Tradition – e.g. in traditional AI, but _not_ in causal modeling) of focussing too much effort on details that are unimportant in real problems while building a framework that is not useable in practice because it cannot be extended to deal with details that _are_ important in practice. That does not imply that toy problems are not useful, just that one has to bear in mind throughout how they will eventually be extended to realistic problems.

    Re the LA,NYC example – I made several suggestions in the other thread, including ways to expand the problem (which has to be the next step – there is not very much that can be done with the problem as stated, because we are not given enough information to judge which model assumptions are reasonable) – but you don’t seem to like the suggestion that progress can/should be made by expanding the problem statement. To summarise the discussion of that problem:

    – your paper shows what can be done if complete pooling is a sensible model assumption.
    – it was pointed out that complete pooling is probably not a sensible assumption: before using it, one should demonstrate empirically that it is reasonable.
    – it was pointed out that, if complete pooling is a bad assumption, partial pooling may still be a reasonable one – in this case it is still possible to improve over no pooling.
    – as the problem is currently stated, the data required to see if partial pooling is a reasonable assumption is not available: we don’t know enough about the system under investigation to proceed. This is why the problem statement needs to be expanded.

    A key point which I’m not sure you’re agreeing with is that questions about transporting information are questions about which model assumptions to make – and that these are _empirical_ questions.

  13. Dear Konrad,

    I appreciate your offer to expand our discussion on transportability, and under normal circumstances I would have liked to pursue your ideas. However, at this point, I feel that doing so in public might, perhaps, cause more discomfort than challenge in this group. So, if you understand my position, I would prefer to do it by email and, once we establish a common nomenclature, we can perhaps return to the community with a meaningful description of the challenge.

    And just to avoid potential misunderstandings from others who follow this thread, I can assure you that:
    (1) we are not doing “complete pooling”;
    (2) what we are doing is a delicate mixture of “complete pooling”, “no pooling”, and “partial pooling”, each conducted as needed on a different part of the data;
    (3) most importantly, the aim of pooling in the HM literature is orthogonal to the problem at hand, since our problem is not concerned, at least in the first stage, with estimation but with the identification of the target quantity.

    (I don’t have your email, but my messages contain my name/address, if you’re interested in exchanging ideas about these issues.)


  14. K? O'Rourke says:


    This is what Elias should be doing: “(2) what we are doing is a delicate mixture of ‘complete pooling’, ‘no pooling’, and ‘partial pooling’, each conducted as needed on a different part of the data.”

    But in that paper of his, he seems to be doing just complete pooling within age strata and no pooling between strata but rather poststratified weighting. Vocabulary is very poor in this area and it’s likely Andrew who made the labels “complete pooling”, “no pooling”, and “partial pooling” more common.

    My vocabulary was common parameter (justifies complete pooling – just of that common parameter!), arbitrary parameter (only no pooling justifiable to everyone but Richard Peto) and common in distribution parameter – taking the parameter as being randomly drawn from a distribution with common parameters i.e. exchangeable – (justifies partial pooling).
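
    As a rough illustration of this vocabulary (not taken from any of the papers discussed here), the three assumptions can be written down for a normal model with known standard errors. The function name `pooled_estimates`, the numbers in the example, and the choice to treat the between-group standard deviation `tau` as a fixed, known value are all assumptions made for the sketch; a full hierarchical analysis would average over uncertainty in `tau` rather than plug in a single value.

```python
import numpy as np


def pooled_estimates(y, se, tau):
    """Compare the three pooling assumptions for group estimates y with
    known standard errors se. tau is an assumed, fixed between-group sd."""
    y = np.asarray(y, dtype=float)
    se = np.asarray(se, dtype=float)

    # Common parameter: every group shares one effect, so use the
    # precision-weighted mean of all groups (complete pooling).
    w = 1.0 / se**2
    complete = np.sum(w * y) / np.sum(w)

    # Arbitrary parameter: groups tell us nothing about each other,
    # so each group keeps its own estimate (no pooling).
    none = y.copy()

    # Common-in-distribution parameter: theta_j ~ Normal(mu, tau^2).
    # Estimate mu with marginal precision weights, then shrink each
    # group toward it (partial pooling, conditional on tau).
    w_mu = 1.0 / (se**2 + tau**2)
    mu_hat = np.sum(w_mu * y) / np.sum(w_mu)
    shrink = se**2 / (se**2 + tau**2)  # weight placed on mu_hat
    partial = shrink * mu_hat + (1.0 - shrink) * y

    return complete, none, partial


# Hypothetical group estimates and standard errors, for illustration only.
y_example = [10.0, 2.0, -4.0, 6.0]
se_example = [3.0, 2.0, 4.0, 2.5]
complete, none, partial = pooled_estimates(y_example, se_example, tau=3.0)
print("complete pooling:", complete)
print("no pooling:      ", none)
print("partial pooling: ", partial)
```

    With `tau = 0` the partial-pooling estimates collapse to the complete-pooling estimate, and as `tau` grows they approach the no-pooling estimates, which is one way to see partial pooling as the intermediate case.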

    The literature is hard to read so it’s probably a good idea to read it very carefully.
    (I was lucky to learn this in first year undergrad when I mistakenly enrolled in a third year philosophy course whose primary text was Wittgenstein’s Tractatus and I missed the drop date as well as the previous 2000+ years of philosophy it presumed.)

    (apologies – no spell checker conveniently available)

    • Andrew says:


      When I was a student, they called it “shrinkage.” I much prefer the term “partial pooling.”

    • JSB says:

      Highly germane to the issue of terminology in the transportability discussion: “What can be said at all can be said clearly, and what we cannot talk about we must pass over in silence.” Wittgenstein
      On the whole, I found this thread to be stimulating, but also a reminder of why many people relegate discussion of causation to the “silence.”

    • Keith,

      You are absolutely right, all three types of “pooling” should be conducted (“complete pooling”, “no pooling”, “partial pooling”), each on a different part of the data, so as to achieve bias-free estimates. If HM offers a methodology for accomplishing this, the symbiosis can be considered done.

      (Recall: zero bias requires explicit causal assumptions, something we have not touched on in this conversation.)


  15. […] 2. It’s fine that Larry’s favorite methods are used in biostatistics and at Google and Yahoo. I’ve heard that biostatisticians and software companies also use Bayesian methods, maximum likelihoods, chi-squared tests, etc. Lots of methods are useful. The fact that somebody somewhere uses a method doesn’t mean it’s optimal or even a good thing to do in general, but it provides some positive evidence. […]