Straining on the gnat of the prior distribution while swallowing the camel that is the likelihood. (Econometrics edition)

Jason Hawkins writes:

I recently read an article by the econometrician William Greene of NYU and others (in a 2005 book). They state the following:

The key difference between Bayesian and classical approaches is that Bayesians treat the nature of the randomness differently. In the classical view, the randomness is part of the model; it is the heterogeneity of the taste parameters, across individuals. In the Bayesian approach, the randomness ‘represents’ the uncertainty in the mind of the analyst (conjugate priors notwithstanding). Therefore, from the classical viewpoint, there is a ‘true’ distribution of the parameters across individuals. From the Bayesian viewpoint, in principle, there could be two analysts with different, both legitimate, but substantially different priors, who therefore could obtain very different, albeit both legitimate, posteriors.

My understanding is that this statement runs counter to the Bernstein-von Mises theorem, which in the wording of Wikipedia “assumes there is some true probabilistic process that generates the observations, as in frequentism” (my emphasis). Their context is comparing individual parameters from a mixture model, which can be taken from the posterior of a Bayesian inference or (in the frequentist case) obtained through simulation. I was particularly struck by their terming randomness as part of the model in the frequentist approach, which to me reads more as a feature of Bayesian approaches that are driven by uncertainty quantification.

My reply: Yes, I disagree with the above-quoted passage. They are exhibiting a common misunderstanding. I’ll respond with two points:

1. From the Bayesian perspective there also is a true parameter; see for example Appendix B of BDA for a review of the standard asymptotic theory. That relates to Hawkins’s point about the Bernstein-von Mises theorem.

2. Greene et al. write, “From the Bayesian viewpoint, in principle, there could be two analysts with different, both legitimate, but substantially different priors, who therefore could obtain very different, albeit both legitimate, posteriors.” The same is true in the classical viewpoint; just replace the word “priors” by “likelihoods” or, more correctly, “data models.” Hire two different econometricians to fit two different models to your data and they can get “very different, albeit both legitimate” inferences.

Hawkins sends another excerpt from the paper:

The Bayesian approach requires the a priori specification of prior distributions for all of the model parameters. In cases where this prior is summarising the results of previous empirical research, specifying the prior distribution is a useful exercise for quantifying previous knowledge (such as the alternative currently chosen). In most circumstances, however, the prior distribution cannot be fully based on previous empirical work. The resulting specification of prior distributions based on the analyst’s subjective beliefs is the most controversial part of Bayesian methodology. Poirier (1988) argues that the subjective Bayesian approach is the only approach consistent with the usual rational actor model to explain individuals’ choices under uncertainty. More importantly, the requirement to specify a prior distribution enforces intellectual rigour on Bayesian practitioners. All empirical work is guided by prior knowledge and the subjective reasons for excluding some variables and observations are usually only implicit in the classical framework. The simplicity of the formula defining the posterior distribution hides some difficult computational problems, explained in Brownstone (2001).

That’s a bit better but it still doesn’t capture the all-important point that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood.

And this:

Allenby and Rossi (1999) have carried out an extensive Bayesian analysis of discrete brand choice and discussed a number of methodological issues relating to the estimation of individual level preferences. In comparison of the Bayesian and classical methods, they state the simulation based classical methods are likely to be extremely cumbersome and are approximate whereas the Bayesian methods are much simpler and are exact in addition. As to whether the Bayesian estimates are exact while sampling theory estimates are approximate, one must keep in mind what is being characterised by this statement. The two estimators are not competing for measuring the same population quantity with alternative tools. In the Bayesian approach, the ‘exact’ computation is of the analysts posterior belief about the distribution of the parameter (conditioned, one might note on a conjugate prior virtually never formulated based on prior experience), not an exact copy of some now revealed population parameter. The sampling theory ‘estimate’ is of an underlying ‘truth’ also measured with the uncertainty of sampling variability. The virtue of one over the other is not established on any but methodological grounds – no objective, numerical comparison is provided by any of the preceding or the received literature.

Again, I don’t think the framing of Bayesian inference as “belief” is at all helpful. Does the classical statistician or econometrician’s logistic regression model represent his or her “belief”? I don’t think so. It’s not a belief, it’s a model, it’s an assumption.

But I agree with their other point that we should not consider the result of an exact computation to itself be exact. The output depends on the inputs.

We can understand this last point without thinking about statistical inference at all. Just consider a simple problem of measurement, where we estimate the weight of a liquid by weighing an empty jar, then weighing the jar with the liquid in it, then subtracting. Suppose the measured weights are 213 grams and 294 grams, so that the estimated weight of the liquid is 81 grams. The calculation, 294-213=81, is exact, but if the original measurements have error, then that will propagate to the result, so it would not be correct to say that 81 grams is the exact weight.
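To see the propagation concretely, here is a minimal simulation sketch in Python; the 2-gram standard deviation per weighing is a made-up number for illustration:

```python
import numpy as np

# Hypothetical setup: each weighing has independent normal error with an
# assumed standard deviation of 2 grams.
rng = np.random.default_rng(0)
sigma = 2.0
true_jar, true_total = 213.0, 294.0

jar = true_jar + rng.normal(0, sigma, 100_000)     # repeated noisy weighings
total = true_total + rng.normal(0, sigma, 100_000)
liquid = total - jar                               # the "exact" subtraction

print(round(liquid.mean(), 2))  # ~81: the arithmetic itself is exact and unbiased
print(round(liquid.std(), 2))   # ~2.83 = sqrt(2^2 + 2^2): the input error propagates
```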

19 thoughts on “Straining on the gnat of the prior distribution while swallowing the camel that is the likelihood. (Econometrics edition)”

  1. It’s well to remember that Bayes’ theorem does not refer to a two-step “update” process. It refers to one and the same data set, which can be partitioned in different ways. The way I look at the use of a prior distribution is this. Using Andrew’s simple measurement of weights, suppose we actually had data from a previous measurement. Then we know how to combine the two, and how to compute their combined variance, etc. No mystery there.

    But suppose we knew there were previous data but we didn’t know exactly what they were. For example, suppose we knew the measured weight difference but not the measurement error. One thing we could do is to assume what that previous measurement error was and then proceed with the calculation as before. There’s your prior distribution. (A numerical sketch of this appears at the end of this comment.)

    So there could be said to be some “belief” involved, but it’s a belief (or assumption) that a certain model of the previous measurement distribution is close to reality. This viewpoint seems to me to be fully in line with what Andrew has written.

    If you didn’t have that previous measurement data but you had some idea about the subject (e.g., you “know” that a soccer ball doesn’t weigh less than 200 g), you could assume some distribution consistent with that knowledge and combine that with your data as if it were real data. It’s still no more than a model of what a “real” measurement might have returned, and in fact it’s a model that you would assign smaller weight to since it’s vaguer.

    It’s models all the way down!
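    Here is the numerical sketch promised above (all numbers hypothetical): a normal prior combines with the data exactly like one extra measurement, by inverse-variance weighting.

    ```python
    def combine(m1, sd1, m2, sd2):
        """Precision-weighted combination of two independent normal estimates.

        This is also the posterior mean/sd for a normal likelihood with a
        normal prior: the prior enters just like a previous measurement.
        """
        w1, w2 = 1 / sd1**2, 1 / sd2**2
        mean = (w1 * m1 + w2 * m2) / (w1 + w2)
        sd = (w1 + w2) ** -0.5
        return mean, sd

    # "Previous measurement" (or prior): 78 +/- 5 g; new data: 81 +/- 2.83 g.
    print(combine(78.0, 5.0, 81.0, 2.83))  # ~(80.3, 2.46): the sharper source dominates
    ```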

  2. > from the classical viewpoint, there is a ‘true’ distribution of the parameters across individuals. From the Bayesian viewpoint, in principle, there could be two analysts with different, both legitimate, but substantially different priors, who therefore could obtain very different, albeit both legitimate, posteriors.

    Two different Bayesian analyses can indeed produce different probabilities for the (true value of the) parameter being contained in some interval, for example.

    Two different “classical” analyses will never arrive at different probabilities. That doesn’t mean that their answers agree, though! It’s just because they don’t ever produce any probability at all for the (true value of the) parameter being contained in some interval.

    • Two classical/frequentist analyses can easily arrive at different probabilities for other important things though, like the probability that a given interval contains the true parameter.

      • I wouldn’t say that classical/frequentist analyses provide probabilities for important things.

        In particular, they don’t provide a probability that a given interval contains the true parameter. The question “what is the probability that the true parameter is contained in the interval [1 2]?” doesn’t make sense in a frequentist setting and I don’t think you can get a better answer than “it either does or it doesn’t – there is no probability there”.

        What they can do is provide an interval and provide the answer to the question “if you were to construct intervals in other cases using the same method that has just produced the interval [0.243 1.124] in the present case, how often would the true parameter be contained in those intervals?” But I’m not sure that’s an important thing.
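        That “how often” question is at least easy to check by simulation. A minimal sketch, with normal data, known sd, and a made-up true mean and sample size:

        ```python
        import numpy as np

        rng = np.random.default_rng(1)
        true_mu, sigma, n = 0.7, 1.0, 25

        covered = 0
        for _ in range(10_000):
            x = rng.normal(true_mu, sigma, n)
            half = 1.96 * sigma / np.sqrt(n)
            covered += x.mean() - half <= true_mu <= x.mean() + half

        print(covered / 10_000)  # ~0.95: a property of the procedure across
                                 # repetitions, not of any one realized interval
        ```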

        • “In particular, they don’t provide a probability that a given interval contains the true parameter.”
          Fine, but what if it did provide that probability? How much better would it be if I told you that the prob. of some outcome is 33%? At the end of the day, the event happens or doesn’t happen. There is no 33% of anything (Bayesian or not).
          How is it different from the freq. interpretation of ‘it either contains the parameter, or not’?
          Might be useful when you compare two probabilities, but ultimately the understanding of the phenomenon of probability is the same.
          The outside objective world doesn’t care that there is such a thing as Bayesian vs. freq. interpretation.

        • “At the end of the day, the event happens or it doesn’t happen”

          Bayesian probabilities about parameters are not events in the physical sense; they don’t “happen.” The mass of an electron in kg is some precise value. Knowing that there’s a 33% chance that it’s below value x means that when trying to predict stuff you should probably use values around x and a little below. Knowing that there’s a 0.00000001 chance that it’s above another value y means you can pretty well ignore values around or above y when trying to figure out plausible outcomes from some collision, for example.

          The purpose of Bayesian probability is to have a mechanism for calculating plausibilities from other plausibilities.

        • >> “In particular, they don’t provide a probability that a given interval contains the true parameter.”

          >Fine, but what if it did provide that probability?

          If classical/frequentist analyses did provide those probabilities then they would be Bayesian analyses.

          They would start with pre-data probabilities and incorporate the data probabilities to end up with post-data probabilities.
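          A toy discrete version of that pre-data-to-post-data step (made-up numbers), i.e. Bayes’ theorem:

          ```python
          # Pre-data probabilities for two hypotheses, and P(data | hypothesis):
          prior = {"H1": 0.5, "H2": 0.5}
          likelihood = {"H1": 0.8, "H2": 0.2}

          # Post-data probabilities: prior times likelihood, renormalized.
          evidence = sum(prior[h] * likelihood[h] for h in prior)
          posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
          print(posterior)  # {'H1': 0.8, 'H2': 0.2}
          ```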

      • Jonathan, Carlos is right. What you are calling “the probability that a *given* interval contains the true parameter” is in frequentist analysis actually “the frequency with which intervals from theoretical future experiments using a random number generator as a stand-in for the experimental apparatus will contain the true parameter value.”

        It doesn’t actually say anything at all about the *given* interval, which either does or doesn’t contain the “true parameter”. Nor does it say anything much about the real experiment, which either is or isn’t well approximated by a random number generator, and typically we don’t know which is the case, though often the random number generators chosen are not particularly good at approximating repeated behavior of real experiments either.

        • We’ve been over this numerous times before. Of course, everything you say is technically correct, but I still say if you have one sample and one analysis and one 95% confidence interval for a parameter, there is a 95% chance that this particular interval is one of the lucky ones that contains the true parameter (with all the other caveats about the additional assumptions in the analysis) – it’s not the “answer” to any meaningful question, but it’s not nothing either. What I fear is that the correct interpretation of the confidence interval ends up saying that you’d be just as well off not having the interval as having it. If you misuse it, then I’d agree, but I don’t think it contains zero information.

        • Dale, it unfortunately depends a lot on the test etc.

          Many tests people use are equivalent to Bayesian tests with some flat prior; if your interval is one that is equivalent to a Bayesian interval, then yes, it may contain useful information. However, it is possible for confidence interval procedures to give you confidence intervals that provably do not contain the parameter. For example, you may have a parameter for a weight, and because of weirdness in the confidence procedure and the particular measurement errors, the confidence interval produced is entirely negative, which is logically impossible. (A sketch of this appears at the end of this comment.)

          So it just depends, and usefulness is definitely often due to the fact that the procedure has a Bayesian interpretation. Bayesian intervals always exclude regions where it’s logically impossible for the parameter to be (if you’re using a logically consistent prior that assigns zero density to those regions).
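          Here is the sketch mentioned above: a hypothetical standard z-interval for a weight (a parameter that cannot be negative) that lands entirely below zero after one unlucky noisy reading. All numbers are made up.

          ```python
          sigma = 10.0         # assumed measurement sd, in grams
          measurement = -25.0  # an unlucky but possible noisy reading of a tiny weight

          lo, hi = measurement - 1.96 * sigma, measurement + 1.96 * sigma
          print((lo, hi))  # approximately (-44.6, -5.4): provably excludes every
                           # value the weight could take, yet the procedure still
                           # covers 95% of the time on average
          ```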

        • It is easy to construct pathological examples in which confidence intervals can’t contain the real parameter, but I bet it’s much more difficult to find a situation in which anybody in their right mind used and took seriously such an interval.

        • Christian,

          Andrew likes to use the “beautiful parents have more girls” type stuff as examples. There’s lots and lots of background information telling us that the swings in the girl/boy ratio in births that haven’t been selected by artificial means are like a couple of tenths of a percent in the highest cases, and yet frequentist results led that researcher to estimates 100x higher, if I remember correctly. It’s the same kind of thing.

        • Christian:

          What Daniel said. Another example is extremely optimistic claims about the effectiveness of early childhood intervention. Respected researchers really do take these intervals seriously, even though this represents a fundamental statistical misunderstanding.

          Respected researchers also take Bayesian intervals seriously without reflecting on the unverified and often inappropriate assumptions underlying them.

        • I actually don’t agree with Daniel and Andrew here. Sure there are confidence intervals that don’t make any sense. I’ve even had the “honor” of using one myself and being ridiculed – it was a confidence interval that contained zero for the amount of fraudulent claims when an audit had already found an amount > 0. But the calculated confidence interval contained zero – and the judge in the case noticed the discrepancy and it undermined my testimony. My bad. But I failed to bridge the gap between what the calculated interval looked like and the way a lay person would interpret it. The interval still reflected the fact that the evidence was very noisy and so the estimate was highly uncertain.

          I think the objections to confidence intervals go too far in two ways. First, they too often end up with a declaration of how Bayesian statistics makes sense whereas frequentist methods do not. Second, there are always examples where the interval can be shown to be impossible and/or incorrectly used. But, as Christian suggests, the perverse examples need not be misused. I could have avoided my embarrassing episode easily without rejecting the estimated interval completely. Isn’t there a distinction between a method that is not correct and the incorrect use of a methodology?

          > The interval still reflected the fact that the evidence was very noisy and so the estimate was highly uncertain.

          You assumed “if my model is true,” when you had already observed it was not. The estimate/interval from a wrong model doesn’t mean anything.

          There must be models with likelihoods consistent with (or even incorporating) the audit results.

        • Anoneuoid – That is a good point and I agree with you. I don’t think that renders the classical frequentist interval meaningless, but certainly indicates a flaw in the model that I used and a needed improvement. By the way, that particular case involved my analysis of an even poorer analysis that was done based on the audit data. So, my analysis was an imperfect improvement on a pretty poor analysis. That doesn’t excuse it, but the context is important. Even if I used the audit data in my model (e.g., full Bayesian approach), there would still be assumptions and disputes as to what can be said based on the audit.

        • I was referring to “However it is possible for confidence interval procedures to give you confidence intervals that provably do not contain the parameter.” That’s different from confidence intervals that look wildly unrealistic but are not provably off for formal reasons such as negative CIs for parameters that can only be positive.

  3. I still believe it would be most helpful to use the term “Bayesian” for the Bayesian way of computing, involving priors and Bayes’ Theorem, and to distinguish this from the meaning of probabilities, which can be epistemic or aleatory, with various flavours of both of these, most if not all of which can be meaningfully used together with a Bayesian analysis. This would particularly imply that Bayesian statistics is *not* identified with a subjectivist concept of probability, but neither specifically with any other one. Bayesian statistics is not necessarily about beliefs, but it can very well be about beliefs, and is about beliefs rather often.

    Also, there may or may not be a true parameter in Bayesian analysis, depending on which probability interpretation it is run with.

    By the way, of course Bayesian analyses using epistemic probability are approximate in the sense that hardly ever can a case be made that the specified model, including prior and likelihood, fits the individual’s belief (or “objective” prior knowledge) precisely; in most situations it rather obviously doesn’t.

    On the other hand, the frequentist researcher (who by the way may run a Bayesian analysis anyway) of course never knows the truth, and any given information plus data are always compatible with several models, meaning that for sure two different frequentist analyses can arrive at different conclusions when based on different models that are both compatible with the existing evidence. The only thing that is unique in frequentism is the assumed truth, which is assumed but never known, and can as such be used in a rather flexible manner (which many frequentists don’t like to admit, but others may see as a feature rather than a bug).

    • By the way, similarly I think it is helpful to distinguish frequentism as an interpretation of probability from ways of running inference such as hypothesis testing. I like the distinction between “inverse probability logic” (which ascribes probabilities to hypotheses/models/parameters) and “compatibility logic” (which investigates with what models/parameters/hypotheses the data are compatible) in this respect.
