More philosophy of Bayes

Konrad Scheffler writes:

I was interested by your paper “Induction and deduction in Bayesian data analysis” and was wondering if you would entertain a few questions:

– Under the banner of objective Bayesianism, I would posit something like this as a description of Bayesian inference:

“Objective Bayesian probability is not a degree of belief (which would necessarily be subjective) but a measure of the plausibility of a hypothesis, conditional on a formally specified information state. One way of specifying a formal information state is to specify a model, which involves specifying both a prior distribution (typically for a set of unobserved variables) and a likelihood function (typically for a set of observed variables, conditioned on the values of the unobserved variables). Bayesian inference involves calculating the objective degree of plausibility of a hypothesis (typically the truth value of the hypothesis is a function of the variables mentioned above) given such an information state.

We are free to calculate probabilities conditioned on different information states and use these to argue that one information state corresponds more closely than another to a given real-world (i.e. not formally specified) information state. In this case, the probability calculations are part of Bayesian inference but the subsequent argument is not.

Alternatively we may calculate p-values conditional on an information state (via posterior predictive checking) and use them to draw conclusions about the degree to which the information state is informative about the real world. Again, the calculation forms part of Bayesian inference, but the interpretation of the p-values and the decision about which formal information states to investigate next do not.

In this view the scientific process is informed by, but does not exclusively consist of, statistical analysis. The statistical analysis is objective, but the rest of the process is not.”

I would not have thought this type of description should be particularly controversial, but as you point out the popular view seems to focus exclusively on subjective Bayesianism and I’m not sure where I would even find a similar description of the objective Bayesian viewpoint. What do you think?

– I am puzzled by your comments on coherence: you seem to be using the term to refer to any decisions not prescribed by probability theory (i.e. the part of the scientific process which I described above as being outside the ambit of statistical analysis), and you criticize all statistical approaches for not fully prescribing how to do science. Does my strategy of only claiming coherence for the part I am calling statistical analysis fix the problem?

– Regarding your question on the continuous/discrete distinction, doesn’t it make more sense to instead distinguish between numerical and categorical variables (where the former are defined on an ordered set and the latter on an unordered set)? Then discretising a continuous description (as we do in practical computation anyway) does not represent a qualitative change. But when a numerical variable can be replaced with a categorical variable without changing the model (i.e. the order of the numbers doesn’t matter) you can consider it an index variable. Or am I missing your point here?

– A technical point: you claim that no prior distribution can completely reflect prior knowledge. This may be true for humans, but as I understand it the point of the “robot” formulation used by Jaynes is to circumvent exactly this objection. Jaynes discusses ball-and-urn examples where the idea is that the prior distributions are exact representations of the robot’s knowledge state. He further envisions finding such “objective” priors (that completely represent realistic knowledge states) for much more complex problems. Would you accept this argument?

My reply:

1. I’m not so interested in expressing “the plausibility of a hypothesis.” I understand the appeal of such calculations but they usually don’t make much sense to me. The marginal probability of data given model, p(y|M), typically depends strongly on aspects of the prior distribution that have essentially no impact on posterior inferences given the model. In practice this seems to lead to arbitrary rules. I guess what I’m saying is that the concept of “plausibility” isn’t so clearly defined.
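
Here is a minimal conjugate sketch of the kind of dependence I mean (my own toy example: a single observation y ~ N(theta, 1) with prior theta ~ N(0, A^2)); widening the prior scale A barely moves the posterior for theta but changes p(y|M) by orders of magnitude:

y <- 1.3                                   # a single observation
for (A in c(10, 100, 1000)) {              # widen the prior sd
  post_var  <- 1 / (1 + 1 / A^2)           # posterior variance of theta
  post_mean <- post_var * y                # posterior mean of theta
  marg_lik  <- dnorm(y, mean = 0, sd = sqrt(1 + A^2))   # p(y | M)
  cat(sprintf("A = %5g: posterior N(%.3f, %.3f), p(y|M) = %.2e\n",
              A, post_mean, post_var, marg_lik))
}

The posterior for theta settles near N(1.3, 1) no matter what, while p(y|M) shrinks roughly in proportion to 1/A, so any rule built on p(y|M) is driven by the essentially arbitrary choice of A.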

2. “Coherence” can mean different things. In a Bayesian context, inferences are coherent if they are consistent with some single overarching probability distribution. Here’s an example of incoherence: we model some data with a normal distribution but if the rate of outliers exceeds some threshold, we switch to a t distribution. That’s not coherent. The coherent thing would’ve been to start with the t distribution (if necessary, with some prior distribution that favored a large number of degrees of freedom). In practice, though, we can’t do everything at once; we uncork the larger model only when it seems needed. Hence the incoherence.
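
To illustrate what continuous expansion looks like here, a quick grid sketch of my own (location and scale fixed at 0 and 1, and a purely illustrative gamma prior on the degrees of freedom nu), instead of switching models by hand:

set.seed(1)
y <- rt(100, df = 4)                                # fake heavy-tailed data
nu <- exp(seq(log(1), log(100), length.out = 60))   # grid for degrees of freedom
loglik   <- sapply(nu, function(v) sum(dt(y, df = v, log = TRUE)))
logprior <- dgamma(nu, shape = 2, rate = 0.1, log = TRUE)   # prior favoring largish nu
logpost  <- loglik + logprior
post     <- exp(logpost - max(logpost))             # unnormalized grid posterior
nu[which.max(post)]                                 # posterior mode of nu

With normal-looking data the posterior for nu drifts toward the large end of the grid and the fit is essentially the normal model; with heavy-tailed data it concentrates at small nu, all within a single coherent model.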

3. In practice, yes, we distinguish between ordered and unordered variables. When I was discussing continuous and discrete variables in my article, my point was that, on the one hand, I was criticizing discrete model averaging and promoting continuous model expansion instead, but on the other hand, if you discretize a model finely enough, it’s not clear where discrete model averaging ends and continuous modeling begins. Ultimately it comes down to setting up a reasonable joint prior distribution on the parameters at different levels of the model. What frustrates me is when people talk glibly about model averaging, without even recognizing this difficulty.
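
To spell out the contrast in my own notation (not from the article): discrete model averaging forms predictions as a weighted sum over a list of models, $p(\tilde{y} \mid y) = \sum_k p(M_k \mid y)\, p(\tilde{y} \mid y, M_k)$, while continuous expansion integrates over a parameter, $p(\tilde{y} \mid y) = \int p(\tilde{y} \mid y, \theta)\, p(\theta \mid y)\, d\theta$. As the grid of models $M_k$ is made finer, the sum approaches the integral, so everything hinges on the joint prior placed (often implicitly) on the $M_k$ and their parameters.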

4. I’m not familiar with Jaynes’s robot argument so I can’t comment on it. My ideas on posterior predictive checking have been much influenced by Jaynes’s approach of boldly writing down a full model, prior and all (in contrast, statisticians, even Bayesians, tend to be embarrassed about their priors) and then, where the data contradict the model, taking that as a recognition that there is important information not yet in the model, and fixing the model to include that information.

Scheffler responds:

On item 1 above, I’m not quite sure what your point is here – the prior (which I think is best considered to be part of the model) may or may not have a strong effect on a given posterior inference. This is why I think it’s important to emphasize that posterior probabilities are conditioned on the model (this seems not to be emphasized in subjective Bayesianism).

The concept of plausibility is defined very clearly (to me, at least) by Jaynes in the early chapters of his 2003 book, and it leads to rules that are definitely not arbitrary. I would recommend reading it since it seems generally consistent with your approach, and since you like other aspects of his approach.

Regarding item 2, I agree that analysing a single data set with different distributions applied to different points is incoherent, but I haven’t seen anyone do this. I don’t think it’s incoherent to use different distributions for different data sets, unless you are assuming that they are sampled from the same underlying distribution (in which case you could equally well consider them to be part of the same data set). I also don’t think it’s incoherent to switch to a better model after discovering that it is better, provided you analyse the full data set with that model. So I’m not convinced that what people do in practice is incoherent.

My quick replies:

My point in 1 is that I don’t trust the marginal probabilities, Bayes factors, etc., that come from most statistical analyses because they depend on aspects of the model that have essentially no impact on posterior inferences given the model. In practice this seems to lead to arbitrary rules.

I’m not familiar with Jaynes’s idea of plausibility but if it’s like his other work, I’m guessing it centers on defining models using strong assumptions that are clearly stated, which I like.

Regarding item 2, my example did not involve analyzing a single data set with different distributions applied to different points. I was talking about the very common procedure of using a single model for all the data points, but choosing or rejecting the model based on how it fits the data. Reasonable practice but not coherent.

10 thoughts on “More philosophy of Bayes”

  1. I second the suggestion to read Jaynes’ 2003 book. You might also like a paper by the Fields medalist David Mumford called “The Dawning of the Age of Stochasticity.” It’s available from his faculty page at Brown if you Google for it, and very worth the read. Mumford cites Jaynes’ plausibility derivation (which really is due to Cox) and he had a quote that I’ve always liked:

    “We may summarize this result as saying that probabilities are the normative theory of plausibility, i.e., if we enforce natural rules of internal consistency on any homespun idea of plausibility, we end up with a true probability model.”

  2. That’s good, but it can be turned round. Once you learn that plausibility lends itself to a probability formulation, then the idea of plausibility itself looks redundant, or at least secondary.

    (The “Cox” referred to above is of course not me, but R.T. Cox, not even a relation.)

  3. Interesting. My quick thoughts:

    ‘…the truth value of the hypothesis is a function of the variables mentioned above’

    Since we’re already talking about Jaynes, I think he would have been the first to point out that this seems to be a case of the mind-projection fallacy. The truth value of a hypothesis is 0 or 1 and isn’t a function of anything except the state of reality. What you seem to mean here is the probability for the hypothesis, not the same thing.

    ‘no prior distribution can completely reflect prior knowledge’

    I think this applies equally to humans and robots (is there really a difference?). The uniform distribution when rolling a dice, for example, can not be justified or derived until we give details of the dice and the method of rolling it. To completely encode our knowledge requires both specifying the problem and assigning a prior.

    ‘concept of plausibility is defined very clearly’

    My understanding is that the plausibility A|B is arbitrary, and P(A|B) is an arbitrary function of the plausibility, but numerical values and rules of combination can nonetheless be determined for the probabilities from logical considerations. Maybe this is not in conflict with the concept being well defined, though.
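
    (For reference, a loose statement of the result being appealed to here: if a real-valued plausibility $(A|B)$ satisfies a few consistency desiderata, then some monotone rescaling $p$ of it must obey the product and sum rules, $p(AB \mid C) = p(A \mid BC)\, p(B \mid C)$ and $p(A \mid C) + p(\bar{A} \mid C) = 1$, so the plausibility itself is determined only up to that rescaling.)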

    • On page 13 of PTLOS Jaynes writes:

      “At this point, a logician might object to our notation, saying that the symbol A has been defined as standing for some fixed proposition, whose truth cannot change; so if we wish to consider logic functions, then instead of writing C=f(A,B), we should introduce new symbols and write z = f(x,y), where x, y, z are ‘statement variables’ for which various specific statements A, B, C may be substituted… instead of a statement variable we use a variable statement.”

      A charitable reading of Scheffler’s use of the term “hypothesis” is as a ‘variable statement’ in Jaynes’s sense. For example, for selected a, b in R and unknown theta in R, the truth value of the proposition “a < theta < b" is in some sense a function of the value of theta.

  4. Thanks for posting this, Andrew. If I may continue the discussion a little:

    “I don’t trust the marginal probabilities, Bayes factors, etc.” – but these _are_ posterior inferences under the model. Like any other posterior inference, they may or may not be relevant to a particular question at hand; when they are, it is important to check whether they are sensitive to aspects of the model specification of which we are uncertain. Also, “hypothesis” refers to _any_ hypothesis – the hypothesis of interest may be whether or not the model itself reflects reality, but often it is whether or not a particular model parameter lies in a particular range. Inferring the posterior distribution of the value of a parameter is the same thing as evaluating the probabilities of a set of such hypotheses. (When the parameter is continuous, you just take the limit as the ranges go to zero and the number of hypotheses goes to infinity.)
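
    As a toy check of that limiting statement (my numbers, and an assumed normal posterior for theta):

    post_mean <- 1.3; post_sd <- 0.7    # an assumed N(1.3, 0.7^2) posterior for theta
    a <- 0.5; w <- c(1, 0.1, 0.001)     # shrinking interval widths
    prob <- pnorm(a + w, post_mean, post_sd) - pnorm(a, post_mean, post_sd)
    cbind(width = w, prob = prob, density_times_width = dnorm(a, post_mean, post_sd) * w)

    As the width shrinks, the probability of the hypothesis "a < theta < a + width" approaches the posterior density at a times the width, which is the sense in which the posterior density just packages infinitely many such hypothesis probabilities.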

    But this is not really related to my original point: I was assuming that we are agreed on interpreting probabilities as measures of plausibility or degree-of-belief (the usual Bayesian interpretation) rather than as physical properties of the system being described (the usual frequentist interpretation), and stating a preference (in the Bayesian context) for emphasising that we are conditioning on a knowledge state and that this knowledge state should be specified explicitly (rather than left implicit, as in subjective Bayesianism). In particular I was trying to emphasize that the model itself is part of this knowledge state. Posterior predictive checking then amounts to exploring the consequences of an assumed knowledge state, with the aim of deciding whether this knowledge state is a good approximation of our actual (not formally specified) knowledge state. This decision (is it a “good” approximation?) is subjective and therefore I prefer to think of it as _not_ being part of the inference process.

    Re Jaynes’s plausibility idea: it’s not about defining models, but about defining what we mean by “probability”. The assumptions are indeed clearly stated, but turn out to be no stronger than (in fact, equivalent to, but arguably more intuitive than) the Kolmogorov axioms.

    Re item 2: In that case I think we are essentially agreed here, except for what we choose to label “inference”. In the interpretation I am advocating, the step you are calling incoherent (choosing or rejecting the model without having first specified the set of all models under consideration, with accompanying model priors) is not part of statistical inference. I prefer to call it subjective rather than incoherent; either way, it is this subjectivity/incoherence which makes me want to declare it outside the scope of “inference”.

  5. @Nick: agreed, we don’t want to work (i.e. perform calculations) with a plausibility function – its use is as a concept to aid interpretation of probabilities. In a Bayesian setting this ought to be uncontroversial (I think?); my aim was more to focus on the idea that we can condition on any specifiable information state (rather than just ones we can justify as reflecting the knowledge of actual humans).

    @tom: “What you seem to mean here is the probability for the hypothesis”: yes, I meant probability or plausibility, not truth value – thanks for the correction.

    “‘no prior distribution can completely reflect prior knowledge’: I think this applies equally to humans and robots (is there really a difference?)”: Yes there is a difference – there are many knowledge states that are reflected exactly by a prior distribution, and those are exactly the knowledge states that we can usefully work with. But for any such knowledge state it may be the case that it is not held by any human, in which case its use would be hard to justify in a subjective Bayesian setting. The point of the robot is that we can assign _arbitrary_ knowledge states to it, regardless of whether those knowledge states reflect our own knowledge.

    “The uniform distribution when rolling a dice, for example, can not be justified or derived until we give details of the dice and the method of rolling it.” – No, Jaynes justifies it via the principle of indifference. Details of the experimental setup are relevant when known, but when the set of possible outcomes is the only available information we get the uniform distribution.

    “To completely encode our knowledge requires both specifying the problem and assigning a prior.” – Even if details of the problem are unspecified we still have an information state; Jaynes demonstrates cases where we know exactly what that information state is (loosely, states of ignorance correspond to weak priors).

    “the plausibility is arbitrary”: Yes, non-unique but nonetheless clearly defined. The point is that the argument is a mathematical one, not just some fuzzy appeal to intuition.

  6. It’s definitely true that Jaynes followed the approach Gelman associates with him in a big way, not only in traditional Statistical analysis but in Physics as well. For example, it’s well known that if there is a relevant Integral of the Motion of a physical system that we don’t know about, then it would potentially invalidate predictions from Statistical Mechanics.

    Many thoughtful people viewed this as a major problem for Statistical Mechanics. Jaynes viewed it as a major bonus, since we could make predictions from Statistical Mechanics and see if they agree with reality. If they don’t, then that is major evidence for the presence of unknown constraints such as a new Integral of the Motion. Historically, the initial inklings for Quantum Mechanics came from just such a process (Quantization providing just such a hidden constraint).

    It should be noted though that Jaynes had no problem with Bayes factors. There’s an entire chapter on Model Selection in his book and he would use them adroitly whenever he felt the need.

    I think it’s true that if:

    Q_g = {set of all statistical questions Gelman wants answered}

    Q_bf = {set of all statistical questions whose answer involves Bayes Factors}

    Then the intersection of Q_g and Q_bf is basically empty. Jaynes, on the other hand, wasn’t working in the social sciences and ran across more problems where Bayes Factors were relevant.

    • Along these lines, I think Appendix B of Jaynes’ 2003 book was one of the most beautiful things that I ever read, and it certainly rekindled my own love for probability after being browbeaten with formal probability theory for two years in grad school. He (I think successfully) defends the idea that what matters is just whatever works. There’s no special reason to care about sigma algebras or measurability unless they work. If Bayes factors faithfully correspond to reality, then use them. When they don’t, don’t use them. Of course, mathematics can have an aesthetic appeal and this can serve perfectly well as a justification to study it. But that doesn’t mean that a probabilist is required to organize results into aesthetically pleasing generalizations. I think we missed out on a good 20-30 years of progress in probability theory in the mid 1900s because folks thought that the mathematical formalism intrinsically mattered, and that you could never say something like "assume f(x) is smooth enough for me to do the manipulations I’m about to do" in a real paper. Having to chase down what those smoothness constraints will be may have no bearing on the work at all, especially when most of the paper’s subsequent integrals will just be Riemann or something silly like that. Yet people were expected for a long time to spend their precious creative energy doing just that.

  7. About switching from a normal data model to a t data model: Isn’t it the case that the t model is much more encompassing, that is, it gives non-negligible probabilities to a much larger variety of data sets than the normal model? So we can just use the t model all the time: if the data “looks normal” then the result will not be too different from what a normal model would give, while the opposite is very far from true.

    Distance between models can be measured by the Kullback-Leibler divergence, which is asymmetric. Let’s say our model is (density) g, while the data are generated by (density) f. Let the observation be X. Then the log-likelihood ratio for the “true model” f against the assumed model g is log(f(X)/g(X)), whose expectation under f is the KL divergence $KL(f, g) = \int f(x)\log(f(x)/g(x))\,dx$,

    which is easy to calculate in R (the KL.div function is given in the next comment):
    > KL.div(dnorm, dcauchy, lower=-35, upper=+35)
    [1] 0.2592445
    > KL.div(dcauchy, dnorm, lower=-35, upper=+35)
    [1] 9.207638

    So the distance of a normal observation from the Cauchy (t_1) model is small, while the distance of a Cauchy observation from the assumed normal model is very large! Or in other words: with Cauchy data we will reject a normal model, while with normal data we will not (expect to) reject a Cauchy model.

  8. Something happened with the piece above and the definition of KL.div got eaten. Here is the R function again:

    KL.div <- function(f, g, lower=-Inf, upper=+Inf, ...) {
      # numerically integrate f(x) * log(f(x)/g(x)) over [lower, upper]
      integrate(function(x) f(x)*log(f(x)/g(x)), lower=lower, upper=upper, ...)$value
    }
