Bayesian inference is not what you think it is!

Bayesian statistics means different things to different people. To non-statisticians, Bayes is about assigning probabilities to scientific hypotheses. For example, one summary of moderately-informed opinion says:

“Bayesian inference uses aspects of the scientific method, which involves collecting evidence that is meant to be consistent or inconsistent with a given hypothesis. As evidence accumulates, the degree of belief in a hypothesis ought to change. . . . proponents of Bayesian inference say that it can be used to discriminate between conflicting hypotheses: hypotheses with very high support should be accepted as true and those with very low support should be rejected as false.” — Wikipedia article on “Bayesian inference”

It may come as a surprise to some readers that this is not how I view Bayesian inference at all!

In my work, I treat Bayesian methods as a souped-up least squares or maximum likelihood, a way to perform better inference within a model. For example, when modeling the rates at which New York City police stopped people of different ethnic groups (Gelman, Fagan, and Kiss, 2007), we did not attempt to compute the probability of the hypothesis that police stopped blacks, hispanics, and whites at the same rate; rather, we estimated these different probabilities and assessed how they varied for different types of crimes and in different parts of the city. In estimating social networks using survey data (Zheng, Salganik, and Gelman, 2006), we did not estimate our degree of belief in the hypothesis that people formed social ties at random; instead, we fit a model in which people varied in their social networks, and compared our fitted model to predictions from the simpler model. And so on. I have worked on hundreds of applied research projects, but I don’t know that I’ve ever accepted a hypothesis as “true” (as suggested is appropriate in the Wikipedia quote above). Conversely, I would not reject a model just because it is “false”! False models help us learn about the world; that’s what much of statistics is about (as in the famous quote of Box and Draper, 1987, that “all models are wrong, but some are useful”). Or else, I don’t know what all that stuff in classical statistics about maximum likelihood from the Poisson distribution, etc etc, is all for.

I’m not trying to argue that the Wikipedians’ interpretation is wrong, just that their view focuses on what seems to me to be a small part of what Bayesian statistics is about. It also represents a view of the philosophy of science with which I disagree, but this review is not the place for such a discussion. What is relevant here–and, again, which I suspect will be a surprise to many readers who are not practicing applied statisticians–is that what is in Bayesian statistics textbooks is much different from what outsiders think is important about Bayesian inference, or Bayesian data analysis.

See here for more, in the context of a book review from 2007.

Also relevant is our 2013 article on philosophy and the practice of Bayesian statistics.

It’s a funny thing to be applying a method that otherwise well-informed people have such a misguided view about.

53 thoughts on “Bayesian inference is not what you think it is!”

  1. I’ll bite. I’m not a Bayesian (at least not in practice since I lack the background), but the only part of the quoted statement that seems correct to me is “As evidence accumulates, the degree of belief in a hypothesis ought to change.” The rest just seems wrong. Rather than starting with a hypothesis to then match evidence against (which seems to be the NHST method, just with a hypothesis of no effect), I thought Bayesian models start with a prior probability distribution, based on existing knowledge. It has always been a bit fuzzy to me exactly what qualifies as this prior, but it makes philosophical sense that the prior of no effect is not the best starting point. I’m not sure I get Andrew’s description of his research as “we estimated these different probabilities and assessed how they varied for different types of crimes and in different parts of the city.” This makes it sound like there was no prior involved, but wasn’t there? Is the issue whether the prior counts as a “hypothesis” or not? That confuses me, but strikes me as a semantic matter.

    I don’t know where the piece of accepting hypotheses with strong support as “true” came from, but that certainly doesn’t match anything I have read on this blog. AI should be able to do better than Wikipedia here (or perhaps it is AI that is providing that Wikipedia entry?).

    • I don’t think you can use Bayes’ rule without a prior, but you can just select a flat prior, denoting you are quite unsure of the shape of the distribution. I agree that whether it counts as a hypothesis is pure semantics, though. I assume that, rather than a flat prior, you could have a null hypothesis formalised as a prior, and your posterior distribution would be shrunk towards 0 as a result, right?

    • I think of the priors as weights on the likelihoods. And what Bayes’ rule does is normalize a given weight*likelihood to the sum of all the others you are currently considering. That denominator can change as new possibilities, or evidence affecting the other priors/likelihoods, come up, even if they do not affect the numerator values.

      For D = data and H[0:n] = hypotheses 0:n:

      p(H[0] | D) = p(H[0]) * p(D | H[0]) / sum( p(H[0:n])*p(D | H[0:n]) )

      It is pretty basic stuff, but if you look at the simplified form it seems mysterious:

      p(A | B) = p(A) * p(B | A) / p(B)

      In that case you wonder WTF is p(B)? And once the maths folk insert infinitesimals it starts looking even more mysterious. Then you can start deducing weird stuff since they implicitly assumed infinitely precise data is possible, so introduced a contradiction. That is great as a computationally efficient approximation though.

      • Forgot to mention. The first form also makes clear what a uniform prior does: It makes all the priors cancel out to simplify the problem.

        And ignoring certain hypotheses is just setting them to zero in the denominator. Same as if you have y = x/(x + .00001*x), and x is only known to +/- .1*x. May as well drop it from the denominator. And this is indeed what you use as a heuristic in everyday decisions I bet.
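
The "weights on the likelihoods" reading of Bayes' rule above can be sketched in a few lines. This is only an illustration over a finite list of hypotheses; the function name `posterior` and the numbers are made up for the example and are not from the thread.

```python
def posterior(priors, likelihoods):
    """Return p(H[i] | D) for each hypothesis i.

    priors[i]      is p(H[i])
    likelihoods[i] is p(D | H[i])
    The denominator is the sum of prior * likelihood over every
    hypothesis currently under consideration.
    """
    weights = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical numbers: three hypotheses, one data set.
likelihoods = [0.1, 0.4, 0.7]
informed = posterior([0.5, 0.3, 0.2], likelihoods)

# With a uniform prior the prior terms cancel, so the posterior is
# just the likelihoods rescaled to sum to one.
uniform = posterior([1 / 3, 1 / 3, 1 / 3], likelihoods)
rescaled = [l / sum(likelihoods) for l in likelihoods]
```

Adding a fourth hypothesis changes `total`, and therefore every posterior value, even though the numerators for the first three are untouched.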

      • David:

        There are a few more steps to Bayesian statistics!

        You mention two of the steps of Bayesian inference conditional on a model–“Model your unknowns as random variables,” and “Condition on the data”–but you forgot one more step which comes between your two steps; this is, “Model the data-generating process as a probability distribution.”

        There are also other steps of Bayesian statistics that go beyond inference conditional on a model. These include model checking, model understanding, model expansion, and model averaging. Also post-processing for predictions, causal inference, and decision analysis.

        That’s a lot. Which is why we have to keep writing books on the topic!

        • “Model the data-generating process as a probability distribution” doesn’t seem something that distinguishes Bayesian statistics from alternative approaches. And arguably writing a suitable model is part of step 1 – the “unknowns” are the parameters of that model.

        • Andrew: As Carlos said, you appear to be listing things that are part of statistics in general (or implied by what I wrote). I was explaining how someone who knew some statistics (e.g., took a frequentist course in school) should do Bayesian statistics. Of course, you need a model for your data and you need to do whatever you can to check your model. I assumed from what Dale wrote that he knows some probability and statistics.

  2. As someone just getting into Bayesian methods, I can’t say I have ever been interested in testing the “truth” of a model but rather evaluating the model’s ability to explain the observed data and make informed guesses about its underlying data generating mechanism. I actually find Bayes factors to be a somewhat contrived way of aping p-values (at least, the way I’ve seen them used in practice).

    Generally, I think the strength of Bayesian statistics is in the extremely flexible way it is able to model any given phenomenon, including informed guesses about its prior. Having working knowledge about the form of the posterior distribution is very useful, in my experience.

    That all said, when doing any kind of prediction, I reach first and foremost for regular old frequentist methods. Bayesian stats is slow when you have a lot of data and parameters!

    • Robin:

      Regarding your final paragraph: It depends what you’re trying to predict. For small-area estimation (that is, estimating quantities for which the data are locally sparse), I think a Bayesian approach or the equivalent is the way to go.

  3. Here’s my (probably naive) way of thinking about all this. Suppose we make a measurement of some quantity, say its length. We know the variance of the measurement. We find a previous measurement with (of course) a different mean. We know how to combine them by weighting by their inverse variances.

    Now consider that we don’t have that previous measurement but we have some way of estimating what it would have been. We can combine that estimate with our new one, using the estimated mean and variance instead of actual measurements.

    The next step is to realize that the estimated measurement is best looked at as drawn from some distribution rather than a point measurement. Note that we are making an assumption – that’s a model – about that measurement (that is, what it would have been had we been able to make it). We wouldn’t have to make that assumption if we had the actual data but we don’t. So this is the best we can do.

    Looked at this way, there is virtually no difference (in principle!) between a frequentist and a Bayesian approach. You need to make some progress in the absence of complete data, and this is a way to do so.
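
The inverse-variance weighting described in this comment can be written out as a short sketch. The function name `combine` and the example numbers are hypothetical. The arithmetic is the same whether the second input is an actual earlier measurement or an estimate of what it would have been, which is the point being made; with normal distributions this is also the posterior mean and variance.

```python
def combine(mean_a, var_a, mean_b, var_b):
    """Combine two estimates of the same quantity by inverse-variance weighting.

    Each estimate is weighted by its precision (1 / variance); the
    combined variance is the reciprocal of the summed precisions.
    """
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    mean = (w_a * mean_a + w_b * mean_b) / (w_a + w_b)
    var = 1.0 / (w_a + w_b)
    return mean, var


# Hypothetical lengths: an earlier measurement (or a prior guess at it)
# of 10.0 with variance 4.0, and a new measurement of 12.0 with variance 1.0.
mean, var = combine(10.0, 4.0, 12.0, 1.0)
print(mean, var)
```

The combined mean sits closer to the more precise input, and the combined variance is smaller than either input variance.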

  4. I’ve always thought of Bayesian inference as nothing more than a flexible, intuitive way to derive a distribution over the potential outcomes of a decision. If that’s wrong, I don’t want to be right.

  5. I know you fight the good fight against both the Philosophical Bayesian (“Bayesianism is just consistent rational thinking”) and the Hypothesis Testing Bayesian (ie the Wikipedia quote in the OP). But when you’re done with any analysis you have a set of beliefs. Those beliefs have at least some form of uncertainty attached to them, unless we adopt the purely semantic convention that the belief includes the uncertainty about the belief, in which case what I’m calling “set of beliefs” is only the non-uncertainty part.

    Both the Hypothesis Testing Bayesian and the Philosophical Bayesian are, like you, also left with beliefs and uncertainties about those beliefs, and they turn out to be identical, holding the priors and data as fixed!

    Even the textbook NHST frequentist ends up with beliefs and a sort of deracinated uncertainty about those beliefs, and the beliefs very often coincide, particularly when the methodologies use maximum likelihood and the likelihood function is reasonably symmetric, although the uncertainty about the beliefs often looks very different.

    But isn’t all science (indeed all argumentation) ultimately about: (a) here is what I believe; (b) here is my evidence; and (c) here is the methodology I use to turn evidence into beliefs? It’s not clear to me at this level of abstraction you have much of a distinction at all with the Wikipedian Bayesian or the Philosophical Bayesian. What distinction you have is simply the sort of beliefs you are willing to countenance.

    • Jonathan:

      I think of science as a combination of experimentation, theorizing, and deduction from data and theories. Scientists are people, and as people we have beliefs, emotions, tastes, ways of looking at the world, etc., so, yeah, these are all important parts of science too. Nothing particularly Bayesian about that, though.

  6. I think there’s a useful analogy here with consequentialist v. deontological theories of morality. The deontological view (or family of views) is that there are right and wrong actions (or righter/wronger actions), and one ought to do the right thing, where rightness is an intrinsic property of the action and not a function of the consequences of the action. The consequentialist view (or family of views) is that there are good and bad outcomes (or better/worse outcomes), and one ought to take the action which promotes the best outcome.

    Similarly, in statistics one may take the view that the optimal analysis of a data set is the one that implements the right algorithm (or meta-algorithm), where rightness is an intrinsic property of the algorithm and not a function of how the algorithm performs. Alternately, one may take the view that the optimal analysis of a data set is the one that tends to perform best.

    Almost every applied statistician takes the second view (call it “statistical consequentialism”). Meanwhile, non-statisticians interested in Bayesian reasoning (e.g., philosophers) frequently take the first view (call it “deontological statistics”). Their view is that Bayesian inference represents rationality as such, so failure to use Bayes is inherently irrational.

    As with morality, these two views may be largely observationally equivalent, in the sense that in the vast majority of situations, the same (or an approximately equivalent) course of action is recommended by both approaches. But where you stand is revealed by the cases where they diverge. In moral philosophy we elicit this with thought experiments, while in statistics we elicit this by considering data generating mechanisms in which Bayes (or how Bayes would ordinarily be used) performs poorly (in the usual frequentist senses).

    I believe you’ve used this definition of Bayesian statistician in the past, Andrew—someone who uses Bayes even when it’s inappropriate. While this is not the most interesting debate in the world, I think this is where statisticians who identify as frequentists tend to disagree with Bayesians—as a frequentist you are free to use whatever procedure provides the performance characteristics you seek under the assumed model for the data generator. Often that will be something Bayesian, or Bayesian-equivalent, but sometimes it will not, and a procedure that performs well in such situations isn’t any worse for not being Bayesian.

    So you get a version of the consequentialist v. deontological debate in statistics, but the latter group is more pragmatic (actually caring about performance) than their philosophical brethren.

    • > a procedure that performs well in such situations isn’t any worse for not being Bayesian.

      What would be an example of a situation where a frequentist procedure is not equivalent to something Bayesian but better?

      • This would be one example:

        https://normaldeviate.wordpress.com/2012/10/11/the-robins-ritov-example-a-post-mortem/

        There is something of a general principle here, namely that if you have a non-Bayesian estimator you like, you can usually rig the Bayesian apparatus to output something more or less the same with enough creativity. But these reverse-engineered implementations of Bayes are in some cases not ones any Bayesian would have been inclined to implement except to mimic the non-Bayesian estimator. That would be a case where I would say Bayes fails but frequentism does not (because there exists something with the desired frequentist properties).

        • This example is pretty old and we had a bunch of discussions about it back when it came out, none of which I remember the details of. I wasn’t convinced that Bayes failed in any sense in this problem. Having reviewed it for 10 mins or so, it seems that the Frequentist makes a demand of the Bayesian that the Bayesian simply doesn’t have to follow. The failure to know theta means that the X is not informative for P(y=1) and the Bayesian can simply ignore Theta and X and come up with a perfectly fine posterior for Y=1 from the data on Y. The “demand” that the Bayesian learn theta in the process and the “gotcha” that this is impossible is like “demanding” that an engineer learn the atomic-level crystal structure of a beam before pronouncing a distribution over the likely strength. No thanks not gonna bite.

          Of course I may be misremembering the problem. You can probably do a web-search of this blog to find a lot of discussion on this problem if you like. Pretty sure both Carlos and I had things to say about it.

        • Thanks for the pointer. Looking at the problem statement at https://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/ it seems that the multidimensional X and censored Y are red herrings.

          One can show the same “failure” of Bayesian statistics with the much simpler problem of an infinite number of biased coins. You randomly select a few and flip them. (If one can randomly pick a real number in [0 1] one can surely also pick a coin randomly from that countably infinite set.) What’s the expectation of heads for a new random selection and flip?

          A frequentist may take the sample average.

          Wasserman’s Bayesian can only calculate the average of an infinite number of priors and that number won’t change after observing a few outcomes.

          Other Bayesians may notice that the problem can be addressed differently. The labels of the coins are irrelevant and they only need to model the distribution of the “bias” over the whole set of coins. The observed outcomes are informative regarding that distribution. There are multiple ways of creating a meaningful finite parametrisation of that density function, like dividing the [0 1] domain into segments or with a basis of functions. The solution to the problem they are asked to solve is the first moment of that distribution.

          They may then be accused of “frequentist pursuit”, forcing an answer with good frequentist behavior, reverse-engineering, etc. but that seems unfair.

        • This morning a friend sent me Dan Simpson’s blog post from a couple years back, where he shows that the fully Bayesian solution to the problem gets you essentially the estimator that Wasserman suggests https://dansblog.netlify.app/posts/2022-11-12-robins-ritov/robins-ritov

          Since the J is a finite but large count, you don’t have the problem of averaging over an infinite set of priors, and you can just use your posterior distribution to estimate the quantity of interest to Wasserman and you get a very closely related estimate to the H-T estimator.

        • The more I think about Wasserman’s example the less I understand what the point is.

          It seems that his Bayesian analysis makes some strong assumption. If the assumption is right, the estimator which doesn’t depend on the data is fine and good frequentist behavior won’t give us a better estimator. If the assumption is unwarranted, relaxing it is the natural way to improve the model. Doing so is not forcing the Bayesian answer to be the frequentist answer.

          He writes that “θ:[0,1]^d -> [0,1]” and “To do a Bayesian analysis, we put some prior W on Θ.” He doesn’t give more details on what such a prior looks like as far as I can see but later he writes:

          “To see that the likelihood has no information, consider a simpler case where θ(x) is a function on [0,1]. Now discretize the interval into many small bins. Let B be the number of bins. We can then replace the function θ with a high-dimensional vector θ = (θ_1, …, θ_B). With n less than B, most bins are empty. The data contain no information for most of the θ‘s.”

          The multidimensional nature of the original problem is indeed a distraction and the discrete case is much easier to reason about than the continuous one so let’s rewrite the setup statements above for B bins as he does.

          The function θ:{1, …, B} -> [0,1] can now simply be written as B parameters θ_j with values in [0,1].

          What does a prior W on Θ look like? It will be a joint density p(θ_1, …, θ_B).

          Is it true that “The data contain no information for most of the θ‘s.”? Not in general!

          It will be true if we write p(θ_1, …, θ_B) as p(θ_1)…p(θ_B). In that case observations corresponding to one bucket will have an effect only on the posterior for that particular θ_j. When there are infinite buckets the average of all the averages will still be the average of the prior averages.

          But why does he [force his Bayesian to] assume that the joint prior distribution is the product of independent priors for every point?

          If it’s not appropriate to assume that the different θ’s are independent we have to relax that assumption. There are many ways to model more complex prior distributions with inter-dependence between the different θ’s, the point is that it won’t be true that “The data contain no information for most of the θ‘s.”

          If it is appropriate to assume that each of those (infinite when B goes to infinity) θ is independent from the others then the posterior distribution for ψ won’t be different from the prior distribution for ψ (an infinitely narrow Gaussian centered at the average of the averages of the θ priors).

          That’s not a failure of Bayesian inference, it’s a direct consequence of the prior he’s implicitly using (unless I’m missing something).
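
A small numeric sketch of the independent-prior case may make this concrete. It is a simplified version of the bins setup, not the original Robins-Ritov problem. The assumptions are illustrative: each θ_j gets an independent uniform Beta(1, 1) prior, every observation lands in a different bin, and ψ is the plain average of the θ_j. The function name and the numbers are hypothetical.

```python
def posterior_mean_psi_independent(num_bins, outcomes):
    """Posterior mean of psi = average of theta_j under independent priors.

    Assumes theta_j ~ Beta(1, 1) independently, and that each entry of
    `outcomes` (0 or 1) is a single Bernoulli draw from a distinct bin.
    An observed bin then has posterior Beta(1 + y, 2 - y), with mean
    (1 + y) / 3; every unobserved bin keeps its prior mean of 0.5.
    """
    n = len(outcomes)
    observed_total = sum((1 + y) / 3.0 for y in outcomes)
    unobserved_total = (num_bins - n) * 0.5
    return (observed_total + unobserved_total) / num_bins


# Hypothetical data: 100 observations, 80 of them successes, B = 1,000,000 bins.
outcomes = [1] * 80 + [0] * 20
psi_independent = posterior_mean_psi_independent(1_000_000, outcomes)
sample_mean = sum(outcomes) / len(outcomes)
print(psi_independent, sample_mean)
```

Under these assumptions the posterior mean for ψ stays within about 1e-5 of the prior value 0.5 while the sample mean is 0.8. That is the behaviour attributed to "Wasserman’s Bayesian", and it follows from the product-form prior rather than from Bayes’ rule itself.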

        • Carlos. Indeed, let’s use nonstandard analysis (which is basically to say that we use an enormous number of discrete bins as you did, and then look at the consequences).

          Let i from 1 to N be a nonstandard number of bins representing the x_i values between 0 and 1. Let p_i be the probability density on the value of theta(x_i). The function theta should map to [0,1] so each one is maybe something like a Beta distribution. We make the assumption that these are independent. First let’s look at the derivative of this function at any given x_i, that’d be (theta(x_i+1) – theta(x_i)) / dx with dx = 1/N. Since the theta_i+1 and theta_i have independent priors with a usual density, the probability that they are infinitesimally close to each other is infinitesimal (suppose we know theta(x_i); then the probability that theta(x_i+1) is within an infinitesimal of theta(x_i) is independent of theta(x_i) and is therefore p_(i+1)(theta(x_i))dx, which is infinitesimal).

          And so we have an appreciable number (theta_i+1 – theta_i) divided by dx (or multiplied by N) so that’s nonstandard and so the derivative is almost everywhere nonstandard.

          So, an independent prior in this case is equivalent to the assertion that you are 100% positive that this function is everywhere nondifferentiable. If the priors were independent normals with infinitesimal width sqrt(dx) then we’d have “the derivative of the Brownian motion”, which is not a function; it’s a very very rough measure. The proposed prior on theta is *even rougher* because the standard deviation of the individual increments isn’t constrained to be infinitesimal. So his independent prior says that we are 100% sure that the function is everywhere non-differentiable and that the increments over any infinitesimal dx are actually appreciable. Basically it has an appreciable jump *everywhere*.

          Now, if we aren’t talking about a nonstandard number of boxes but just a large finite one, then all we learn is that when there’s a large number of things you want to learn about, but you have a small amount of data, you can only learn about some of the things until you get a large number of data points.

          What indeed is the point?

        • > let’s use nonstandard analysis

          No, thanks. One doesn’t need differential analysis of any kind to notice that in that problem as stated no meaning is given to the symbols and the functions have no structure and there is no difference in considering a function with domain [0,1]^d, or [0,1] or any bijection of those onto the real line or whatever. The buckets don’t mean anything either in relation to those continuous problems, they provide just a simpler problem (including a way to pass to an infinite problem).

        • Carlos. But the point is, in the case where you have a continuous function, the prior “every point on this function is unrelated to every other point on this function” is a weird prior that doesn’t even result in the description of a reasonable function. No-one would use that prior for anything. It doesn’t make sense as a prior.

          My impression of the original problem is something like this: break the map of the US up into 10×10 mile squares, and sample a small number of these squares according to some coinflips, and measure something about the people in those squares. Maybe income, or whether they’re housed or something. Sure this isn’t infinite dimensional, but it’s a large number of dimensions. Supposedly around 3.1M square miles, so 31,000 boxes. But you’d still likely say that nearby boxes are more likely to have similar measures. That there are likely regions of space where no-one lives at all (plains, mountains, middle of a lake, etc) and that if you have a region like that, it’s more likely the neighbor region is similar… Etc.

          The statement of the problem in terms of some abstract pure mathematical statement belies the nature of the weirdness of those mathematical assumptions. In real world conditions, breaking things up into a large number of subsets doesn’t mean that those subsets should be considered independent.

        • > But the point is, in the case where you have a continuous function, the prior “every point on this function is unrelated to every other point on this function” is a weird prior that doesn’t even result in the description of a reasonable function. No-one would use that prior for anything. It doesn’t make sense as a prior.

          I agree with that. (I think there was no explicit continuity here, and then there is the high-dimensional complication to make sure it wouldn’t help anyway.)

          My point was that even for discrete functions the “data contain no information for unobserved points” argument is either wrong (and a reasonable joint prior needs to have some structure to allow for that) or right (and a frequentist estimator using observed points to predict unobserved points may not make much sense).

        • Is it just the prior though? Isn’t there a likelihood issue? Basically Wasserman insists on a particular likelihood but a Bayesian isn’t committed to accepting his choice.

          Consider breaking up the map into boxes, and observing say 50 of the 31000 boxes, let’s say for per capita square footage of dwelling or something similar.

          If I’ve observed no boxes and then am asked about box 81, I will have some prior. But what if I’ve observed the neighboring box 80? Even without using the fact that neighboring boxes are more alike than distant boxes I might decide that the overall average across all boxes is close to the average of observed boxes. But that’s a likelihood choice (a hierarchical data generation model).

          So, in fact in most cases I would gain information for my belief about unobserved boxes from the observed boxes. It could be through something like spatial correlations, but even without that it could be through hierarchical models of averages etc.

          I’d have to work out the math more carefully but if our model is just

          sigma ~ Exponential(100)
          mu_all ~ Normal(0,100)
          mu[i] ~ Normal(mu_all,sigma)

          Then as we observe a few boxes, our estimate for unobserved boxes concentrates around the overall average mu_all with variation near observed variation, because of the hierarchical pooling.

          We aren’t limited to independent likelihoods as Bayesians either.
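
Here is a rough sketch of the pooling behaviour claimed for the three-line model above. To keep it closed-form, it assumes sigma is known and fixed rather than given the Exponential prior, and it treats each observed box value as an exact draw of mu[i]. The function name and the data are hypothetical. Only mu_all ~ Normal(0, 100) and mu[i] ~ Normal(mu_all, sigma) are taken from the comment.

```python
import math


def predict_unobserved_box(observed, sigma, prior_mean=0.0, prior_sd=100.0):
    """Posterior predictive mean and sd for an unobserved mu[i].

    Assumes mu[i] ~ Normal(mu_all, sigma) with sigma KNOWN (a
    simplification of the model in the comment), and
    mu_all ~ Normal(prior_mean, prior_sd).
    """
    n = len(observed)
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = n / sigma ** 2
    # Normal-normal update for mu_all.
    post_var = 1.0 / (prior_prec + data_prec)
    ybar = sum(observed) / n if n else 0.0
    post_mean = post_var * (prior_prec * prior_mean + data_prec * ybar)
    # A new box adds box-to-box variation on top of uncertainty in mu_all.
    pred_sd = math.sqrt(post_var + sigma ** 2)
    return post_mean, pred_sd


# Hypothetical observed boxes; no spatial information is used at all.
before = predict_unobserved_box([], sigma=2.0)
after = predict_unobserved_box([10.0, 12.0, 11.0, 9.0, 13.0], sigma=2.0)
print(before, after)
```

With no boxes observed, the prediction is just the wide prior. After a handful of boxes, it concentrates near their average with spread close to sigma, even though the unobserved box itself was never measured.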

        • Carlos, you write “Is it true that “The data contain no information for most of the θ‘s.”? Not in general!”

          I think you are right that there are priors where the likelihood will be informative about theta, even when observing only a small sample. As you say, you can assume some dependence between bins.

          But isn’t the point of Wasserman’s example to get a consistent estimator, that is uniformly consistent for all theta? Furthermore, the binning of one dimensional theta is only an intuition. His real example is for theta being a function of [0, 1]^d with d = 100,000. To get a prior where neighboring bins are informative sounds difficult, no? You need some very strong smoothness assumptions. I think that’s why Daniel’s example below also fails. Some smoothness is very reasonable for d = 2, as in his example with regions of the US, but not for d = 100,000.

        • > To get a prior where neighboring bins are informative sounds difficult, no? You need some very strong smoothness assumptions.

          It’s not necessarily about smoothness. You could have no smoothness or even no notion of ‘neighboring’ (no metric space for the x’s) and have correlation between the θ’s.

          It’s not about an informative prior either. The thing is not to “assume some dependence” – it’s “not to assume independence”. If we assume independent priors for the many (infinite) θ’s we assume that ψ has a very (infinitely) narrow distribution.

          If that’s unacceptable we just need to change the prior. One doesn’t need to assume that a systematic correlation exists, only that it’s not impossible. The prior average correlation doesn’t need to be zero. Given that we care about ψ we may also have a model which includes explicitly the average value of the θ’s.

          It seems unfair to ridicule the Bayesians for using a silly prior with undesirable properties and then to ridicule them again for wanting to use a different prior or model because that’s somehow cheating.

        • I don’t really understand what you are saying. How can a prior that doesn’t rule out correlation between the theta help? Wouldn’t you need a very strong prior to learn from say 1000 samples about theta: [0, 1]^d —> [0, 1] with d = 100,000?

          Although here https://dansblog.netlify.app/posts/2022-11-12-robins-ritov/robins-ritov I read that some useful priors do exist, but I didn’t check.

          Anyways, Dan argues, as far as I understand, that the problem is not necessarily the prior. Rather, the pi(x) (the probability to observe Y given x) drops out of the likelihood, because it is a known constant. But we need those to estimate P(Y = 1). To use pi(x) he reformulates the problem and asks how n_j, the number of observed Y_j from N_j samples of group j, influences \theta_j. Because n_j ~ Binom(N_j, \pi_j) he now can use the p_j and all is well. Or something like this.

          Sorry for the formatting, would be cool to be able to use LaTeX here.

        • > How can a prior that doesn’t rule out correlation between the theta help?

          Consider for example a joint prior with all the marginals uniformly distributed on [0 1] but pairwise correlation rho (the same for all pairs) with some prior distribution for rho non-null for the whole [0 1] interval (with infinite thetas they cannot all have negative correlations).

          If that seems like a strong assumption I don’t see how assuming rho=0 is any less strong.

    • Ram:

      Just to be clear, yes, I said that a Bayesian is someone who uses Bayesian inference even when it’s inappropriate. But I never said, nor do I believe, that a Bayesian has to only use Bayesian inference. I use non-Bayesian methods all the time. For example there’s this paper from 1993 that has tons of statistics but nothing Bayesian. And you can find other examples of non-Bayesian applied work in my published papers. I think Bayesian methods are great, but sometimes other tools will do a better job for the task at hand.

      • Granted—you might say a Bayesian is someone inclined to take a Bayesian approach in many cases where it may not be appropriate, even though they will sometimes take a non-Bayesian approach in other cases. But it’s the first part that’s relevant here—trying to use it when not appropriate suggests there is something intrinsically valuable about using it, such that even when it is inappropriate in other respects it is still worth using. But that intrinsic value can perhaps be overcome, explaining why you will sometimes use non-Bayesian procedures.

        • A Bayesian is simply someone who is willing to use the mathematics of probability to represent credence given to a proposition. If you insist that probability be used only when it represents the long run frequency of observable outcomes, you’re a Frequentist. If you use probability in terms of propositions but only on observable outcomes not on plausibility of parameter values, and you choose parameter values by maximizing the resulting likelihood you’re a likelihoodist.

          Being a Bayesian is about the meaning you ascribe to probability in your models. It has nothing to do with “inclination to use Bayesian approaches… where it may not be appropriate.”

        • Daniel–I don’t know how many statisticians actually identify as frequentists (unlike the Bayesians), but I suspect very few would “insist” probability theory be applied in only one way. I think the disagreement within statistics is about the proper way to evaluate analytic procedures. The frequentist perspective is that properties of a procedure’s sampling distribution are what matter, statistically speaking. The opposing perspective would be that other properties are what matter, or that additional properties matter too. The dominant variant of this is the Bayesian one, which (in part!) credits a procedure for “making sense”, where this has something to do with the perceived reasonability of the prior + likelihood + loss function needed to approximately replicate the procedure.

          As an example, consider the bootstrap. Many statisticians like the bootstrap because you can prove (as well as illustrate via simulation) that bootstrapping satisfies certain frequentist desiderata under certain assumptions (e.g., differentiability of the statistical functional with respect to the empirical measure). Bayesians, on the other hand, seem to focus more on the peculiarities of the Bayesian bootstrap, the result of Rubin’s attempt to reverse-engineer the bootstrap using nonparametric Bayes. In my telling, if such peculiarities disturb you, you are a Bayesian to some extent. If all that matters is whether bootstrapping “does the job”–whether it provides the desired coverage under some weak assumptions–then you are not a Bayesian to any interesting extent, even if you use Bayes from time to time.

        • Identifying as a strong Frequentist seems to be more of a philosophy thing than applied statistics thing. But there are still plenty of people who care about the question “if I repeat this procedure many times how often would it do X” even when you will use the procedure exactly once.

          I’m a Bayesian, meaning I think the questions asked and answered by Bayesian stats are the ones that I usually care about. For example “given this data, and my understanding about the science, which values of unknowns would work ‘well’ in the sense that they are compatible with the outcomes observed?”

          Sometimes I really do care about repetitions, or the shape of distributions. In those cases I might do something like Gaussian mixture models or Dirichlet distributions over histogram bins. This lets me focus on the questions I care about, like “which histograms are compatible with my assumptions?”

        • > you are not a Bayesian to any interesting extent

          This suggests a new hybrid form of the “Linda problem”: What is the probability that Linda is both a True Bayesian and a True Scotsman?

        • In the bootstrap, we 1) take the sample data and assume it is actually the population, then 2) see what would happen if we had done the experiment many times on that population. I’m afraid I don’t understand the logic of step 1.

        • So David, if there really is a population, and it can be thought of as a finite large sample from a distribution with a PDF, and the data really is a random sample of the population (this happens, like in quality control or similar but is actually not the norm in most science) then the process “sample from the dataset and add a small random perturbation” has a pdf that is essentially the same as “sample from the full population and add a small random perturbation” and both are essentially the same as the PDF that we consider the population to have been drawn from.

          This is basically about the convergence of the kernel density estimator to the pdf from which a dataset was drawn.

          But note, the big mistake here is that it gets used for many scientific problems that are NOT like “sample from a large finite population and measure some aspect of that population”. For example things like crop yields or psychology experiments or anything where you’re performing a time-series of experiments and don’t have rather strong evidence for a stationary distribution of outcomes.

          But in the case where you’re doing something like “from these 1000 homes, we will sample 27 of them, and calculate the total cost to repair all of them due to improper roof installation, and then try to estimate the variability we would see if we had chosen a different random 27 over and over… to get a sense of how the sampling limits our knowledge” then yeah it does have some applicability.

          We take this “there are N items and I saw n of them” as the core of statistics in most stats 101, whereas it’s actually applicable to a rather narrow range of stuff in science.
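A minimal sketch of the "27 homes out of 1000" case described above, in Python. Everything numeric here is an assumption made up for illustration: the lognormal repair costs, the seed, and the number of resamples. It uses the plain bootstrap (resampling with replacement, no added perturbation and no finite-population correction), and it compares the bootstrap spread with the spread from actually redrawing 27 homes from the full population, which is only possible because the population is simulated.

```python
import random
import statistics

random.seed(0)

N_HOMES = 1000    # size of the finite population (from the comment)
N_SAMPLED = 27    # homes actually inspected (from the comment)
N_REPS = 2000     # number of resamples; an arbitrary choice

# Hypothetical repair costs: the lognormal shape is an assumption, not data.
population = [random.lognormvariate(8.0, 0.5) for _ in range(N_HOMES)]


def estimate_total(costs):
    """Scale the mean cost of the inspected homes up to all N_HOMES homes."""
    return N_HOMES * statistics.fmean(costs)


# The one survey we actually ran: 27 homes drawn without replacement.
sample = random.sample(population, N_SAMPLED)

# Bootstrap: treat the 27 observed homes as the population and
# resample them with replacement, recomputing the estimate each time.
boot_totals = [
    estimate_total(random.choices(sample, k=N_SAMPLED))
    for _ in range(N_REPS)
]
boot_se = statistics.stdev(boot_totals)

# Reference that is available only in a simulation:
# repeat the real survey on the real population.
survey_totals = [
    estimate_total(random.sample(population, N_SAMPLED))
    for _ in range(N_REPS)
]
survey_se = statistics.stdev(survey_totals)

print(f"estimated total cost:        {estimate_total(sample):,.0f}")
print(f"bootstrap standard error:    {boot_se:,.0f}")
print(f"repeated-survey std. error:  {survey_se:,.0f}")
```

How closely the two standard errors agree depends on how representative the 27 sampled homes happen to be, which is the point raised in the replies that follow.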

        • > the process “sample from the dataset and add a small random perturbation” has a pdf that is essentially the same as “sample from the full population and add a small random perturbation”

          Only when that dataset (pre-existing to the process you described) happens to be essentially the same as the full population. Which may be usually the case but it’s not always the case.

          Often it’s the case that the process “sample from the dataset and add a small random perturbation” has a pdf that is essentially the same as “sample from the full population and add a small random perturbation”.

          Sometimes it’s not the case that the process “sample from the dataset and add a small random perturbation” has a pdf that is essentially the same as “sample from the full population and add a small random perturbation”.

        • Carlos, yes, as with all random processes there’s a certain fraction of the time that the representative thing isn’t that representative. As the sample size increases it generally gets better, and we’re talking about single-dimensional measurements and so on, but the point is that it’s not a bad methodology when you want to know about future variability and you’re talking about a finite population of real things that exist in the world, sampled using a validated RNG (as opposed to a theoretical population of “future experiments”).

        • I’m aware that the empirical distribution function converges to the PDF. My advisor made a career of studying that convergence. Maybe the finite sample is a good approximation to the PDF. If so, a Bayesian analysis will tell you this, and let you estimate the variability that you are interested in.

          Frequentist statistics often relies on asymptotic convergence rather than confronting the fact that you only have a finite sample. I doubt that most people using such techniques understand the logic (or lack thereof) of the derivations. The “frequentist desiderata” were often chosen to allow you to get an answer rather than being what people are really interested in. This is most obvious in hypothesis tests where almost no one can explain correctly what a frequentist hypothesis test says.

    • But how does one determine “how it performs”? This is simple for Bayesians since we have a model for the unknowns. Non-Bayesians often resort to asymptotics. Frequentist statistics often seems to be about getting answers to the wrong question. P values are another example of this.

  7. Nice posting. Some personal thoughts associated with it: I think many philosophers are too obsessed with belief and truth, at least when it comes to (statistical) models. I wouldn’t *believe* any of these models, frequentist or Bayesian, and that includes Bayesian modelling of “what I should believe”. I don’t think it’s the job of models to be “believed” or “true”. Of course we can still drop a model if data are inconsistent with it, but even then, for me that is enough of a description of the reason; I don’t need to add that I drop it because the evidence says it’s “false”.

    Priors are usually discussed regarding their reflection of prior belief or knowledge. I think priors should be discussed “data analytically”, regarding their effect on the data analysis, and whether this effect is desirable (or how to choose it to have a desirable effect). I see that this may look problematic because (1) we (hopefully) want to find stuff out rather than having a result just because it’s desirable, and (2) the first desirable effect is to have our prior knowledge inform the final result (so back to square one then). These objections are fair enough; still, the prior will have an influence on the result (why bother having one otherwise?), and “understanding” the prior and the analysis in my view means understanding how this influence plays out, and then understanding whether this is what we want, or whether it will likely mislead us. Most prior beliefs and knowledge don’t come in a way that can easily be translated into a prior distribution, so we can’t just trust that trying to reflect these in the prior will always work smoothly (apart from the fact that much published Bayesian analysis shows very little effort to even do that properly). Just as an example, priors are often chosen in order to favour simpler models against more complicated ones, and this is quite clearly not because there is a well-founded belief that the truth is simple. Rather, it is because such prior choices have an effect that is desirable for other reasons (and can be criticised if these reasons are not so good). Same with regularising priors.

    • Well put.
      There is currently a bit of a movement in medical research to perform a “Bayesian analysis” and arrive confidently at a “posterior probability”. There is rarely any discussion of the appropriateness of this approach, nor justification of the prior. P-values have their problems but at least one knows what is assumed.

      • I find this hard to understand. In a Bayesian analysis what’s assumed is very explicit: such and such prior distributions over parameters, such and such data generating process. It’s all quite clear (if you have access to the code; without code nothing is clear in any paradigm).

        With p-values, on the other hand, there’s likely some paper somewhere in which the person derives the test, but have you read that paper? And have you read about the modifications to the test that were adopted in the interim for whatever reasons, and reviewed the mathematics of the approximations in the test, etc.? P-values can be extremely opaque. I spent a week trying to figure out why my Bayesian survival model gave massively different results from Kaplan-Meier estimates out of a canned routine my collaborators were using. After the deep dive into my model looking for bugs, I finally dove into Kaplan-Meier and found that its core assumptions were largely inappropriate for most cancer research. It’s a standard in that field.

        My impression is that the justification for Frequentist methods is usually “that’s the way everyone in the field usually does it”, whereas the justifications for Bayesian models are individualized and, yes, often not explained well. I’d still take the Bayesian method over the perpetuation of wrongness out of deference to tradition.

        • The main problem with P values is they are the answer to the wrong question. (This is a recurring theme in frequentist statistics: they couldn’t answer the right question, so they answer a different question which has some superficial similarity to the right question.) Sometimes the P value agrees with the answer to the right question, but you don’t know this until you do the Bayesian analysis.

      • You can do NHST with Bayes factors and it’s just as bad as p-values.

        The problem is the utter irrelevance of the hypotheses being considered, e.g., “exactly zero correlation”. In an actual Bayesian analysis any theory predicting that would be ranked so low it’s not even worth considering; you drop it from the denominator.

        Same for the “alternative” that there is “non-zero correlation”. It is consistent with effectively any outcome, so has a very flat likelihood that only looks good compared to the “exactly zero correlation” hypothesis.

        Basically there are infinitely many possible hypotheses out there to throw into Bayes’ rule, but people only compare the two maximally impoverished ones. Then they expect to cure cancer by doing this enough times.

        Solution: Derive models from premises then compare the deduced consequences to observation.
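A rough numerical sketch of the point about the two "impoverished" hypotheses, in Python. It uses a normal mean rather than a correlation, purely to keep the arithmetic in closed form, and every number is an assumption chosen for illustration: the known sigma, the sample size, the width of the diffuse prior standing in for "non-zero effect", the observed mean, and a hypothetical theory that predicts a specific value.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


SIGMA = 1.0                 # known data standard deviation (assumed)
N = 100                     # sample size (assumed)
SE = SIGMA / math.sqrt(N)   # standard error of the sample mean
TAU = 10.0                  # sd of the diffuse prior under "non-zero effect"
THEORY_MU = 0.45            # value predicted by a hypothetical substantive theory


def marginal_null(ybar):
    """Marginal likelihood of the sample mean under "exactly zero effect"."""
    return normal_pdf(ybar, 0.0, SE)


def marginal_diffuse(ybar):
    """Marginal likelihood under mu ~ Normal(0, TAU): nearly flat in ybar."""
    return normal_pdf(ybar, 0.0, math.sqrt(TAU ** 2 + SE ** 2))


def marginal_theory(ybar):
    """Marginal likelihood under a theory that predicts mu = THEORY_MU."""
    return normal_pdf(ybar, THEORY_MU, SE)


ybar = 0.5  # observed sample mean (assumed)

print("diffuse vs. exactly zero:", marginal_diffuse(ybar) / marginal_null(ybar))
print("theory  vs. diffuse:     ", marginal_theory(ybar) / marginal_diffuse(ybar))

# The diffuse alternative assigns almost the same density to very different data.
print("diffuse at ybar = 0.5: ", marginal_diffuse(0.5))
print("diffuse at ybar = -2.0:", marginal_diffuse(-2.0))
```

Under these assumed numbers the diffuse alternative beats the point null only because the null fits so badly, while a model that makes a definite prediction near the data beats both.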
