Bayesian inference for discrete parameters and Bayesian inference for continuous parameters: Are these two completely different forms of inference?

I recently came across an example of discrete Bayesian inference: a problem where there are a few separate states of the world and the goal is to infer which state you’re in, given some ambiguous information. There are lots of such examples: estimating the probability someone has a certain disease given a positive test, estimating the probability that a person is male given the person’s height and weight, or estimating the probability that an image is a picture of a cat, a dog, or something else, given some set of features of the image.

Discrete Bayesian inference is simple: start with the prior probabilities, update with the likelihood, renormalize, and you have your posterior probabilities. So it makes sense that discrete examples are often used to introduce Bayesian ideas, and indeed we have some discrete examples in chapter 1 of Bayesian Data Analysis.
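To make the recipe concrete, here is a minimal sketch in Python. The numbers are illustrative assumptions (1% prevalence, 95% sensitivity, 10% false-positive rate), not figures from any real test:

```python
# Minimal discrete update: prior x likelihood, renormalized.
# All numbers are illustrative: 1% prevalence, 95% sensitivity,
# 10% false-positive rate.
prior = {"disease": 0.01, "no disease": 0.99}
likelihood = {"disease": 0.95,      # P(test + | disease)
              "no disease": 0.10}   # P(test + | no disease)

unnorm = {s: prior[s] * likelihood[s] for s in prior}
total = sum(unnorm.values())
posterior = {s: p / total for s, p in unnorm.items()}
print(posterior)  # {'disease': ~0.088, 'no disease': ~0.912}
```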

Just to be clear: what’s relevant here is that the parameter space (equivalently, the hypothesis space) is discrete; it’s not necessary that the data space be discrete. Indeed, in the sex-guessing example, you can treat height and weight as continuous observations and that works just fine.

There’s also continuous Bayesian inference, where you’re estimating a parameter defined on a continuous space. This comes up all the time too. You might be using data to estimate a causal effect or a regression coefficient or a probability or some parameter in a physical model. Again, the Bayesian approach is clear: start with a continuous prior distribution on parameter space, multiply by the likelihood, renormalize (doing some integral or performing some simulation), and you have the posterior.
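Here is the same recipe sketched on a grid for a simple continuous example, a binomial success probability with a flat prior; the data (7 successes in 10 trials) are invented for illustration:

```python
import numpy as np

# Same recipe on a grid: posterior for a binomial probability theta,
# flat prior, 7 successes in 10 trials (numbers invented).
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

prior = np.ones_like(theta)             # flat prior on (0, 1)
likelihood = theta**7 * (1 - theta)**3  # binomial kernel
unnorm = prior * likelihood

# "Renormalize (doing some integral)": a Riemann sum here.
posterior = unnorm / (unnorm.sum() * dtheta)

post_mean = (theta * posterior).sum() * dtheta
print(post_mean)  # close to 8/12, the exact Beta(8, 4) posterior mean
```

In one dimension the renormalizing integral is just a Riemann sum over a grid, which already hints at the discrete-to-continuous limit discussed below.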

What I want to argue here is that maybe we should think of discrete and continuous Bayesian inference as two different forms of reasoning. Here I’m trying to continue what Yuling and I said in section 6 of our Holes in Bayesian statistics paper and in our stacking paper.

In general you can define Bayesian inference for a continuous parameter as the limit of Bayesian inference for a discrete parameter, as the number of discrete possibilities increases. The challenge comes in setting up the joint prior distribution for all the parameters in the model.

The logical next step here is to work this out in a simple example, and I haven’t done that yet, so this post is kinda sketchy. But I wanted to share it here in case any of you could take it further.

20 thoughts on “Bayesian inference for discrete parameters and Bayesian inference for continuous parameters: Are these two completely different forms of inference?”

  1. Under finite or countable parameter spaces, likelihood functions and probability distributions behave the same under reparametrization, but not so under uncountable (continuous) parameter spaces. How can one get around that?

  2. I’m having a hard time seeing the distinction you are trying to draw. As you say, Bayesian inference about a continuous parameter is about reallocating probability across an infinite set of possibilities, in contrast to a finite set of possibilities in the discrete case. As you say in section 6, both are just Bayesian inference.

    So by “discrete Bayesian inference”, are you referring specifically to the model comparison problem you and Yuling describe in your 3rd example? If so, I agree it seems to require some additional machinery to deal with the sensitivity of Bayes factors to priors on model parameters. But this seems like a technical problem, rather than a fundamental difference in the manner of inference. In the end, we are still trying to reallocate probability across a set of models. The problem is how to define the models in such a way that the probabilities map onto what we would reasonably think of as beliefs, in the specific setting in which they are applied.

  3. > taking the limit where the number of discrete possibilities increases

    Don’t folks go in the other direction all the time? The intro progression I see most often tends to be something like: derive Bayes’ theorem from joint & conditional probabilities -> medical testing for rare diseases -> flipping a coin and deriving beta conjugacy to the binomial -> grid approximation when we can’t math everything out in closed form -> curse of dimensionality, oops -> MCMC…

    So the grid approximation serves as the discrete approximation of a continuous parameter space, and you’d “empirically” show that the finer the grid, the better your approximation of the correct continuous joint posterior (and discretizing nuisance parameters sometimes comes back later when we want to marginalize them out). A tiny sketch of that demonstration appears below.
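    Here is a small Python sketch of the “finer grid, better approximation” point, using an invented beta-binomial example (flat prior, 7 successes in 10 trials): the grid posterior mean converges to the exact Beta(8, 4) mean of 8/12:

```python
import numpy as np

# Finer grid -> better approximation: grid posterior mean for the
# beta-binomial example (flat prior, 7 successes in 10 trials, invented)
# converging to the exact Beta(8, 4) mean of 8/12.
exact = 8 / 12
for n in [5, 20, 100, 1000]:
    theta = np.linspace(0, 1, n + 2)[1:-1]   # n interior grid points
    unnorm = theta**7 * (1 - theta)**3       # flat prior x likelihood
    w = unnorm / unnorm.sum()                # discrete posterior over grid
    print(n, abs((theta * w).sum() - exact))
```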

  4. How could Bayesian inference for discrete parameters and Bayesian inference for continuous parameters be two completely different forms of inference if the former is the limiting form of the latter?

    Arguably the discrete case is the relevant one – as it is sufficient for any practical purpose – and the continuous limit is a mathematical convenience.

    • Agreed. In fact we can write the continuous case as a discrete case with infinitesimal spacing using Nonstandard Analysis and then just be done.

      On the other hand, computing techniques are somewhat different in the different cases. Jumping between discrete values of a parameter can be difficult, as it potentially requires shifting all the continuous parameters to a different “regime”.

        • Phil:

          No, I’m honestly confused on this point. I’ve been talking with Yuling about it; I guess we should write something more formal, with some clearer examples, to explain why this is a legitimately challenging question. The comments on the above post were helpful in revealing to me that I have not made my point at all clear.

  5. Including discrete parameters can certainly have unexpected consequences for other “downstream” continuous parameters in the model. For example, suppose whether or not someone has a certain disease is the discrete parameter (0/1) and the proportion of their cells that are infected is a “downstream” continuous parameter (i.e., necessarily equal to exactly 0 if the person does not have the disease, but otherwise continuous from 0 to 1). Based on the result of a diagnostic test (i.e., the data), the posterior distribution for the proportion of infected cells may come out a bit strange (a sketch follows this comment). We consider this in

    https://harlanhappydog.github.io/files/lesserknown.pdf

    which is a work in progress and should be posted on arXiv shortly… any feedback is greatly appreciated!
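    Here is a minimal Python sketch of the kind of mixed discrete/continuous posterior being described. The numbers are invented (prior disease probability 0.1, false-positive rate 0.05, and, purely as a cartoon, detection probability equal to the infected fraction p); this is not the model from the linked paper:

```python
import numpy as np

# Invented numbers: P(disease) = 0.1 a priori, false-positive rate 0.05,
# p ~ Uniform(0, 1) given disease, and P(test + | p) = p (a cartoon).
p = np.linspace(0.001, 0.999, 999)
dp = p[1] - p[0]

m0 = 0.9 * 0.05            # P(+, no disease): all mass sits at p = 0
m1 = 0.1 * p.sum() * dp    # P(+, disease) = 0.1 * integral of p dp

post_disease = m1 / (m0 + m1)
print("P(disease | +) =", post_disease)  # about 0.53

# Posterior for p is a mixture: a spike at 0 with weight 1 - 0.53, plus
# a continuous part proportional to p (not uniform!) with weight 0.53.
slab_density = post_disease * p / (p.sum() * dp)
```

    The posterior for p is the “bit strange” object: a point mass at exactly 0 mixed with a continuous density on (0, 1).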

    • Exactly. Suppose you have a physical model where either something oscillates with a slow exponential decay around a positive average value, or it just purely exponentially decays to zero.

      Now observe it for a short time… We see it start at a high value and decrease to a low value… This could be a fast exponential decay, or just the first quarter cycle of an oscillation with a slow exponential decay.

      Making a transition between the two discrete states might require the exponential decay parameter to go from maybe milliseconds to hours… Computing with these kinds of models can be intractable because arranging for jumps between the states is virtually impossible. When the discrete parameter is to change, all the continuous parameters may need to make large jumps in parameter space.

      So although theoretically there is no difference, in practice the ability to compute with discrete models is quite different from the continuous case in some instances. (One common workaround is sketched below.)
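      A Python sketch of the usual workaround: marginalize the discrete indicator out by computing each regime’s marginal likelihood on its own parameter grid, so no sampler ever has to jump between regimes. The data, models, grids, and noise level are all invented:

```python
import numpy as np
from scipy.special import logsumexp

# Marginalize the discrete regime instead of jumping between regimes.
# Everything here is a toy: invented data, models, grids, noise level.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)
y = np.exp(-3 * t) + rng.normal(0, 0.05, t.size)  # truth: pure decay

def loglik(pred):
    return -0.5 * np.sum((y - pred) ** 2) / 0.05**2

# Regime 1: pure exponential decay; the grid over the rate acts as a
# (flat, discretized) prior.
rates = np.linspace(0.1, 10, 200)
l1 = np.array([loglik(np.exp(-r * t)) for r in rates])

# Regime 2: slow oscillation with mild decay around a positive mean.
freqs = np.linspace(1, 20, 200)
l2 = np.array([loglik(0.5 + 0.5 * np.cos(w * t) * np.exp(-0.2 * t))
               for w in freqs])

# Marginal likelihood of each regime = average over its own grid;
# combine with equal prior odds on the two regimes.
m1 = logsumexp(l1) - np.log(l1.size)
m2 = logsumexp(l2) - np.log(l2.size)
print("P(pure decay | y) =", 1 / (1 + np.exp(m2 - m1)))
```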

  6. The “problematic” example in section 6 of the paper referenced in the post seems needlessly complicated.

    “Example 3. Model choice. Bayes factors run into trouble when used to compare discrete probabilities of continuous models. For example, consider the problem of comparing two logistic regressions, one with just an intercept […] and the other including a slope as well […] The Bayes factor comparing the two models will depend crucially on the prior distribution for the slope, b, in the second model […] even though this change will have essentially no influence on the parameters a and b within the model.”

    The same point can be made with a model without parameters, P1(x) = Normal(0, 1), and a model with one parameter, P2(x) = Normal(mu, 1). [Also using a discrete parameter instead of a continuous one, by the way.] The Bayes factor will depend on the prior distribution for mu. Penalizing the less specific models more heavily seems desirable, in fact.
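    That dependence can be seen in closed form: for a single observation x, integrating mu ~ Normal(0, tau^2) out of P2 gives x ~ Normal(0, 1 + tau^2). A Python sketch with an invented data point:

```python
import math

# Single observation x; under P2, mu ~ Normal(0, tau^2) integrates out
# to x ~ Normal(0, 1 + tau^2). The data point is invented.
def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

x = 1.5
m1 = normal_pdf(x, 1.0)  # marginal likelihood under P1
for tau in [0.1, 1.0, 10.0, 100.0]:
    m2 = normal_pdf(x, math.sqrt(1 + tau**2))  # marginal under P2
    print(f"tau = {tau:6.1f}   BF(P2 : P1) = {m2 / m1:.4f}")
# The BF for P2 collapses as tau grows (the Lindley/Bartlett effect),
# even though the posterior for mu within P2 barely changes.
```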

  7. The likelihood is a function of some observation(s). Those observations have some finite precision, so the likelihood/posterior must as well (standard sig fig rules apply).

    Thus, there are no continuous observations or continuous parameters. They are always discrete. Even if the universe is continuous, our measurements of it will always be discrete.

    Usually it is fine to use the continuous approximation, but if you start deducing strange stuff it’s gotta be checked against the discrete gold standard.

  8. My previous post gave some goofy examples of discrete parameter inference: https://statmodeling.stat.columbia.edu/2021/12/13/another-example-to-trick-bayesian-inference/

    Even for continuous parameter spaces, you could come up with similar goofy examples for inference on non-Euclidean spaces, manifolds, etc. Some change of measure is needed. We have become so used to working with R^d that we ignore these problems.

    Another example in which putting a prior is tricky is sensitivity analysis: causal inference people test causal assumptions by perturbing the propensity score, an object on [0,1]. Is the default prior uniform? Is a propensity score changing from 0.7 to 0.8 a big change? There is ambiguity.

    Lastly, in addition to mapping a discrete inference problem into a continuous problem, the inverse direction is also viable: improving continuous inference using stacking-type ideas. There is much to do to bridge the gap.

    • goofy:

      [oxford] foolish or harmlessly eccentric

      [cambridge] silly, esp. in an amusing way

      [webster] being crazy, ridiculous, or mildly ludicrous : SILLY

      [collins] ridiculous; silly; wacky; nutty

      I expected you to take your examples more seriously :-)

      • Carlos, actually I am under the impression that being goofy, or amusingly silly, is an important characteristic of a (counter)example/paradox: silly so that readers can understand the message, and amusing so as to engage readers. In that sense, I should always applaud goofy storytelling :-)

  9. From a mathematical perspective, I’m having trouble understanding what you’re saying. Discrete probabilities are just densities integrated against the counting measure; unification of the two is, to my understanding, the purpose of measure-theoretic probability, and Bayesian densities are not mathematically different objects from any other density. I can see what you mean from a computational perspective: the types of inference algorithms that work on one are fundamentally different from those that work on the other. From a philosophical perspective, I guess I can see the argument that continuous models, or at least the computable ones, have some kind of principle of locality as a fundamental premise, so those models are distinct from those that have nonlocal jumps.

  10. If instead of a continuous variable x you have a discrete system i*dx, a differential equation in x becomes a finite-difference equation in i (times dx); but both equations will have the same solution forms, e.g. cos(x) and cos(i*dx). Which is another way of saying that a continuous system is the limit of a discrete system as the discrete increment goes to zero. In fact, many (perhaps all) of the systems we solve as continuous are in fact discrete, such as fluid flow (molecules) and heat transfer (photons and electrons, mostly). A quick numerical check of this appears below.
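    A small Python check of the finite-difference claim: march u'' = -u forward with a central difference and compare to the continuous solution cos(x); the gap shrinks with dx:

```python
import numpy as np

# Central-difference version of u'' = -u, compared to the continuous
# solution cos(x) near x = pi; the discrepancy shrinks as dx does.
for dx in [0.5, 0.1, 0.01]:
    n = int(round(np.pi / dx))
    u = np.empty(n + 1)
    u[0], u[1] = 1.0, np.cos(dx)  # match cos at the first two points
    for i in range(1, n):
        u[i + 1] = 2 * u[i] - u[i - 1] - dx**2 * u[i]
    print(dx, abs(u[n] - np.cos(n * dx)))
```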

    Which is well-known, but it sparked this thought. Suppose time and space are discrete with minimum increments dt and dx. Then that universe has a speed limit of c=dx/dt (assuming dt is the minimum time any action can take and all travel must proceed in steps of dx).

    I am aware that so far no measurement of the Lorentz transformation has found a significant deviation from continuity, but I would like to think that Zeno was right and continuity is an illusion. Note that Democritus was aware of Zeno’s argument and extended it to continuous matter being an illusion (atoms), which seems to me the best result of philosophy ever.

    • > Which is well-known, but it sparked this thought. Suppose time and space are discrete with minimum increments dt and dx. Then that universe has a speed limit of c=dx/dt (assuming dt is the minimum time any action can take and all travel must proceed in steps of dx).

      If you assume discreteness, it is really easy to start thinking there is some simple set of rules a la Conway’s game of life that lead to everything else as an emergent property. Just add that each “smallest space” is one bit of energy/information, and if a small set of correct rules are picked then the arrangement of these bits in a 3D array can lead to mass/charge/etc via interactions with the others.

      If you have never seen the “walking droplet” experiments that support the pilot wave interpretation of quantum mechanics, it is worth checking out, eg:

      https://www.youtube.com/watch?v=nmC0ygr08tE

      It is one of the most interesting phenomena I have ever seen. But history is littered with previous attempts to use the most advanced technology of the day as an analogy to explain things.

  11. There are certainly some differences between the two, but they tend to be niche cases. I don’t mean to belittle them by calling them niche cases. Niche cases can sometimes be very useful, but I wouldn’t call them a new category, because they can be explained as subtle divergences from the main category. Like most physicists, I prefer to discretize everything and point out the niche cases at the end.

  12. The real fun starts when you have both discrete and continuous elements in your model. A simple case is unsupervised categorization, in which you have a bunch of heights and simultaneously try to figure out who’s which sex (or even how many sexes there are) and also the average and stddev of height for each. A more interesting case is when you genuinely care whether one thing has *any* effect on another (prayer on covid recovery) but also, if so, what size effect it has. You could treat this as two parameters, or one with a Dirac delta in its prior pdf.
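    For the unsupervised-categorization case, here is a Python sketch with simulated heights (all numbers invented). It runs a few EM steps, which give point estimates rather than a full posterior, but it shows the latent discrete labels and the continuous group means being fit together:

```python
import numpy as np

# Simulated heights from two groups (all numbers invented); a few EM
# steps fit the latent discrete labels and the continuous means jointly.
rng = np.random.default_rng(0)
heights = np.concatenate([rng.normal(163, 7, 100),
                          rng.normal(177, 7, 100)])

mu = np.array([160.0, 180.0])   # initial guesses for the group means
w = np.array([0.5, 0.5])        # mixture weights
sigma = 7.0                     # treated as known, for simplicity
for _ in range(50):
    # E-step: soft assignment of each person to each group.
    dens = w * np.exp(-0.5 * ((heights[:, None] - mu) / sigma) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights and means from the soft assignments.
    w = resp.mean(axis=0)
    mu = (resp * heights[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu, w)  # means near 163 and 177, weights near 0.5
```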

  13. I keep thinking about this post so I come back and make this extra comment. I am sure it will be read in 2034 via @StatRetro.

    There are three ways to do discrete parameter inference:
    1. Do Bayesian inference as is. That is BMA (Bayesian model averaging). You will run into all sorts of problems as discussed above.
    2. Treat discrete inference as an approximation of continuous inference. Say I am making inference on 12 eight-school models with discrete tau = {0.1, 0.2, 0.3, …, 0.9, 1, 10, 100}; then instead of doing BMA, I would like to view this problem as quadrature, so as to take the uneven spacing into account (see the sketch below). Quadrature generally differs from BMA unless you have an evenly spaced grid.
    3. Stacking. Stacking is great. But for discrete parameter inference, vanilla stacking ignores the prior and we need to add it back, which is itself not trivial.
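    Here is a Python sketch of point 2, using the tau grid above. The marginal likelihoods p(y | tau) are invented, just to show the reweighting:

```python
import numpy as np

# BMA vs quadrature on the uneven tau grid from point 2. The marginal
# likelihoods p(y | tau) below are invented, just to show the reweighting.
taus = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 10, 100])
marglik = np.exp(-0.5 * (np.log(taus) - np.log(5)) ** 2)  # fake p(y | tau)

# Plain BMA: every grid point counts equally.
bma = marglik / marglik.sum()

# Quadrature: weight each tau by the width of the interval it represents,
# so the sparse region near tau = 10-100 isn't treated like the dense one.
edges = np.concatenate([[taus[0]], (taus[:-1] + taus[1:]) / 2, [taus[-1]]])
widths = np.diff(edges)
quad = marglik * widths / (marglik * widths).sum()

print(np.round(bma, 3))
print(np.round(quad, 3))
```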
