Probabilistic numerics and the folk theorem of statistical computing

U.S. election day is tomorrow. So let’s talk about something else:

1. Encoding prior information using non-generative modeling

I was talking with Hong Ge about the uses of non-generative models in probabilistic programming. An example I gave is the use of prior information on derived quantities. For example, suppose you have a logistic regression with a weak prior on the coefficients:

data {
  int N;
  array[N] int y;
  vector[N] x;
}
parameters {
  real a, b;
}
model {
  a ~ normal(0, 5);
  b ~ normal(0, 5);
  y ~ bernoulli_logit(a + b*x);
}

and you feed it some data x_i, y_i, i=1,…,N. And then you want to add additional information on predicted values. Perhaps, for example, you think the predicted probability for observations 1, 2, and 3 is likely to be close to 50%. Then you could add something like this to the model block:

  vector[N] prob;
  prob = inv_logit(a + b*x);
  prob[1:3] ~ normal(0.5, 0.2);

I’m purposely making this code kinda clunky and purposely using a model that’s not completely well specified (p is constrained to between 0 and 1, but the specified normal prior has no such constraint) just to show how direct this process can be. Also I haven’t actually programmed the model and checked it, so my code could have a bug! Anyway, the point is that we can just throw in this prior information wherever we want, not just on the official “parameters” of the model.

As noted above, the resulting Stan program does not correspond to a generative model! The code produces a target function that is a constant plus a log posterior density, log p(a,b|y), implicitly conditional on the unmodeled data x and N, without separately defining a prior distribution p(a,b) or a data distribution p(y|a,b). There’s no direct way to sample from the prior. The way I like to think of it is that the prior information on those predicted probabilities represent additional data.

2. Replacing hard constraints by soft constraints

This sort of prior on a derived quantity can be useful in many statistical settings. For example, when poststratifying a survey or causal estimate, it’s often the case that we have some partial information on the full poststratification table and also exact information (for example, from a census) on certain marginals. So we might have the marginal totals for age x ethnicity, county, and ethnicity x sex, but not the full joint table. In that case, fitting can be difficult: it can be awkward to parameterize the models so that the margins are known, also computation can be slow: all these hard constraints can lead to difficult geometry.

We can often fix this problem by replacing the hard constraints by soft constraints. As the saying goes, uncertainty greases the wheels of commerce. This is not (yet) automatic—there’s a tradeoff in that if you make the constraints too soft, you’re throwing away some information, but if you make them too narrow, your computational problems can return—and I think further research is needed on developing automatic or semi-automatic methods for implementing soft constraints in such problems.

3. The folk theorem

Regular readers will know the Folk Theorem of Statistical Computing (for more on the topic, see this post by Bob). The above is an example of that folk theorem, in that, in the real world, purportedly hard constraints actually are soft! For example, census numbers are typically only estimates, not exact. Indeed, the first place this idea came up in my own work was with Frédéric Bois and Jiming Jiang when fitting our differential equation model in toxicology: our algorithm had poor mixing, and we realized that the problem was that certain biological measurements we’d taken as known, were only measured; we added the measurement error terms and our model fit better.

4. Probabilistic numerics

Hong pointed out a connection of the above ideas to probabilistic numerics, a field of numerical analysis that I’d never heard of. Here’s a reference that Hong recommends.

11 thoughts on “Probabilistic numerics and the folk theorem of statistical computing

  1. The Census’ American Community Survey also publishes a very helpful set of variance replicate estimates, https://www.census.gov/programs-surveys/acs/data/variance-tables.html. These are 80 bootstrapped samples of many basic measures. This is much more useful than using the margins of error on each field independently.

    I see so many uses of ACS data that uses these point estimates as deterministic data, when in fact the precision of these measurements at the tract level can vary 3x or more.

  2. Thanks for that example! I wasn’t aware that Stan could handle constraints on derived quantities. In my view this capability is a big missing piece in Bayesian computational tools (and in Bayesian pedagogy, more broadly). Would it be easy (and OK) to place such constraints on the posterior distribution on the parameters directly? One often knows some facts about the answer, and it would be nice to not have to go through the gyrations in the modeling to have the posterior agree to those facts. For example, the fact that many of the coefficients of my linear model are all exactly zero.

    • Harsha:

      I prefer to put in the constraint as additional data (as in the example in the above post) rather than constraining the posterior. If, as you say, we “know some facts about the answer,” I’d say the typical scenario is that we have some information about theta, not that we have information about p(theta|y).

      • Andresw:

        I agree that my example is probably better modeled by a prior with non-zero mass on only points with many zeros. However, the inference might be difficult, which is what motivated my question.

        Perhaps an example closer to what I mean is when I want to encode my intuition that any data that is more than 5 times the median is suspect, and was probably caused by rats chewing through the sensor cables. These points should be down weighted in the calculation of the posterior by how far they are from 5 times the median:

        something like log P(theta|y) = log P(theta|y_good) + 1/(threshold – y)^2 log P(theta| y_bad)

        This clearly breaks the Bayesian abstraction, but being ignorant of the probability of rodent infestation I am unsure how to be a true Bayesian.

        • You can put constraints on transformed parameters in Stan, but they only serve the purpose of error checking. If the constraint is violated at the end of the block, then the whole iteration is rejected. Using this to control sampling leads to very inefficient rejection sampling. It’s far better to encode any constraints through constraining transforms plus a change-of-variables adjustment. That’s how all the built-in constraints work.

          If you have information about the posterior (as you put it, “know facts about the answer”), you can encode that through a prior over constrained parameters. It’s going to require what you call “gyrations” one way or another.

          You can directly model your intuition that there are two kinds of measurements—ordinary ones and ones where rats chewed through a cable. You do that through a mixture model with a distribution of “rat chewed” data and a distribution of “normal” data and a probability that a given data item is rat chewed. You won’t need heuristcs like “5 times the median” and can just fit the model. Then you can have priors on what rat-chewed and normal data looks like, or if you have multiple experiments, build these up hierarchically.

          Also, if you need to set parameters to exactly zero, you can use a “spike and slab” prior. In Stan, you can only code this up in small dimensions due to the combinatorics. If you try to sample, it’s also combinatoric, and can lead to NP-hard problems. That doesn’t mean it can’t work in some cases, especially in lower dimensions.

          The good news is that you can do this all smoothly (pun intended) within Bayes—no need to roll out heuristics of mixing or thresholding.

        • Bob:

          When I’m talking about constraints on transformed parameters, I’m talking about “soft constraints,” which can be thought of as “priors” or as “external data.”

        • The easiest way to understand Andrew’s model is as a **generative** model where the first 3 data points are utilized as part of the background to generate the prior. Then it’s a generative model with a slightly complex prior.

          You can think of the posterior as proportional to:

          p(x[4]..x[n] | a,b) p(a,b | x[1],x[2],x[3],Other_Background)

        • Andrew: Yes, I see that you were doing a kind of soft constraint. I’d like to understand what the implications are for the original variables of putting down a density on a transformed parameter derived from a combination of other parameters. I also noticed there’s no change-of-variables correction, which I guess makes sense because you’re not really doing a change-of-variables, just adding a soft constraint.

  3. How will encoding some priors using non-generative modeling affect tools like simulation-based calibration or loo?

    For simulation-based calibration, would that prior information need to be incorporated into the generated quantities block so that the simulated datasets are consistent with the non-generative portion of the model?

    And for loo, would the observations for which you have additional prior information (such as observations 1-3 in your example) need to be excluded or handled differently from the leave-one-out calculations? Possibly just by excluding them from the log_lik array.

    • Eb:

      For simulation-based calibration, it would make sense to consider as the prior information that informs the quantities of interest as “data” rather than as part of the “prior.” This connects to the general issue that the division of information between prior and data has no impact on the posterior distribution but does have implications for model checking and calibration. It’s actually part of a big issue that’s not well explored in Bayesian theory or practice.

    • “Non-generative” is just a matter of perspective. I’ve been meaning to do a whole post on this because it comes up a lot. In this case we still have a joint prior p(theta), where theta = (a, b), albeit conditioned on the “unmodeled” data x,

      p(a, b | x) = normal(a | 0, 5) * normal(b | 0, 5) * normal(inv_logit(a + b*x) | 0.5, 0.2)

      There’s not an easy way to use a normal random number generator to draw from p(a, b | x). But you can easily sample from p(a, b | x) using HMC with Stan. If you thin down to roughly independent draws, you can use them for prior predictive checks.

      Now if we think of our observations as y, then we have a sampling distribution (or likelihood thought of as a function of a and b),

      p(y | a, b, x) = bernoulli(y | inv_logit(a + b*x))

      Throughout the whole process, x is just a fixed set of covariates or features. So we’re also good to go for simulation based calibration checks.

Leave a Reply

Your email address will not be published. Required fields are marked *