Hedging your bets by weighting regressions?

Cody Boyer writes:

I’ve had a question in the back of my mind since I read this article years ago. What I’m curious about is this section, quoted below:

A major challenge is that there are a lot of plausible variables that affect primary and caucus outcomes — but relatively little data to test them out on, especially early in the process when no states have voted and few states have reliable polling. In cases like these, regression analysis is at risk of overfitting, where it may seemingly describe known data very precisely but won’t do a good job of predicting outcomes that it doesn’t know in advance.

Our solution is twofold. First, rather than just one regression, we run a series of as many as 360 regressions using all possible combinations of the variables described below. Although the best-performing regressions receive more weight in the model, all of the regressions receive at least some weight. Essentially, the model is hedging its bets on whether the variables that seemingly describe the results the best so far will continue to do so.

Is this an “actual” statistical method (i.e., something with theory backing it up and a “right” way to do it)? Why do this instead of using something like AIC to select the model to use?

My response:

I’m not sure exactly what they’re doing, but it sounds like stacking based on predictive averaging. Here’s a paper on Bayesian stacking, but the non-Bayesian version’s been around for a while.
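
To make this concrete, here’s a minimal sketch of stacking on point predictions, with made-up data: fit one regression per subset of predictors, get leave-one-out predictions from each, then choose simplex-constrained weights that minimize the out-of-sample error of the combined prediction. This is not the actual FiveThirtyEight procedure; in particular, plain stacking can zero out models, whereas their description guarantees every regression at least some weight.

```python
# A sketch of stacking on point predictions (illustrative data and models,
# not the actual FiveThirtyEight setup): one linear regression per subset
# of predictors, combined with weights fit to out-of-sample error.
from itertools import combinations

import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
n, p = 40, 4                          # few observations, several predictors
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

# One candidate regression per nonempty subset of predictors.
subsets = [list(c) for k in range(1, p + 1) for c in combinations(range(p), k)]

# Leave-one-out predictions from each candidate model.
loo_preds = np.column_stack([
    cross_val_predict(LinearRegression(), X[:, s], y, cv=LeaveOneOut())
    for s in subsets
])

# Stacking: nonnegative weights summing to 1 that minimize the squared
# error of the *combined* out-of-sample prediction.
def loss(w):
    return np.mean((y - loo_preds @ w) ** 2)

m = len(subsets)
fit = minimize(loss, np.full(m, 1.0 / m), bounds=[(0, 1)] * m,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
for s, w in zip(subsets, fit.x):
    if w > 0.01:
        print(f"predictors {s}: weight {w:.2f}")
```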

Choosing one model using AIC or some other criterion is a special case in which one of the models gets weight 1 and all the rest get weight 0.
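
For comparison, here’s selection as a degenerate weight vector, next to Akaike weights, one standard way of softening AIC into a weighting (the AIC values here are hypothetical):

```python
# Model selection vs. model weighting, with hypothetical AIC values.
import numpy as np

aic = np.array([102.3, 100.1, 105.7])

# Selection: a degenerate weight vector, 1 on the best model, 0 elsewhere.
select = np.zeros_like(aic)
select[np.argmin(aic)] = 1.0

# Akaike weights: a soft alternative, w_i proportional to exp(-delta_i / 2),
# where delta_i is the AIC difference from the best model.
delta = aic - aic.min()
akaike = np.exp(-delta / 2) / np.exp(-delta / 2).sum()

print("selection:", select)              # [0. 1. 0.]
print("Akaike weights:", akaike.round(3))
```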

More generally, fully expressing uncertainty in predictions is a challenge. You can do all the model averaging you’d like, but you’re still extrapolating from the data you have. And then you can add extra uncertainty to your prediction to account for what’s not in your model, but when working with a multivariate distribution (for example, forecasting an election in fifty states at once), it can still be hard to get things right. Add uncertainty so that certain marginal forecasts make sense, and this can distort the forecast of the joint distribution (as in that notorious map from fivethirtyeight.com that showed a scenario where Biden won every state except New Jersey). This is not to say that our forecasts were better. We don’t have great tools for capturing multivariate uncertainty. This comes up in missing-data imputation too.
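
Here’s a toy simulation of that tension, with invented numbers: add the same marginal uncertainty to fifty state forecasts either as independent per-state noise or as a shared national swing. The marginals match, but the probability of an odd joint scenario (winning the least favorable state while losing the most favorable one) differs by orders of magnitude:

```python
# Two ways to add the same marginal uncertainty to 50 state forecasts;
# all numbers are invented. Independent per-state noise and a shared
# national swing give identical marginals but very different joint draws.
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_states = 100_000, 50
mu = np.linspace(-10, 10, n_states)       # hypothetical mean margins (points)

# (a) independent per-state noise, sd 6
indep = mu + rng.normal(scale=6.0, size=(n_sims, n_states))

# (b) same total sd: shared national swing (sd 5) plus smaller state noise
shared = (mu + rng.normal(scale=5.0, size=(n_sims, 1))
             + rng.normal(scale=np.sqrt(36 - 25), size=(n_sims, n_states)))

for name, sims in [("independent", indep), ("shared swing", shared)]:
    # an "inverted" scenario: win the least favorable state, lose the most
    # favorable one -- the kind of map that should almost never happen
    inversion = np.mean((sims[:, 0] > 0) & (sims[:, -1] < 0))
    print(f"{name}: marginal sd {sims.std(axis=0).mean():.2f}, "
          f"P(inversion) = {inversion:.5f}")
```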

Anyway, I like predictive model averaging, just don’t want to overstate its benefits.

12 thoughts on “Hedging your bets by weighting regressions?”

  1. This discussion makes me curious about a related point: how do you *express* (plot, summarize, discuss, etc.) the uncertainty you do have in a model’s prediction of a probability (here it would be P(candidate wins primary), for example; in my work it’s more commonly P(event A is a member of class B) in a mixture model)? Formally, unlike for model parameters, a Bayesian model just produces a single “point estimate” of such probabilities, which comes from integrating out the model parameters: P(candidate wins | data) = ∫ P(candidate wins | θ) p(θ | data) dθ, where θ denotes the regression coefficients. But you do have access to an uncertainty measure: treating P(candidate wins | θ) just like any other parameter, the posterior distribution of coefficients induces a distribution on P(candidate wins | …). I guess one should interpret this as “in the hypothetical future where we have collected so much data that we measure the regression perfectly, this is the probability we would impute right now to candidate-wins, over the range of regressions that are consistent with the present data set.” At least, that’s how I’ve explained it (and plotted error bars, etc.) when I’ve had to do it in my own work. But I’m wondering if my amateur approach is backed up by some professional analysis, or if maybe I’m missing something. Has anyone done formal work on this problem?
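
     Concretely, the procedure I’m describing looks something like this sketch (the posterior draws are faked here; in practice they would come from a sampler):

     ```python
     # A sketch of the approach described above: posterior draws of the
     # coefficients induce a distribution on P(win | coefficients). The
     # draws are faked; in practice they come from an MCMC sampler.
     import numpy as np

     rng = np.random.default_rng(2)
     n_draws, p = 4000, 3
     beta_draws = rng.multivariate_normal([0.2, -0.5, 1.0], 0.2 * np.eye(p),
                                          size=n_draws)
     x_new = np.array([1.0, 0.3, -0.7])            # covariates for one new case

     # distribution induced on the event probability by coefficient uncertainty
     p_draws = 1.0 / (1.0 + np.exp(-beta_draws @ x_new))

     point = p_draws.mean()                        # the single model probability
     lo, hi = np.quantile(p_draws, [0.05, 0.95])   # "error bars" on it
     print(f"P(win) = {point:.3f}, 90% interval [{lo:.3f}, {hi:.3f}]")
     ```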

    • Check out Jaynes’ Ap distribution. Chapter 18 here:

      Suppose you have a penny and you are allowed to examine it carefully, and convince yourself that it is an honest coin; i.e. accurately round, with head and tail, and a center of gravity where it ought to be. Then you’re asked to assign a probability that this coin will come up heads on the first toss. I’m sure you’ll say 1/2. Now, suppose you are asked to assign a probability to the proposition that there was once life on Mars. Well, I don’t know what your opinion is there, but on the basis of all the things that I have read on the subject, I would again say about 1/2 for the probability. But, even though I have assigned the same ‘external’ probabilities to them, I have a very different ‘internal’ state of knowledge about those propositions.

      To see this, imagine the effect of getting new information. Suppose we tossed the coin five times and it comes up tails every time. You ask me what’s my probability for heads on the next throw; I’ll still say 1/2. But if you tell me one more fact about Mars, I’m ready to change my probability assignment completely. There is something which makes my state of belief very stable in the case of the penny, but very unstable in the case of Mars.

      […]

      So, the stability of the robot’s state of mind when it has evidence E is determined, essentially, by the width of the density (Ap|E). There does not appear to be any single number which fully describes this stability. On the other hand, whenever it has accumulated enough evidence so that (Ap|E) is fairly well peaked at some value of p, then the variance of that distribution becomes a pretty good measure of how stable the robot’s state of mind is. The greater amount of previous information it has collected, the narrower its Ap-distribution will be, and therefore the harder it will be for any new evidence to change that state of mind.

      Now we can see the difference between the penny and Mars. In the case of the penny, my (Ap|E) density, based on my prior knowledge, is represented by a curve something like that shown in Figure 18.2(a). In the case of previous life on Mars, my state of knowledge is described by an (Ap|E) density something like that shown in Figure 18.2(b), qualitatively. The first moment is the same in the two cases, so I assign probability 1/2 to either one; nevertheless, there’s all the difference in the world between my state of knowledge about those two propositions, and this difference is represented in the (Ap|E) densities.

      […]

      The idea of an Ap distribution is not, needless to say, our own invention. The way we have introduced it here is only our attempt to translate into modern language what we think Laplace was trying to say in that famous passage, ‘When the probability of a simple event is unknown, we may suppose all possible values of this probability between 0 and 1 as equally likely.’ This statement, which we interpret as saying that, with no prior evidence, (Ap|X) = const., has been rejected as utter nonsense by virtually everyone who has written on probability theory in this century. And, of course, on any frequency definition of probability, Laplace’s statement would have no justification at all. But on any theory it is conceptually difficult, since it seems to involve the idea of a ‘probability of a probability’, and the use of an Ap distribution in calculations has been largely avoided since the time of Laplace.

      http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf

      Essentially the point estimate for a probability is the mean of some distribution. There is a many-to-one mapping of distributions to these means, and ignoring this has caused a lot of confusion.
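
      Here’s a quick numerical version of the penny-vs-Mars point, using two Beta distributions (chosen purely for illustration) with the same mean but very different widths:

      ```python
      # Same mean, different Ap-style widths: Beta(500, 500) for the penny,
      # Beta(1, 1) (uniform, Laplace's case) for Mars. One new observation
      # barely moves the first and substantially moves the second.
      from scipy import stats

      for name, (a, b) in [("penny", (500, 500)), ("Mars", (1, 1))]:
          before = a / (a + b)                 # prior mean: 1/2 in both cases
          after = (a + 1) / (a + b + 1)        # conjugate update, one "success"
          sd = stats.beta(a, b).std()          # width of the distribution
          print(f"{name}: sd {sd:.3f}, mean {before:.3f} -> {after:.3f}")
      ```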

    • Will, Anon:

      I agree with what Jaynes wrote.

      The key to expressing uncertainty in probabilities is to be specific about what’s being conditioned on. If you start with Pr(A)=0.3, say, nothing is added by saying in the abstract that “Pr(A) is uniformly distributed between 0.2 and 0.4” or “Pr(A) is uniformly distributed between 0.1 and 0.5.” Those statements provide zero information beyond “Pr(A)=0.3.” Where you can move forward is by defining some random variable b, and then you can talk about Pr(A|b) and a distribution of b. But what’s important here is that “b” has a meaning.

      In election forecasting, for example, “b” could be a measure of average bias of national polls, in which case we could have Pr(Biden wins | b) and integrate this over a distribution for b.
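
      Here’s a minimal Monte Carlo sketch of that integral, with hypothetical numbers standing in for the forecast model and the bias distribution:

      ```python
      # A Monte Carlo version of that integral, with hypothetical numbers:
      # b is the average national poll bias, pr_win_given_bias is a stand-in
      # for whatever forecast model produces Pr(Biden wins | b).
      import numpy as np
      from scipy.stats import norm

      rng = np.random.default_rng(3)

      def pr_win_given_bias(b):
          # hypothetical: a 2-point polling lead, 3 points of residual noise
          return norm.cdf((2.0 - b) / 3.0)

      b_draws = rng.normal(loc=0.0, scale=1.5, size=100_000)  # dist. of b
      print(f"Pr(win) = {pr_win_given_bias(b_draws).mean():.3f}")
      ```

      The point is that the spread of pr_win_given_bias over the draws of b is interpretable, because b has a meaning.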

      For more, see our recent post, “A probability isn’t just a number; it’s part of a network of conditional statements,” and the last page of the 2006 article, “The Boxer, the Wrestler, and the Coin Flip: A Paradox of Robust Bayesian Inference and Belief Functions.”

      Also related is this post, “1 cool trick for defining conditional probability.”

      • > If you start with Pr(A)=0.3, say, nothing is added by saying in the abstract that “Pr(A) is uniformly distributed between 0.2 and 0.4” or “Pr(A) is uniformly distributed between 0.1 and 0.5.” Those statements provide zero information beyond “Pr(A)=0.3.”

        Pr(A)=0.3
        – p(Pr(A) = 0.45) is undetermined

        Pr(A) is uniformly distributed between 0.2 and 0.4
        – p(Pr(A) = 0.45) is zero

        Pr(A) is uniformly distributed between 0.1 and 0.5
        – p(Pr(A) = 0.45) is the same as for any other value between 0.1 and 0.5

        These seem different to me; what am I missing?

        Also, I still don’t see the problem with uniform priors; they are just an approximation used for computational efficiency. All the priors can cancel out in Bayes’ rule if they are approximately equal.

        I think it was discussed before, and the argument was something about generating nonsense in the continuous case. But the continuous version of the rule is itself an approximation of the discrete case, because it is not possible to collect data of infinite precision. So the likelihoods are limited to some number of significant figures, which means the posterior must be as well.

        • What do you mean by “Pr(A)”?

          One possible answer is that A is a proposition and Pr(A) is a real number representing the degree of plausibility that you assign to it.

          In that case, what do you mean by p(Pr(A)=0.45)?

        • I suspect that what Jaynes meant was “I don’t know how much plausibility to assign, but I am willing to consider plausibility in certain ranges as more reasonable given my background than plausibility in other ranges.”

          You can create a parameter, call it q, and say q is a possible value of the plausibility for statement A; then you can say which q values you are more or less willing to entertain. You have p(A|q) = q and you have a “prior over q,” p(q) … This seems cromulent to me. I think it’s Jaynes trying to get at the idea of hierarchical models.
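
          Something like this sketch, with a Beta prior on q chosen purely for illustration:

          ```python
          # A sketch of that hierarchical reading: p(A|q) = q with a prior
          # p(q) (a Beta prior here, only for illustration). Pr(A) = E[q],
          # and p(q | data) carries the stability information.
          from scipy import stats

          prior = stats.beta(2, 2)
          print(f"Pr(A) = {prior.mean():.2f}, sd of q: {prior.std():.2f}")

          # after observing, say, 7 A-like outcomes in 10 trials
          # (conjugate Beta-Bernoulli update)
          posterior = stats.beta(2 + 7, 2 + 3)
          print(f"Pr(A | data) = {posterior.mean():.2f}, "
                f"sd of q: {posterior.std():.2f}")
          ```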

        • > In that case, what do you mean by p(Pr(A)=0.45)?

          It is the probability that the posterior probability we estimated equals that value. There is uncertainty in the probabilities we calculate because we didn’t include every possible model and all relevant data. I don’t know about the Ap distribution concept, but that is my reading of Jaynes.

        • > It is the probability the posterior probability we estimated may equal that value.

          The posterior probability that we have estimated is equal to 0.45 or it isn’t.

          > There is uncertainty in the probabilities we calculate because we didn’t include every possible model and relevant data.

            We included the model we wanted and the data we had. The uncertainty that Jaynes talks about is about the future; he talks about how much “harder it will be for any new evidence to change that state of mind.”

          One can represent that explicitly using the probabilistic framework, with assumptions about what the new evidence could be (its probability distribution) and what would be the new state of mind (maybe using an underlying model and the usual Bayesian updating).

          In the Pr(“there was once life on Mars”)=1/2 example we could get new evidence from a space probe looking for water and we could represent its potential impact on our estimate using Pr(“water is found”), Pr(“there was life”|”water is found”), Pr(“there was life”|”water is not found”), etc.

            In the coin example Pr(“heads on the first flip”)=1/2 because you’re convinced it’s fair, and whatever the outcome, your initial Pr(“heads on the second flip”)=1/2 won’t change. (But note that Pr(“heads on the first flip”) will change if we get new evidence about the outcome of the first flip after it happens! If the probability of a proposition is between 0% and 100% it can potentially become either as soon as we learn whether it’s true or not. What remains 1/2 in this example is the probability in a frequency sense about a sequence of future outcomes.)

            You could also have Pr(“heads on the first flip”)=Pr(“heads on the second flip”)=…=1/2 under a different model where the coin is either fair (50%) or has two heads or two tails (25% probability each). In that case you expect the outcome of the first flip (a 50/50 event in your model) to change the value that you assign to Pr(“heads on the second flip”), etc.
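
            You can check that directly:

            ```python
            # Checking that model: 50% fair, 25% two-headed, 25% two-tailed.
            prior = {"fair": 0.50, "two-headed": 0.25, "two-tailed": 0.25}
            p_heads = {"fair": 0.5, "two-headed": 1.0, "two-tailed": 0.0}

            first = sum(prior[m] * p_heads[m] for m in prior)
            print(first)                                    # 0.5

            # posterior over models after observing one head (Bayes' rule)
            post = {m: prior[m] * p_heads[m] / first for m in prior}
            print(sum(post[m] * p_heads[m] for m in post))  # 0.75
            ```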

            As Andrew and Daniel have mentioned, that’s a kind of hierarchical model. (But there also seems to be some conceptual difference between this “potential distribution under new evidence” and the “latent variables” in Andrew’s example. In any case, looking at the internals of the model to see where the final answer comes from is always informative.)

          (I should re-read that chapter, I’ve just seen that it’s much longer than I remembered.)

  2. The idea of doing a bunch of regressions with different combinations of explanatory variables is very similar to the motivation of a “Gaussian random forest,” or, for that matter, any random forest. A random forest is a collection of “random” trees. There are different ways of doing the randomization; e.g., each tree can include only a random selection of explanatory variables, or each tree can have access to all of the variables but any particular branch of the tree can only use some selection of them. I’m sure there’s a lot of theory about them that I’m unfamiliar with. In my practical experience the details don’t matter much, if you stay near the standard approaches.

    To make a prediction you generate an outcome from each of your thousand trees (or however many). In general you will get a distribution of outcomes, which you can summarize.
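
    For example, with scikit-learn and toy data (the across-tree spread reflects the forest’s internal variability, not a calibrated predictive distribution):

    ```python
    # A sketch of summarizing per-tree predictions from a random forest.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                           random_state=0)
    forest = RandomForestRegressor(n_estimators=1000, random_state=0).fit(X, y)

    x_new = X[:1]                                       # one case to predict
    per_tree = np.array([t.predict(x_new)[0] for t in forest.estimators_])
    print(f"mean {per_tree.mean():.1f}, "
          f"5th-95th percentile [{np.percentile(per_tree, 5):.1f}, "
          f"{np.percentile(per_tree, 95):.1f}]")
    ```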

    I’m not saying anything here that 100 of this blog’s readers don’t know better than I; just thought I’d mention it because no one else has.
