Nick Firoozye writes:

I had a question about BMA [Bayesian model averaging] and model combination in general, and I'm directing it to you since these are a basic form of hierarchical model, albeit in the simplest of forms. I wanted to ask what the underlying assumptions are that could lead to BMA improving on a larger model.

I know model combination is a topic of interest in the (frequentist) econometrics community (e.g., Bates & Granger, http://www.jstor.org/discover/10.2307/3008764?uid=3738032&uid=2&uid=4&sid=21101948653381) but at the time it was considered a bit of a puzzle. Perhaps small models combined outperform a big model due to standard errors, insufficient data, etc. But I haven’t seen much in way of Bayesian justification.

In simplest terms, you might have a joint density P(Y,theta_1,theta_2) from which you could use the two marginals P(Y,theta_1) and P(Y,theta_2) to derive two separate forecasts. A BMA-er would take a weighted average of the two forecast densities, having previously specified a model prior. A large-scale modeler would instead work with the larger joint model.
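To make that weighted average concrete, here is a minimal sketch in Python. The weights and the normal forecast densities are made-up stand-ins for the two submodels' predictive distributions; nothing here comes from a real fitted model.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical posterior model probabilities and forecast densities;
# the numbers are invented purely for illustration.
w1, w2 = 0.6, 0.4                          # model weights from the model prior/posterior
p1 = lambda y: normal_pdf(y, 1.0, 1.0)     # forecast density from submodel 1
p2 = lambda y: normal_pdf(y, 2.0, 0.5)     # forecast density from submodel 2

def bma_density(y):
    """BMA forecast density: the weighted average of the two forecasts."""
    return w1 * p1(y) + w2 * p2(y)
```

Since the weights sum to 1, the mixture is itself a proper forecast density.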

It is as though the BMA-er thought P(Y,theta_1,theta_2) was separable (i.e., the effects of the two parameters were independent?) and could easily be represented by (some weighted average?) of the two marginals P(Y,theta_1) and P(Y,theta_2).

In the simplest of cases, why would anyone want to do that? Is it an inability to impose proper priors on the larger parameter space? Would collinearity be an issue? (I know this is less of an issue for a Bayesian than a frequentist).

Of course I’m thinking in terms of simple easily combined models (e.g., a regression on two variables), and a BMA-er could easily combine far more challenging models that don’t naturally form a supermodel.

My reply: Conditional on being required to use noninformative priors on each submodel, the strategy of model averaging or model selection can be better than using the larger model. But I agree that, if you’re thinking of fitting the small model or the large model, it makes more sense to use an informative prior that allows for shrinkage directly. As to your other question, about combining incompatible models, I think it best to create a supermodel that continuously expands the original options. Not that I always (or even usually) do this, but I think it’s the right way to go.

Firoozye follows up:

This makes sense, in that one could easily run into ill-posedness if there is insufficient data and one is using uninformative priors.

I believe some of the early examples of BMA involved running 2^k regressions and averaging, rather than running a single k-dimensional regression with shrinkage (so much simpler to do a lasso… er… shrinkage estimator, to be honest). And BMA was meant to be preferable to frequentist sequential model selection, or to using a criterion that involves the 2^k regressions anyway.
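For what it's worth, that 2^k-regressions-and-average scheme is easy to sketch. Here is a toy version in Python, using exp(-BIC/2) weights as a rough stand-in for the marginal likelihoods; the data, the weighting choice, and the forecast point are all my own assumptions, not anything from the early BMA literature.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy data: k = 3 candidate predictors, only the first actually matters.
n, k = 80, 3
X = rng.normal(size=(n, k))
y = 2.0 * X[:, 0] + rng.normal(size=n)
x_new = np.array([1.0, 0.0, 0.0])          # point at which to forecast

weights, forecasts = [], []
for subset in itertools.product([0, 1], repeat=k):
    cols = [j for j in range(k) if subset[j]]
    # Design matrix: intercept plus the chosen predictors.
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = float(((y - Xs @ beta) ** 2).sum())
    # BIC as a rough proxy for -2 log marginal likelihood.
    bic = n * np.log(rss / n) + Xs.shape[1] * np.log(n)
    weights.append(np.exp(-bic / 2.0))
    xs_new = np.concatenate([[1.0], x_new[cols]])
    forecasts.append(float(xs_new @ beta))

weights = np.array(weights)
weights /= weights.sum()                   # normalize the model weights
bma_forecast = float(weights @ np.array(forecasts))
```

With data like this, essentially all the weight lands on the subsets containing the one real predictor, and the averaged forecast is close to what the true model would give.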

Isn’t BMA a method for forming hierarchical models? And if so aren’t you saying it is better to have a single large non-hierarchical model than it is a hierarchical model?

If we extend the question to continuous parameters (not just discrete model choices as in BMA), we usually consider P(Data | theta) for the model likelihood, P(theta | phi) as the prior, and P(phi) as the hyperprior. But couldn't this all have been done as P(Data | theta, phi) with a joint prior P(theta, phi)? If we had informative priors on both the parameters and the hyperparameters, might we have an even better model?

If so, why should we be doing hierarchical models at all, other than they can be far more intuitive than super-models?
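The equivalence raised in the question is easy to check by simulation: factoring the joint prior as p(theta, phi) = p(theta | phi) p(phi) gives exactly the same distribution either way. The normal choices below are mine, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# Hierarchical route: phi ~ p(phi), then theta | phi ~ p(theta | phi).
phi = rng.normal(0.0, 1.0, size=n)      # hyperprior p(phi) = N(0, 1)
theta = rng.normal(phi, 1.0)            # prior p(theta | phi) = N(phi, 1)

# "Supermodel" route: draw (theta, phi) at once from the implied joint
# prior p(theta, phi), here bivariate normal with cov [[2, 1], [1, 1]].
joint = rng.multivariate_normal([0.0, 0.0],
                                [[2.0, 1.0], [1.0, 1.0]], size=n)
theta_joint = joint[:, 0]

# Both routes give theta the same marginal, N(0, 2).
```

So the hierarchical factorization is a way of writing the joint prior, not a different model; the question of which is more useful is about intuition and elicitation, not about the mathematics.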

It's a genuine interest, because I am very interested in second-order probability as a way of capturing uncertainty together with smooth ambiguity aversion, which I find far more tractable, and even more subtle, than all the very complex imprecise-probability settings; for all their complications, those only lead to the min-max decision rules that would come out of belief/plausibility functions, etc. While Dempster and Shafer do extend Bayes' theorem to their special case, their theory seems far more complex than just using a hierarchical model. Second-order probability is merely a hierarchical model, with weights put on a family of probability measures, and it is much more intuitive than belief functions.

My response:

Indeed, discrete model averaging can be seen as a sort of implementation of continuous model expansion in which the probability of setting a coefficient to zero is a way to get some shrinkage. I just don’t see it as a model that makes much sense in the applications I work on. For similar reasons, I have no particular interest in seeing which sets of predictors the fitted model wants me to include.

On your larger question: yes, I think hierarchical priors can be useful in specifying dependent uncertainty. One of my favorite examples is the two-level prior distribution in our toxicology paper (an example we also discuss in BDA). As for belief functions, they just mystify me (see example here).

Phil writes:

I'm mystified by belief functions, but also by your paper (The Boxer, the Wrestler, and the Coin Flip). I just read it again and I still don't understand the point of it. As I understand it, the idea is to distinguish between two kinds of uncertainty: the outcome of a coin toss, in which I know the "true probability" is 50-50, and the outcome of a single match between the world's greatest boxer and the world's greatest wrestler, in which I don't know the "true probability" but my ignorance is such that I wouldn't be willing to favor either competitor in a wager. I agree that there are important differences between these two types of uncertainty (ignorance vs. stochastic variability), and when I say "important differences" I mean that a person's decisions might depend on which type of uncertainty is relevant in a given situation. For instance, in a wagering situation, if I think an outcome is dominated by stochastic variability I won't bother trying to collect more information in order to predict it, whereas if I'm "ignorance-dominated" then I might. (For instance, if you learn a bit about boxing and wrestling, you'll find that in the case of a great boxer vs. a great wrestler, under any reasonable set of rules the wrestler is likely to win. The Ali vs. Inoki match did not have a reasonable set of rules, though, making it hard to handicap.)

So, fine, I agree that there are important differences between ignorance and other sources of uncertainty. What I don’t understand is how that paper illuminates them. You discuss two possible approaches to predicting the joint outcome of (coin flip, athletic contest) — “robust Bayes” and “belief functions” — and you show that in a particular rather contrived scenario they both yield the same answer, which is, in fact, the right answer. You argue that this indicates a problem with both robust Bayes and belief functions: “Using a simple example coupling a completely known probability (for a coin flip) with a completely unknown probability (for the fight), we have shown that robust Bayes and belief functions can yield counterintuitive results.” I find the right answer to be the intuitive one — if I assess “coin comes up as heads” and “wrestler wins” as being independent 50% propositions, and I know that either both or neither of those outcomes did in fact happen, then I think it’s intuitive (and correct) that it’s equally likely that the outcome was (H,W) or (T,B). Both the Robust Bayes approach and the Belief Function approach yield this intuitively correct answer. So what’s the counterintuitive result?

I think you must be saying that “neither of these approaches makes sense, so it’s counterintuitive that they both give the right answer”? Or something else?

Thanks, Phil. I’ve never understood this article either. And while I grant your point that one might behave differently to get additional knowledge in the case of the two types of uncertainty, I’ve never really thought there was any important difference between them, holding the information set constant. Ever since I first heard about the Ellsberg Urn paradox in graduate school, I have been mystified how anyone would treat the two setups any differently, which in my mind resolves the paradox. http://en.wikipedia.org/wiki/Ellsberg_paradox

Phil:

It’s frustrating when someone misreads something I wrote, because ultimately I have no defense! Even if I’m right, I’m wrong, because I must have written the article unclearly if you misunderstood it!

In this case, you did misunderstand (i.e., I wrote it unclearly). You write that I "show that in a particular rather contrived scenario they both yield the same answer."

I agree of course that my scenario is contrived. But the whole point of the example is that these two different methods give much different answers in this example! Robust Bayes gives you a probability that could be anywhere from 0 to 1, whereas belief functions give you a probability that is concentrated completely at 0.5.

Phil again:

Ah, so that's what you mean by:

“Because we are allowing the parameter π to fall anywhere between 0 and 1, the robust Bayes inference leaves us with complete uncertainty about the two possibilities X = Y = 0 and X = Y = 1. This seems wrong in that it has completely degraded our inferences about the coin flip, X. Equating it with an event we know nothing about—the boxing/wrestling match— has led us to the claim that we can say nothing at all about the coin flip. It would seem more reasonable to still allow a 50/50 probability for X—but this cannot be done in the robust Bayes framework in which the entire range of π is being considered.”

I get it now, but yes, I blame you.

Somehow I read the paper at least twice without realizing that when you say “we can say nothing at all about the coin flip”, you are NOT simply saying “we have no reason to believe heads is more likely than tails.” You’re saying “we can say nothing at all about the probability that a coin flip will be heads.”

Now that I know what you were saying, I understand what you're saying. Yeah. OK, good!

OK, so the Robust Bayes probability isn’t “counterintuitive”, it’s wrong.
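To see where that interval comes from, here is the arithmetic of the quoted passage as a few lines of Python. This is my paraphrase of the setup in the paper: a fair coin X, a fight outcome Y with unknown probability pi, independence, and conditioning on the reported event X = Y.

```python
def prob_heads_given_match(pi):
    """P(X = 1 | X = Y) for a fixed fight probability pi."""
    # P(X = Y) = 0.5 * pi + 0.5 * (1 - pi), which is 0.5 whatever pi is.
    p_match = 0.5 * pi + 0.5 * (1.0 - pi)
    # P(X = 1, Y = 1) = 0.5 * pi, so the conditional probability equals pi.
    return (0.5 * pi) / p_match

# Ranging pi over [0, 1] ranges the conditional probability over [0, 1]:
# that is the robust-Bayes interval for the coin flip objected to above.
values = [prob_heads_given_match(i / 10) for i in range(11)]
```

The conditional probability of heads is literally equal to pi, so taking the whole range of pi wipes out everything we knew about the coin.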

Still not sure about belief functions, though. Maybe I’ll read the paper once more, more carefully.

So much confusion, all caused by conflating the concepts of “probability” and “frequency”.

For this example, think of frequency as a parameter of a generative model (known for the coin, but unknown for the fight). Think of a probability distribution as a description of the available information regarding a particular variable. Probability distributions are calculated rather than measured, so once the model has been specified they are always known (but they change as a function of available information).

The probability distribution for the frequency is conceptually distinct from the probability distribution for the next outcome; it just so happens that when the probability distribution for the frequency is sharp (i.e. a delta function at f – that means we _know_ the value of f, as in the coin example), the probability for the next outcome is also equal to f. Sadly, this coincidence causes people to conflate the two concepts, leading to much confusion.

We can also calculate the probability of the next outcome in examples (such as the fight) where the probability distribution for the frequency is diffuse (we do _not_ know the frequency). With a symmetric prior distribution for the frequency this gives a probability of 0.5, which happens to be the same as for a coin known to be unbiased.
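That last calculation is easy to check by simulation. Here is a sketch assuming a symmetric Beta(2, 2) prior for the unknown frequency; the particular prior is my choice, and any symmetric one gives the same answer.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 1_000_000

# Unknown frequency f, with a symmetric prior distribution.
f = rng.beta(2.0, 2.0, size=m)

# One next outcome per draw of f: outcome ~ Bernoulli(f).
outcome = rng.random(m) < f

# The probability of the next outcome averages over our uncertainty in f,
# giving 0.5, the same as for a coin whose frequency is known to be 0.5.
print(outcome.mean())
```

The probability distributions over the frequency are completely different in the two cases (a delta function vs. something diffuse), yet the probability of the next outcome is the same.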

None of this leads to any difficulty at all, as long as we are careful to distinguish between outcome, frequency, probability of outcome, and probability distribution of frequency.

Of course, you could always _define_ probability to mean frequency (which is the definition implied in the article). Then you’re in for a world of hurt.

We've been having this conversation recently on related blogs (blogs of some of your frequent commenters, at least); see for example:

http://www.entsophy.net/blog/?p=130#comments

and

http://models.street-artists.org/2013/08/22/bayesian-probability-distributions-frequencies-random-variables-and-model-fitting/

I was playing around with this a bit earlier. Suppose you have two Bernoulli random variables, say W and Z, with parameters p_w and p_z respectively, and a joint prior density g(p_w, p_z). Consider Bayesian updating on the event W = Z. The likelihood is p_w p_z + (1 – p_w)(1 – p_z). On that event W = 1 exactly when Z = 1, so E(W | W = Z) = E(Z | W = Z) holds automatically: both equal P(W = 1, Z = 1) / P(W = Z), i.e., the double integral over the unit square of

x y g(x,y)

divided by the double integral over the unit square of

g(x,y) [xy + (1 – x)(1 – y)]

(Note that the two posterior predictive means for fresh draws, the integrals of x g(x,y) [xy + (1 – x)(1 – y)] and of y g(x,y) [xy + (1 – x)(1 – y)], need not be equal for a general g, though they are whenever g is symmetric in its arguments.)
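A quick Monte Carlo check of the conditioning step, with an arbitrary, deliberately asymmetric prior of my choosing standing in for g:

```python
import random

random.seed(3)

kept = w_ones = z_ones = 0
for _ in range(200_000):
    # Draw (p_w, p_z) from an arbitrary asymmetric prior g (my choice).
    p_w = random.betavariate(2.0, 5.0)
    p_z = random.random()
    # Draw W ~ Bernoulli(p_w) and Z ~ Bernoulli(p_z).
    w = random.random() < p_w
    z = random.random() < p_z
    if w == z:               # keep only draws where the event W = Z occurred
        kept += 1
        w_ones += w
        z_ones += z

# On the event W = Z, W = 1 exactly when Z = 1, so the two
# conditional means agree no matter what the prior g is.
print(w_ones / kept, z_ones / kept)
```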

This seems like the correct starting point for Christopher A. Fuchs and other Quantum Bayesians. But Fuchs seems to get messed up somewhere.

Andrew

I too am mystified by belief functions. They have some strange calculus to them that seems to have lost the whole notion that they were supposed to be the min and max of some set of probabilities. Then the requirements are progressively relaxed until I don’t see much of a point at all. Axiomatic nonsense.

That being said, I also don't much care for min and max. Why do we look for min-max solutions? Corner solutions? If I poll experts and they each give a different probability… do I throw out 99.9% of the distribution and consider only the min and max? Quite a ridiculous proposition.
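To make that concrete, here is a toy version with ten made-up expert probabilities: the min-max view keeps only the two extremes, while a second-order view keeps the whole weighted family of measures.

```python
# Hypothetical probabilities elicited from ten experts (made-up numbers).
expert_probs = [0.15, 0.22, 0.30, 0.33, 0.35, 0.38, 0.40, 0.45, 0.70, 0.90]

# Min-max: throw away everything but the extremes of the set.
minmax_interval = (min(expert_probs), max(expert_probs))

# Second-order probability: keep every expert's measure, with a weight on
# each (equal weights here, but they needn't be).
weights = [1.0 / len(expert_probs)] * len(expert_probs)
pooled = sum(w * p for w, p in zip(weights, expert_probs))

print(minmax_interval, pooled)
```

The min-max interval here is driven entirely by the two outlying experts, while the weighted family retains the information that most of the opinions cluster near the middle.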

This is why second-order probability makes so much more intuitive sense and carries with it the beauty of Bayesianism (although you might have to consider just how updating takes place: whether the probabilities, the weights on the probability measures, or both are updated, etc.).

Simply put, we consider a family of probability measures together with weights. It needn't be continuous; it could be discrete. I believe, but am not sure, that quite a lot of the mechanics of decision theory, using utilities, can be replicated well enough using ambiguity aversion (see, e.g., http://www.kellogg.northwestern.edu/faculty/klibanof/ftp/workpap/smoothambiguity.pdf ), although the premise of separability (of risk aversion and ambiguity aversion in preferences) may be suspect. (And some decision-science academics seem more enamoured of prospect theory as a way of looking at human decision making under ambiguity, although I have not seen combinations with 2nd-order probability.)

But it really is this hierarchical framework which can allow us to step back and ask what is the point of the modelling exercise? Was it just to loosen some axioms to the point of unintelligibility or was it to get something useful, intuitive, with some ease of description? For decision making, I believe min-max is not always the best approach (although sometimes it is), and throwing away the (2nd order) distribution (of probability distributions) to just get the bounds throws away tons of useful information which may give us much more basis for coming to a decision. The notion of 2nd order probability has quite a lot going for it which belief functions gloss over in their quest for some mathematical purity.

Anyhow, just my two cents. I would think as well that 2nd-order probability, with its hierarchical Bayesian connection, would make great sense to a hierarchical Bayesian modeller such as yourself.

If it’s all an approximation, why should we make it such a difficult one? Uncertainty is best tackled with heuristics anyway (and it seems this is the way the brain tends to work).