Comments on: Stacking, pseudo-BMA, and AIC type weights for combining Bayesian predictive distributions

By: Dan Simpson

Dan Simpson — Fri, 14 Apr 2017 13:58:49 +0000

In reply to Aki Vehtari. I'm not sure we were looking to reduce the sensitivity of the marginal likelihood to the prior. I'm a touch obsessed with priors and broadly think that that sensitivity is a good thing. It makes model selection harder, but that's why we're not doing model selection.

By: Aki Vehtari

Aki Vehtari — Fri, 14 Apr 2017 12:03:34 +0000

In reply to shravan. gamma(.1,.1) is not a bad choice for noise sigma in y ~ normal(mu, sigma), but half-cauchy could be used, too. I guess you rememebered warnings that gamma(.1,.1) should not be used as a prior in hierarchical models for population prior scale.

By: Aki Vehtari

Aki Vehtari — Fri, 14 Apr 2017 00:13:13 +0000

In reply to Pierre Jacob.

Multi-step IS is not an alternative to PSIS, it’s an alternative to one-step IS. Pareto smoothing can be used with both of them. In our case, we want to first use the fastest possible option which is one-step IS, and if that fails we do something else.

The method in the linked paper has two things I’m worried: 1) need for Markov kernel with knowledge of the mixing properties of the given kernel – which is not easy in high dimensional cases, 2) the adaptive version uses effective sample size estimate based on variance estimate, which is invalid if the variance is infinite and noisy even if the variance is finite. 2) would be easily improved by use of Pareto shape estimate, and PSIS would reduce the number of needed steps, but still I would be worried how to choose the Markov kernel in high dimensional cases.

By: Pierre Jacob

Pierre Jacob — Thu, 13 Apr 2017 15:57:48 +0000

In reply to Pierre Jacob. typo: a condition for the finite variance of the weights is "tau^2 is less than 2 sigma^2"

By: Pierre Jacob

Pierre Jacob — Thu, 13 Apr 2017 12:15:44 +0000

In reply to Andrew.

Andrew, here’s another link
http://www.lukebornn.com/papers/bornn_cjs_2010.pdf
and another one
http://www.people.fas.harvard.edu/~bornn/Papers/2010CJS/Bornn2010b_pre.pdf

The idea is that, to do IS in the “wrong direction”, you just need to introduce enough intermediate steps, e.g. via tempering: essentially, you can just take SMC samplers off the shelf.

For instance, to do IS from Normal(0, sigma^2) to Normal(0, tau^2) with tau > sigma, a condition for the finite variance of the weights is tau^2 sigma^2, instead of doing IS in one step and fail, you can introduce a sequence of distributions, say with variances (eta * sigma^2), (eta^2 * sigma^2), (eta^3 * sigma^2), etc, with eta between 1 and 2. It’s not going to take many steps to go from any sigma to any tau. And it gives Monte Carlo estimates with finite variance.

By: Dave C.

Dave C. — Thu, 13 Apr 2017 11:35:46 +0000

Maybe a tangent: If the underlying models are from GLMs, using quasilikelihoods rather than true likelihoods, BMA seems to get a little more difficult. Can you use quasilikelihoods with this stacking approach?

By: shravan

shravan — Thu, 13 Apr 2017 04:15:20 +0000

In reply to Yuling Yao. Since sigma is already constrained to be greater than zero you might as well use a cauchy(0,2.5) prior or something?

By: Daniel Lakeland

Daniel Lakeland — Thu, 13 Apr 2017 01:52:30 +0000

In reply to Wayne. Yep. Like the measurement errors are normally distributed with zero mean and SD 1. Well no the measurement errors are equal to the time varying fluctuation in charge across a capacitor caused by thermal noise and local electromagnetic fields induced by radio stations and etc.

By: Andrew

Andrew — Wed, 12 Apr 2017 22:29:00 +0000

In reply to Pierre Jacob. Pierre: Your first link above does not work. What is in this paper?

By: Yuling Yao

Yuling Yao — Wed, 12 Apr 2017 17:59:33 +0000

In reply to shravan. It tries to give a relatively flat prior on gamma. The point is that BMA can be sensitive to priors, even if n goes large.

By: Wayne

Wayne — Wed, 12 Apr 2017 15:27:56 +0000

In reply to Daniel Lakeland. Daniel: You're talking about someone randomly sampling from a somewhat-justified distribution and reifying this into a generation process, am I understanding correctly? This also happens a lot in terms of the word "chance" (as in "determined by chance"), where vague chance becomes a cause.

By: Pierre Jacob

Pierre Jacob — Wed, 12 Apr 2017 11:47:14 +0000

Fascinating stuff…

An alternative to your Pareto smoothed stuff: http://onlinelibrary.wiley.com/doi/10.1002/cjs.10045/full
(You don’t need to introduce bias in the importance weights to do cross-validation, you can just introduce intermediate steps… from 2010)

A lot of recent work on other scoring rules that address the sensitivity of marginal likelihoods from the prior
https://projecteuclid.org/euclid.ba/1423083641 (2015)
(also https://projecteuclid.org/euclid.aos/1336396184 (2012) in the discrete case)
(NOT every scoring rule is equivalent to the logarithmic scoring rule. This depends on your definition of “locality”. If you allow the scoring rule
to use derivatives of the predictive density at the observation, then there are other scoring rules, some of which are useful in the context of vague priors).

There’s obviously a vast literature on algorithms to estimate normalizing constants, but they’re not implemented in stan, so why bother mentioning them!?

By: shravan

shravan — Wed, 12 Apr 2017 09:42:00 +0000

On page 35 sigma has a gamma(.1,.1) prior. Is this a typo?

By: Daniel Lakeland

Daniel Lakeland — Wed, 12 Apr 2017 01:24:41 +0000

In reply to Phil. A lot of the time referring to the true data generating process is making the mistake of treating the problem as if there were a correct choice of random number generator but you just don't know what it is. Saying that there is some real physics going on is less controversial but maybe less useful in social sci or such

By: Yuling Yao

Yuling Yao — Wed, 12 Apr 2017 00:44:47 +0000

In reply to Bob Carpenter. To me it is like the same argument why simple additive model can work in many cases. I agree that stacking can be much worse than a mixture model is the true model is exactly a mixture model. In the extreme case: a mixture of gaussian with unknown location, stacking will estimate each sub-model to be identical, which makes no sense. However, the reason why researchers want to average a list of models is they believe each model is good enough to use. In real application, it is less likely that a user will average a list of gaussian; it is more likely that he wants to average a list of mixture models with different number of components!

By: Bob Carpenter

Bob Carpenter — Tue, 11 Apr 2017 21:48:59 +0000

In reply to Jonah.

They provide two motivations up front for avoiding mixture models—they’re costly to compute and can be hard to identify.

Then they use an example where data are generated as $latex y_n \sim \mathrm{Normal}(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3, 1)$ with parameters generated from $latex \beta_k \sim \mathrm{Normal}(0, 1)$ priors, then show that model averaging the predictions of $latex \mathrm{Normal}(\beta_k x_k, 1), k \in \{ 1, 2, 3\}$ is better than building a mixture model of the three components.

Two questions immediately arise:

1 How does the true model fit this data? It’s the continuous model expansion they say they favor up front and it’d show how much we lose by model averaging or stacking.

2. What happens when the misspecification is reversed in the mixture case? That is, what if we generate data according to a mixture then try to average the components? This would be interesting to see the mixing-at-the-whole-dataset level, too, as well as mixing at the observation level.

Section 3.1 is sort of related to (2)—it generates data from a Gaussian, then uses a mixture composed of a spread of fixed Gaussians. This is another one of those cases where the obvious continuous model expansion would be the thing to do in practice.

Section 3.6 discusses an example where they seem to be saying that stacking is better than continuous model expansion, but I’m not 100% sure.

By: Phil

Phil — Tue, 11 Apr 2017 18:28:14 +0000

In reply to Kevin S. Van Horn.

I’m not sure what you’re saying here. Do you disagree with the idea that there is a “true data-generating process”?

The data were generated somehow. The process that generated the data is the “true data-generating process.” It may be of arbitrary complexity and may include true quantum-mechanical randomness, but it certainly exists.

The true data-generating process can never be modeled perfectly. In some settings it can be modeled very well: coin flips, spins of a roulette wheel, etc., can usually be modeled quite well. But in many settings your model of the process is going to differ from the true process in important ways, so if you have several candidate models you can be sure none of them are all that close to the data-generating process.

Presumably you disagree with some part of what I just said. Which part?

By: Jonah

Jonah — Tue, 11 Apr 2017 18:04:14 +0000

In reply to Bob Carpenter. Looks like this is discussed in Section 3.3 of the paper.

By: Bob

Bob — Tue, 11 Apr 2017 16:50:29 +0000

Thank you, this is a fascinating paper.

By: Bob Carpenter

Bob Carpenter — Tue, 11 Apr 2017 15:44:57 +0000

How does what you’re doing with all this stacking compare to just writing a joint mixture with the same structure?

Is the key feature that there’s some kind of BUGS-like “cut” between the process of (a) fitting the models, and (b) combining them?

By: Kevin S. Van Horn

Kevin S. Van Horn — Tue, 11 Apr 2017 14:45:13 +0000

“the M-open setting in which the true data-generating process is not one of the candidate models being fit.”

I am suspicious of this oft-used phrase, “the true data-generating process.” It sounds to me an awful lot like an instance of Jaynes’s mind-projection fallacy–mistaking the uncertainty in your mind, your lack of information, for some sort of physical process. If you’re doing Bayesian statistics, then you’re using epistemic probabilities, and what’s relevant is whether the models you are using make use of all the significant and relevant information you have. If they don’t — because you don’t know how to incorporate all of the information, or it would be too computationally difficult, or you can’t afford to spend ten years on model-building research before doing the analysis — then *that* is what I would call the M-open setting.