I’m not sure we were looking to reduce the sensitivity of the marginal likelihood to the prior. I’m a touch obsessed with priors and broadly think that that sensitivity is a good thing. It makes model selection harder, but that’s why we’re not doing model selection.

]]>gamma(.1,.1) is not a bad choice for noise sigma in y ~ normal(mu, sigma), but half-cauchy could be used, too. I guess you rememebered warnings that gamma(.1,.1) should not be used as a prior in hierarchical models for population prior scale.

]]>Multi-step IS is not an alternative to PSIS, it’s an alternative to one-step IS. Pareto smoothing can be used with both of them. In our case, we want to first use the fastest possible option which is one-step IS, and if that fails we do something else.

The method in the linked paper has two things I’m worried: 1) need for Markov kernel with knowledge of the mixing properties of the given kernel – which is not easy in high dimensional cases, 2) the adaptive version uses effective sample size estimate based on variance estimate, which is invalid if the variance is infinite and noisy even if the variance is finite. 2) would be easily improved by use of Pareto shape estimate, and PSIS would reduce the number of needed steps, but still I would be worried how to choose the Markov kernel in high dimensional cases.

]]>typo: a condition for the finite variance of the weights is “tau^2 is less than 2 sigma^2”

]]>Andrew, here’s another link

http://www.lukebornn.com/papers/bornn_cjs_2010.pdf

and another one

http://www.people.fas.harvard.edu/~bornn/Papers/2010CJS/Bornn2010b_pre.pdf

The idea is that, to do IS in the “wrong direction”, you just need to introduce enough intermediate steps, e.g. via tempering: essentially, you can just take SMC samplers off the shelf.

For instance, to do IS from Normal(0, sigma^2) to Normal(0, tau^2) with tau > sigma, a condition for the finite variance of the weights is tau^2 sigma^2, instead of doing IS in one step and fail, you can introduce a sequence of distributions, say with variances (eta * sigma^2), (eta^2 * sigma^2), (eta^3 * sigma^2), etc, with eta between 1 and 2. It’s not going to take many steps to go from any sigma to any tau. And it gives Monte Carlo estimates with finite variance.

]]>Since sigma is already constrained to be greater than zero you might as well use a cauchy(0,2.5) prior or something?

]]>Yep. Like the measurement errors are normally distributed with zero mean and SD 1. Well no the measurement errors are equal to the time varying fluctuation in charge across a capacitor caused by thermal noise and local electromagnetic fields induced by radio stations and etc.

]]>Pierre:

Your first link above does not work. What is in this paper?

]]>It tries to give a relatively flat prior on gamma. The point is that BMA can be sensitive to priors, even if n goes large.

]]>Daniel: You’re talking about someone randomly sampling from a somewhat-justified distribution and reifying this into a generation process, am I understanding correctly? This also happens a lot in terms of the word “chance” (as in “determined by chance”), where vague chance becomes a cause.

]]>An alternative to your Pareto smoothed stuff: http://onlinelibrary.wiley.com/doi/10.1002/cjs.10045/full

(You don’t need to introduce bias in the importance weights to do cross-validation, you can just introduce intermediate steps… from 2010)

A lot of recent work on other scoring rules that address the sensitivity of marginal likelihoods from the prior

https://projecteuclid.org/euclid.ba/1423083641 (2015)

(also https://projecteuclid.org/euclid.aos/1336396184 (2012) in the discrete case)

(NOT every scoring rule is equivalent to the logarithmic scoring rule. This depends on your definition of “locality”. If you allow the scoring rule

to use derivatives of the predictive density at the observation, then there are other scoring rules, some of which are useful in the context of vague priors).

There’s obviously a vast literature on algorithms to estimate normalizing constants, but they’re not implemented in stan, so why bother mentioning them!?

]]>A lot of the time referring to the true data generating process is making the mistake of treating the problem as if there were a correct choice of random number generator but you just don’t know what it is.

Saying that there is some real physics going on is less controversial but maybe less useful in social sci or such

]]>To me it is like the same argument why simple additive model can work in many cases. I agree that stacking can be much worse than a mixture model is the true model is exactly a mixture model. In the extreme case: a mixture of gaussian with unknown location, stacking will estimate each sub-model to be identical, which makes no sense. However, the reason why researchers want to average a list of models is they believe each model is good enough to use. In real application, it is less likely that a user will average a list of gaussian; it is more likely that he wants to average a list of mixture models with different number of components!

]]>They provide two motivations up front for avoiding mixture models—they’re costly to compute and can be hard to identify.

Then they use an example where data are generated as $latex y_n \sim \mathrm{Normal}(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3, 1)$ with parameters generated from $latex \beta_k \sim \mathrm{Normal}(0, 1)$ priors, then show that model averaging the predictions of $latex \mathrm{Normal}(\beta_k x_k, 1), k \in \{ 1, 2, 3\}$ is better than building a mixture model of the three components.

Two questions immediately arise:

1 How does the true model fit this data? It’s the continuous model expansion they say they favor up front and it’d show how much we lose by model averaging or stacking.

2. What happens when the misspecification is reversed in the mixture case? That is, what if we generate data according to a mixture then try to average the components? This would be interesting to see the mixing-at-the-whole-dataset level, too, as well as mixing at the observation level.

Section 3.1 is sort of related to (2)—it generates data from a Gaussian, then uses a mixture composed of a spread of fixed Gaussians. This is another one of those cases where the obvious continuous model expansion would be the thing to do in practice.

Section 3.6 discusses an example where they seem to be saying that stacking is better than continuous model expansion, but I’m not 100% sure.

]]>I’m not sure what you’re saying here. Do you disagree with the idea that there is a “true data-generating process”?

The data were generated somehow. The process that generated the data is the “true data-generating process.” It may be of arbitrary complexity and may include true quantum-mechanical randomness, but it certainly exists.

The true data-generating process can never be modeled perfectly. In some settings it can be modeled very well: coin flips, spins of a roulette wheel, etc., can usually be modeled quite well. But in many settings your model of the process is going to differ from the true process in important ways, so if you have several candidate models you can be sure none of them are all that close to the data-generating process.

Presumably you disagree with some part of what I just said. Which part?

]]>Looks like this is discussed in Section 3.3 of the paper.

]]>Is the key feature that there’s some kind of BUGS-like “cut” between the process of (a) fitting the models, and (b) combining them?

]]>I am suspicious of this oft-used phrase, “the true data-generating process.” It sounds to me an awful lot like an instance of Jaynes’s mind-projection fallacy–mistaking the uncertainty in your mind, your lack of information, for some sort of physical process. If you’re doing Bayesian statistics, then you’re using epistemic probabilities, and what’s relevant is whether the models you are using make use of all the significant and relevant information you have. If they don’t — because you don’t know how to incorporate all of the information, or it would be too computationally difficult, or you can’t afford to spend ten years on model-building research before doing the analysis — then *that* is what I would call the M-open setting.

]]>