The method in the linked paper has two things I’m worried about: 1) it needs a Markov kernel together with knowledge of that kernel’s mixing properties, which is not easy to get in high-dimensional cases; 2) the adaptive version uses an effective sample size estimate based on a variance estimate, which is invalid if the variance is infinite and noisy even if the variance is finite. Point 2) could easily be improved by using a Pareto shape estimate, and PSIS would reduce the number of steps needed, but I would still be worried about how to choose the Markov kernel in high-dimensional cases.
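For readers who have not seen it, the variance-based effective sample size mentioned in point 2) is usually computed as (sum of weights)^2 / (sum of squared weights). The following is a minimal sketch of that standard formula, not code from the linked paper; the function name `effective_sample_size` is made up for illustration.

```python
import math

def effective_sample_size(log_weights):
    """Variance-based ESS: (sum w)^2 / sum(w^2), computed from log weights.

    Illustrative helper (hypothetical name). The estimate depends on the
    sample second moment of the weights, so it is unreliable when the
    weights have infinite variance.
    """
    m = max(log_weights)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(x * x for x in w)

# Equal weights give ESS equal to the number of draws;
# one dominant weight gives ESS close to 1.
ess_equal = effective_sample_size([0.0] * 10)
ess_dominated = effective_sample_size([0.0, -1000.0, -1000.0])
```

Because the estimate is itself a ratio of sample moments of the weights, a few extreme draws can move it a lot, which is the noisiness complained about above.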

]]>http://www.lukebornn.com/papers/bornn_cjs_2010.pdf

and another one

http://www.people.fas.harvard.edu/~bornn/Papers/2010CJS/Bornn2010b_pre.pdf

The idea is that, to do IS in the “wrong direction”, you just need to introduce enough intermediate steps, e.g. via tempering: essentially, you can just take SMC samplers off the shelf.

For instance, to do IS from Normal(0, sigma^2) to Normal(0, tau^2) with tau > sigma, a condition for the finite variance of the weights is tau^2 < 2 sigma^2. Instead of doing IS in one step and failing, you can introduce a sequence of distributions, say with variances (eta * sigma^2), (eta^2 * sigma^2), (eta^3 * sigma^2), etc., with eta between 1 and 2. It’s not going to take many steps to go from any sigma to any tau, and it gives Monte Carlo estimates with finite variance.
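A rough sketch of the Gaussian example: for a proposal Normal(0, s) and target Normal(0, t) (s and t are variances), working out the Gaussian integral gives E[w^2] = 1 / sqrt(r (2 - r)) with r = t / s, which is finite only when r < 2. The code below builds the geometric ladder of variances and checks that every step stays under that bound. The function names are made up for illustration, and this only covers the choice of intermediate distributions, not the resampling and Markov-kernel moves of a full SMC sampler.

```python
import math

def weight_second_moment(ratio):
    """E[w^2] under N(0, s) for w = N(0, t) / N(0, s), where ratio = t / s.

    Closed form for this Gaussian pair only; infinite once ratio >= 2.
    """
    if not 0.0 < ratio < 2.0:
        return math.inf
    return 1.0 / math.sqrt(ratio * (2.0 - ratio))

def variance_ladder(sigma2, tau2, eta):
    """Variances sigma2, eta*sigma2, eta^2*sigma2, ..., ending exactly at tau2."""
    if not 1.0 < eta < 2.0:
        raise ValueError("eta must lie strictly between 1 and 2")
    if tau2 <= sigma2:
        raise ValueError("this sketch assumes tau2 > sigma2")
    ladder = [sigma2]
    while ladder[-1] * eta < tau2:
        ladder.append(ladder[-1] * eta)
    ladder.append(tau2)  # final, possibly shorter, step
    return ladder

ladder = variance_ladder(sigma2=1.0, tau2=100.0, eta=1.5)
step_moments = [weight_second_moment(b / a) for a, b in zip(ladder, ladder[1:])]
one_step_moment = weight_second_moment(100.0 / 1.0)  # infinite: direct IS fails
```

The number of steps grows like log(tau^2 / sigma^2) / log(eta), so even a large gap between sigma and tau needs only a modest ladder.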

]]>Your first link above does not work. What is in this paper?

]]>An alternative to your Pareto smoothed stuff: http://onlinelibrary.wiley.com/doi/10.1002/cjs.10045/full

(You don’t need to introduce bias in the importance weights to do cross-validation, you can just introduce intermediate steps… from 2010)

There has been a lot of recent work on other scoring rules that address the sensitivity of marginal likelihoods to the prior:

https://projecteuclid.org/euclid.ba/1423083641 (2015)

(also https://projecteuclid.org/euclid.aos/1336396184 (2012) in the discrete case)

(NOT every scoring rule is equivalent to the logarithmic scoring rule. This depends on your definition of “locality”. If you allow the scoring rule to use derivatives of the predictive density at the observation, then there are other scoring rules, some of which are useful in the context of vague priors.)

There’s obviously a vast literature on algorithms to estimate normalizing constants, but they’re not implemented in Stan, so why bother mentioning them!?

]]>Saying that there is some real physics going on is less controversial but maybe less useful in social sci or such

]]>Then they use an example where data are generated as $latex y_n \sim \mathrm{Normal}(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3, 1)$ with parameters generated from $latex \beta_k \sim \mathrm{Normal}(0, 1)$ priors, then show that model averaging the predictions of $latex \mathrm{Normal}(\beta_k x_k, 1), k \in \{ 1, 2, 3\}$ is better than building a mixture model of the three components.
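To make the setup concrete, here is a minimal simulation sketch of the data-generating process and the three single-predictor models. Several things are assumptions not stated above: the covariates are drawn as standard normals, each single-predictor slope is a plug-in least-squares estimate, and the scores are in-sample rather than leave-one-out. All function names are made up. It is not the paper's stacking procedure, only an illustration of scoring a weighted average of predictive densities at the observation level.

```python
import math
import random

def simulate(n, rng):
    """Draw beta_k ~ Normal(0, 1) and y_n ~ Normal(sum_k beta_k x_nk, 1)."""
    beta = [rng.gauss(0.0, 1.0) for _ in range(3)]
    # Assumption: covariates are independent standard normals.
    xs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]
    ys = [rng.gauss(sum(b * x for b, x in zip(beta, row)), 1.0) for row in xs]
    return beta, xs, ys

def normal_logpdf(y, mu):
    """Log density of Normal(mu, 1) at y."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - mu) ** 2

def single_predictor_logdens(xs, ys, k):
    """Pointwise log densities of Normal(beta_k x_k, 1), slope by least squares."""
    slope = sum(r[k] * y for r, y in zip(xs, ys)) / sum(r[k] ** 2 for r in xs)
    return [normal_logpdf(y, slope * r[k]) for r, y in zip(xs, ys)]

def mixture_log_score(logdens_by_model, weights):
    """Mean log of the weighted average of the models' predictive densities."""
    n = len(logdens_by_model[0])
    total = 0.0
    for i in range(n):
        terms = [math.log(w) + ld[i]
                 for w, ld in zip(weights, logdens_by_model) if w > 0.0]
        m = max(terms)  # log-sum-exp for numerical stability
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total / n

rng = random.Random(0)
beta, xs, ys = simulate(200, rng)
logdens = [single_predictor_logdens(xs, ys, k) for k in range(3)]
single_scores = [sum(ld) / len(ld) for ld in logdens]
uniform_score = mixture_log_score(logdens, [1.0 / 3.0] * 3)
```

Replacing the single-predictor models with the full three-predictor regression in the same scoring function is one way to look at how much is lost relative to the true model.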

Two questions immediately arise:

1. How does the true model fit this data? It’s the continuous model expansion they say they favor up front and it’d show how much we lose by model averaging or stacking.

2. What happens when the misspecification is reversed in the mixture case? That is, what if we generate data according to a mixture then try to average the components? It would be interesting to see this with mixing at the whole-dataset level, too, as well as mixing at the observation level.

Section 3.1 is sort of related to (2)—it generates data from a Gaussian, then uses a mixture composed of a spread of fixed Gaussians. This is another one of those cases where the obvious continuous model expansion would be the thing to do in practice.

Section 3.6 discusses an example where they seem to be saying that stacking is better than continuous model expansion, but I’m not 100% sure.

]]>The data were generated somehow. The process that generated the data is the “true data-generating process.” It may be of arbitrary complexity and may include true quantum-mechanical randomness, but it certainly exists.

The true data-generating process can never be modeled perfectly. In some settings it can be modeled very well: coin flips, spins of a roulette wheel, etc., can usually be modeled quite well. But in many settings your model of the process is going to differ from the true process in important ways, so if you have several candidate models you can be sure none of them are all that close to the data-generating process.

Presumably you disagree with some part of what I just said. Which part?

]]>Is the key feature that there’s some kind of BUGS-like “cut” between the process of (a) fitting the models, and (b) combining them?

]]>I am suspicious of this oft-used phrase, “the true data-generating process.” It sounds to me an awful lot like an instance of Jaynes’s mind-projection fallacy–mistaking the uncertainty in your mind, your lack of information, for some sort of physical process. If you’re doing Bayesian statistics, then you’re using epistemic probabilities, and what’s relevant is whether the models you are using make use of all the significant and relevant information you have. If they don’t — because you don’t know how to incorporate all of the information, or it would be too computationally difficult, or you can’t afford to spend ten years on model-building research before doing the analysis — then *that* is what I would call the M-open setting.

]]>