It seems to me that a model would need to be fit at every time point t = 1,…,T, and then the above measure would be calculated. That result would then be used in calculating the measure at time point t+1, and so on. In other words, it appears that you cannot fit the model in a single Stan run. Or is there more to it?
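The fit-at-every-time-point scheme described above can be sketched as a loop of sequential refits. This is only a toy sketch: `fit_model` and `log_pred_density` are hypothetical stand-ins for a real Stan fit and its one-step-ahead predictive density, not actual Stan API calls.

```python
import numpy as np

def sequential_elpd(y, fit_model, log_pred_density, t_start=1):
    """Sum one-step-ahead log predictive densities, refitting the model
    at every time point t (hence no single Stan run suffices)."""
    total = 0.0
    for t in range(t_start, len(y)):
        posterior = fit_model(y[:t])                # refit on data up to time t
        total += log_pred_density(posterior, y[t])  # score the next observation
    return total

# Toy stand-ins (hypothetical; a real application would use a Stan fit):
def fit_model(data):
    return np.mean(data), np.std(data) + 1.0        # crude "posterior" summary

def log_pred_density(post, y_new):
    mu, sigma = post                                # normal predictive density
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y_new - mu)**2 / (2 * sigma**2)

elpd = sequential_elpd(np.array([0.1, 0.2, -0.1, 0.3, 0.0]),
                       fit_model, log_pred_density)
```

The point of the sketch is structural: each iteration conditions only on data observed before time t, so T−1 separate fits are needed.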

Thanks,

First I want to say that your work on the mathematical theory of Bayesian CV and WAIC has been very important and useful, and it has greatly influenced my thinking. Even though I don't recommend WAIC anymore, I recommend that people read your articles on the mathematical theory of Bayesian CV.

In my experiments I’ve never seen DIC work better than CV (or WAIC).

In my experiments CV usually performs better than WAIC. WAIC has sometimes had a smaller variance than importance-sampling LOO, but a variance similar to that of truncated importance-sampling LOO.

Your comment about CV’s weak point (if CV − Sn is large, then G − S is small) is very important but unfortunately not well known enough. In my experiments I’ve seen its effect in model selection. Luckily, for model selection there are methods with reduced variance and better performance (paper coming out soon).

As you know very well, predictive methods are evaluated in terms of bias and variance. Cross-validation (CV) is always an unbiased estimator of G(n−1), the generalization error for (n−1) samples, but its variance is sometimes not small. DIC is not an unbiased estimator of G(n−1) (it is not even asymptotically unbiased for learning machines such as neural networks, Boltzmann machines, and normal mixtures), but its variance is small if the shape of the posterior resembles a Gaussian. If the posterior is equal to a Gaussian, then DIC is a good method. WAIC is not an unbiased estimator of G(n−1), but it is an asymptotically unbiased estimator even for learning machines, and its variance is sometimes a little smaller than that of CV. Therefore, I expect that WAIC may be useful for learning machines which have deep hierarchical structures. There are many statistical models, and the most suitable predictive method may depend on them.

From the mathematical point of view, there had been no mathematical theory of Bayesian CV. CV’s behavior as a random variable was clarified by using WAIC theory. Moreover, WAIC was first derived as a partial integration of the generalization error on functional space. The asymptotic equivalence of CV and WAIC tells us that cross-validation in statistics is mathematically equal to the functional partial integration of the generalization error. I think this point is important for the statistical understanding of CV. This is the reason why (CV − Sn) + (G − S) is asymptotically a constant. (Sn and S are, respectively, the empirical and average entropy of the true distribution.) If CV − Sn is large, then G − S is small. CV has this weak point, and almost all information criteria have the same weak point if they are asymptotically equivalent to CV.

If I understand correctly, much of the issue with time series is that future points somehow depend on previous points, so the sequential method is preferred. What about cases where we have n observations and want to predict the value of observation n+1 from the same pool? Yes, it is a future observation, but, simplistically, it really does not depend on the values of the previous observations; the value of the previous observations is to help us get a better understanding of the pool. In those cases, LOO/WAIC should be applicable, correct? Practically, when estimating insurance losses, for example, we will not try to estimate the change due to time (e.g. loss inflation) as part of the model, but will try to bring observations to current levels first, so as to have a pool of samples that are ostensibly comparable. Also, we’re lucky to have 10 to 15 observations (at least in reinsurance).
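For an exchangeable pool like this, plain importance-sampling LOO (the variant mentioned earlier in the thread) can also be computed from a single fit. This is a minimal numpy sketch of the plain harmonic-mean form, without the truncation or smoothing that reduces its variance; the log-likelihood matrix is simulated for illustration.

```python
import numpy as np

def is_loo_elpd(log_lik):
    """Plain importance-sampling LOO for exchangeable data:
    p(y_i | y_-i) is estimated by the harmonic mean, over posterior
    draws s, of the pointwise likelihoods p(y_i | theta_s)."""
    neg = -log_lik                      # log of the raw importance weights
    m = neg.max(axis=0)                 # max-shift for numerical stability
    # log of mean_s exp(neg[s, i]), negated to give log p(y_i | y_-i)
    log_pred = -(m + np.log(np.mean(np.exp(neg - m), axis=0)))
    return log_pred.sum()

rng = np.random.default_rng(1)
fake_log_lik = rng.normal(-1.0, 0.1, size=(2000, 15))  # (draws, observations)
elpd_loo = is_loo_elpd(fake_log_lik)
```

With only 10 to 15 observations, the asymptotic arguments above are shaky, and the raw weights can have heavy tails, which is exactly why the truncated/smoothed variants exist.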

Thank you.
