## Cross-validation, LOO and WAIC for time series

This post is by Aki.

Jonah asked on the Stan users mailing list:

Suppose we have J groups and T time periods, so y[t,j] is the observed value of y at time t for group j. (We also have predictors x[t,j].) I’m wondering if WAIC is appropriate in this scenario assuming that our interest in predictive accuracy is for existing groups only (i.e. we might get data for new time periods but only for the same J groups). My hunch is that this scenario requires a more complicated form of cross-validation that WAIC does not approximate, but the more I think about it the more confused I seem to become. Am I right that WAIC is not appropriate here?

I’ll try to be more specific than in my previous comments on this topic.

As WAIC is an approximation of leave-one-out (LOO) cross-validation, I’ll start by considering when LOO is appropriate for time series.

LOO is appropriate if we are interested in how well our model describes the structure in the observed time series. For example, in the birthday example (BDA3 p. 505 and here), we can say that we have learned about the structure if we can predict any single date with missing data, and thus LOO is appropriate. Here we are not so concerned about the birthdays in the future. The fact that the covariate x is deterministic (fixed) doesn’t change how we estimate the expected predictive performance (for a single date with missing data), but since x is fixed there is no uncertainty about the future values of x.

If we are interested in making predictions for the next not yet observed date and we want a better estimate of the expected predictive performance than LOO gives, we can use sequential prediction. I don’t recommend using all the terms
p(y_1) p(y_2|y_1) p(y_3|y_1,y_2) … p(y_T|y_{1..T-1})
because the beginning of this series is sensitive to the prior. I would use the terms
p(y_k|y_{1..k-1}) p(y_{k+1}|y_{1..k}) … p(y_T|y_{1..T-1})
How many terms (k-1) to remove depends on the properties of the time series.
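A minimal sketch of the truncated sequential terms, assuming a toy conjugate model that is not part of the original post: y_t ~ N(mu, noise_var) with known noise_var and prior mu ~ N(prior_mean, prior_var). The function name and all parameter names are hypothetical. For this model each term p(y_t|y_{1..t-1}) is available in closed form, so no refitting is needed; for a general model each term would need its own posterior.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def sequential_log_pred(y, k, prior_mean=0.0, prior_var=100.0, noise_var=1.0):
    """Return [log p(y_t | y_1..y_{t-1}) for t = k..T], with t counted from 1.

    Assumed toy model: y_t ~ N(mu, noise_var), mu ~ N(prior_mean, prior_var).
    """
    terms = []
    prec = 1.0 / prior_var                  # posterior precision of mu so far
    mean_times_prec = prior_mean / prior_var
    for t, y_t in enumerate(y, start=1):
        post_mean = mean_times_prec / prec  # posterior given y_1..y_{t-1}
        post_var = 1.0 / prec
        if t >= k:                          # skip the first k-1 prior-sensitive terms
            terms.append(normal_logpdf(y_t, post_mean, post_var + noise_var))
        prec += 1.0 / noise_var             # conjugate update with y_t
        mean_times_prec += y_t / noise_var
    return terms

y = [0.3, -0.1, 0.8, 0.5, 0.2, 0.9]
print(sum(sequential_log_pred(y, k=3)))     # sum of log terms for t = 3..6
```

Summing the returned list gives the log of the product above; dividing by the number of terms gives a per-observation estimate.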

When the number of time points is much larger than the number of hyperparameters \theta, to make the series even more stable and to correspond better to the prediction task, I would define
p(y_k|y_{1..k-1}) = \int p(y_k|y_{1..k-1},\theta) p(\theta|y_{1..T}) d\theta

If we are interested in making predictions for several not yet observed dates, I recommend using, for example for d-days-ahead prediction,
p(y_{k..k+d}|y_{1..k-1}) p(y_{k+1..k+d+1}|y_{1..k}) … p(y_{T-d..T}|y_{1..T-d-1})
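A small sketch of how the block terms can be assembled, under one stated assumption: the one-step terms log p(y_t|y_{1..t-1}) are exact sequential conditionals (each conditioned only on earlier data). In that case the chain rule gives log p(y_{s..s+d}|y_{1..s-1}) as the sum of d+1 consecutive one-step terms. This identity does not hold if the one-step terms were computed with the full-data hyperparameter posterior p(\theta|y_{1..T}). The function name and argument layout are hypothetical.

```python
def block_log_pred(one_step, k, d):
    """Log block predictive densities for blocks starting at s = k..T-d.

    one_step[t-1] must hold log p(y_t | y_1..y_{t-1}) for t = 1..T.
    Each returned value is log p(y_s..y_{s+d} | y_1..y_{s-1}).
    """
    T = len(one_step)
    if d < 0 or not (1 <= k <= T - d):
        raise ValueError("need d >= 0 and 1 <= k <= T - d")
    # Chain rule: the joint log density of a block is the sum of its
    # d + 1 consecutive one-step conditional log densities.
    return [sum(one_step[s - 1 : s + d]) for s in range(k, T - d + 1)]

one_step = [-1.0, -2.0, -3.0, -4.0, -5.0]   # made-up values for illustration
print(block_log_pred(one_step, k=2, d=1))   # blocks (2,3), (3,4), (4,5)
```

Note that consecutive blocks overlap, so the block terms are not independent pieces of evidence; their sum counts interior time points up to d+1 times.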

If we are interested in making predictions for future dates, we could still use LOO to select a model which describes the structure in the time series well. It is likely that such a model would also make good predictions for future data, but LOO will give an optimistic estimate of the expected predictive performance (for the next not yet observed date). This bias may be such that it does not affect which model is selected. The optimistic bias is harmful, for example, if we use the predictions for resource allocation: by underestimating how difficult it is to predict the future, we might end up not allocating enough resources (doctors for handling births, electricity generation to match the load, etc.).

If we are interested in making predictions for future dates, I think it is OK to use LOO in a preliminary phase, but sequential methods should be used for final reporting and decision making. A reason for using LOO could be that we can get the LOO estimate with a small additional computational cost after the full posterior inference. LOO approximations, which are obtained as a by-product or with a small additional cost after the full posterior inference has been made, are discussed in the papers Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models and WAIC and cross-validation in Stan.
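To illustrate why these approximations are cheap, here is a hedged sketch that computes WAIC and a raw importance sampling LOO estimate from pointwise log-likelihood values evaluated at draws from the full posterior. The input layout (a list of S draws, each a list of n values log p(y_i|\theta_s)) and the function names are assumptions. The importance sampling version uses untruncated weights 1/p(y_i|\theta_s), which can have high variance, and it assumes the observations are conditionally independent given \theta.

```python
import math

def log_mean_exp(values):
    """Numerically stable log(mean(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def sample_variance(values):
    """Unbiased sample variance (divides by S - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def elpd_waic(log_lik):
    """WAIC estimate of expected log predictive density (log score scale)."""
    columns = list(zip(*log_lik))  # one column of S values per observation
    # lppd_i minus the effective-parameter term (posterior variance of log lik)
    return sum(log_mean_exp(c) - sample_variance(c) for c in columns)

def elpd_is_loo(log_lik):
    """Raw importance sampling LOO: p(y_i|y_-i) ~ 1 / mean_s(1 / p(y_i|theta_s))."""
    columns = list(zip(*log_lik))
    return sum(-log_mean_exp([-v for v in c]) for c in columns)

# Made-up log-likelihood values: 4 posterior draws, 2 observations.
log_lik = [[-1.0, -2.1], [-1.2, -1.9], [-0.9, -2.4], [-1.1, -2.0]]
print(elpd_waic(log_lik), elpd_is_loo(log_lik))
```

Both functions reuse the same posterior draws, so no model refitting is involved.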

Note that when using Kalman filter type inference for time series models, these sequential estimates can be obtained as a by-product or with only a small additional cost.
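A minimal sketch of this by-product, assuming a scalar local level model that does not appear in the original post: mu_t = mu_{t-1} + eta_t with eta_t ~ N(0, q), and y_t = mu_t + eps_t with eps_t ~ N(0, r). The variances q and r are treated as fixed and known here, so the terms are conditional on them; all names are hypothetical.

```python
import math

def kalman_log_pred(y, q, r, m0=0.0, p0=100.0):
    """One-step-ahead log predictive densities from a scalar Kalman filter.

    Assumed model: mu_t = mu_{t-1} + N(0, q), y_t = mu_t + N(0, r),
    with prior mu_0 ~ N(m0, p0). Returns log p(y_t | y_1..y_{t-1}, q, r).
    """
    m, p = m0, p0
    terms = []
    for y_t in y:
        # Predict step: propagate the state one time step forward.
        m_pred, p_pred = m, p + q
        # The predictive density of y_t falls out of the filter for free.
        s = p_pred + r
        terms.append(-0.5 * (math.log(2.0 * math.pi * s) + (y_t - m_pred) ** 2 / s))
        # Update step: condition the state on the new observation.
        gain = p_pred / s
        m = m_pred + gain * (y_t - m_pred)
        p = (1.0 - gain) * p_pred
    return terms

y = [0.3, -0.1, 0.8, 0.5, 0.2, 0.9]
terms = kalman_log_pred(y, q=0.1, r=1.0)
print(sum(terms[2:]))   # drop the first two prior-sensitive terms
```

A single forward pass yields every sequential term, which is why no per-time-point refitting is needed in this setting.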

So now I’ve covered when LOO or the sequential approach is appropriate for time series and I’ll return to the actual question which states

(i.e. we might get data for new time periods but only for the same J groups)

That is, the group ids are fixed and the time periods are deterministic.

As I said before, LOO (WAIC) is fine for estimating whether the model has found some structure in the data, and it does not matter that x is a combination of fixed and deterministic parts. If it is important to know the actual predictive performance for future data, you need to use a version of the sequential approach.

WAIC is just an approximation of LOO. I’m now convinced that there is no need to use WAIC. The paper Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models shows that there are better methods than WAIC for Gaussian latent variable models. We are also working on a better method to approximate LOO in Stan (maybe we call it Very Good Information Criterion?). I just need to make some additional experiments and write the paper…

1. Avraham Adler says:

Thank you, Aki.

If I understand correctly, much of the issue with time series is that the future points somehow depend on previous points, so the sequential method is preferred. What about cases where we have n observations, and we want to predict the value of observation n+1 from the same pool? Yes, it is a future observation, but, simplistically, it is really not dependent on the values of the previous observations. The value of the previous observations is to help us get a better understanding of the pool. In those cases, LOO/WAIC should be applicable, correct? Practically, when estimating insurance losses, for example, we will not try to estimate the change due to time (e.g. loss inflation) as part of the model, but try to bring observations to current levels first, so as to have a pool of samples from ostensibly the same version. Also, we’re lucky to have 10 to 15 observations (at least in reinsurance).

Thank you.

• Aki Vehtari says:

Yes

2. Sumio Watanabe says:

Dear Professor Aki Vehtari,

As you know very well, predictive methods are evaluated in terms of bias and variance. Cross validation (CV) is always an unbiased estimator of G(n-1), which is the generalization error of (n-1) samples, but its variance is sometimes not small. DIC is not an unbiased estimator of G(n-1) (it is not even asymptotically unbiased for learning machines such as a neural network, a Boltzmann machine, and a normal mixture), but its variance is small if the shape of the posterior resembles a Gaussian. If the posterior is equal to a Gaussian, then DIC is a good method. WAIC is not an unbiased estimator of G(n-1) but is an asymptotically unbiased estimator even for learning machines, and its variance is sometimes a little smaller than that of CV. Therefore, I expect that WAIC may be useful in learning machines which have deep hierarchical structures. There are a lot of statistical models and the most suitable predictive method may depend on them.

From the mathematical point of view, there has been no mathematical theory of Bayesian CV. CV’s behavior as a random variable was clarified by using WAIC theory. Moreover, WAIC was first derived as a partial integration of the generalization error on functional space. Asymptotic equivalence of CV and WAIC tells us that cross validation in statistics is mathematically equal to the functional partial integration of the generalization error. I think this point is important in the statistical understanding of CV. This is the reason why “(CV-Sn) + (G-S)” is asymptotically a constant. (Sn and S are respectively the empirical and average entropy of the true distribution.) If CV-Sn is large, then G-S is small. CV has this weak point, and almost all information criteria have the same weak point if they are asymptotically equivalent to CV.

• Aki Vehtari says:

Dear Professor Sumio Watanabe,

First I want to say that your work on the mathematical theory of Bayesian CV and WAIC has been very important and useful, and it has greatly influenced my thinking. Even if I don’t recommend WAIC anymore, I recommend people read your articles on the mathematical theory of Bayesian CV.

In my experiments I’ve never seen DIC working better than CV (or WAIC).

In my experiments CV usually has a better performance than WAIC. In my experiments WAIC may have had a smaller variance than importance sampling LOO, but a similar variance to truncated importance sampling LOO.

Your comment about CV’s weak point (if CV-Sn is large, then G-S is small) is very important but unfortunately not well known enough. In my experiments I’ve seen its effect in model selection. Luckily for model selection there are methods with reduced variance and increased performance (paper coming out soon).

3. Avraham Adler says:

Dr. Vehtari, if I understand you correctly in this paper, importance sampling LOO requires the assumption of independence of the observations being used to calibrate the model, whereas WAIC does not. Given the assumption of independence, in your expertise, would you say that IS-LOO is better than WAIC? Presumably, if there is any uncertainty as to the independence of the observations, and if a full n iterations of LOO-CV is not possible, one should use WAIC? Thank you, as always.

4. Howard says:

Does anyone have an example in Stan + R for implementing the method of calculating: p(y_k|y_{1..k-1})p(y_{k+1}|y_{1..k})…p(y_{T}|y_{1..T-1}) ?

It seems to me that a model would need to be fit at every time point t = 1,…,T, and then the above measure would be calculated. That result would then be used in calculating the measure at time point t+1, and so on. In other words, it appears that you cannot calculate the measure using a single Stan run. Or is there more to it?

Thanks,