## I am the supercargo

> In a form of sympathetic magic, many built life-size replicas of airplanes out of straw and cut new military-style landing strips out of the jungle, hoping to attract more airplanes. – Wikipedia

Twenty years ago, Geri Halliwell left the Spice Girls, so I’ve been thinking about Cargo Cults a lot.

As an analogy for what I’m gonna talk about, it’s … inapt, but only if you’ve looked up Cargo Cults. But I’m going with it because it’s pride week and Drag Race is about to start.

The thing is, it can be hard to identify if you’re a member of a Cargo Cult. The whole point is that from within the cult, everything seems good and sensible and natural. Or, to quote today’s titular song,

> They say “our John Frum’s coming,
> He’s bringing cargo…” and the rest
> At least they don’t expect to be
> Surviving their own deaths.

This has been on my mind on and off for a while now. Mostly from a discussion I had with someone in the distant-enough-to-not-actually-remember-who-I-was-talking-to past, where we were arguing about something (I’m gonna guess non-informative vs informative priors, but honestly I do not remember) and this person suggested that the thing I didn’t like was a good idea, at least in part, because Harold Jeffreys thought it was a good idea.

A technical book written in the 1930s being used as a coup de grâce to end a technical argument in 2018 screams cargo cult to me. But is that fair? (Extreme narrator voice: It is not fair.)

I guess this is one of the problems we need to deal with as a field: how do we maintain the best bits of our old knowledge (pre-computation, early computation, MCMC, and now) while dealing with the rapidly evolving nature of modern data and modern statistical questions?

So how do you avoid cult-like behaviour? Well, as a child of Nü-Metal*, I think there’s only one real answer:

### Break stuff

I am a firm believer that before you use a method, you should know how to break it. Describing how to break something should be an essential part of describing a new piece of statistical methodology (or, for that matter, of resurrecting an existing one). At the risk of getting all Dune on you, he who can destroy a thing controls a thing.

(We’re getting very masc4masc here. Who’d’ve thought that me with a hangover was so into Sci-Fi & Nü-Metal? Next thing you know I’ll be doing a straight-faced reading of Ender’s Game. Look for me at 2am explaining to a woman who’d really rather not be still talking to me that they’re just called “buggers” because they look like bugs.)

So let’s break something.

### This isn’t meant to last, this is for right now

Specifically, let’s talk about breaking leave-one-out cross validation (LOO-CV) for computing the expected log-predictive density (elpd or sometimes LOO-elpd). Why? Well, partly because I also read that paper that Aki commented on a few weeks back that made me think more about the dangers of accidentally starting a cargo cult. (In this analogy, the cargo is an R package and a bunch of papers.)

One of the fabulous things about this job is that there are two things you really can’t control: how people will use the tools you construct, and how long they will continue to take advice that turned out not to be the best (for serious, cool it with the Cauchy priors!).

So it’s really important to clearly communicate a method’s flaws, both when it’s published and later on. This is, of course, in tension with the desire to actually get work published, so we do what we can.

Now, Aki’s response was basically definitive, so I’m mostly not going to talk about the paper. I’m just going to talk about LOO.

### One step closer to the edge

One of the oldest criticisms of using LOO for model selection is that it is not necessarily consistent when the model list contains the true data generating model (the infamous, but essentially useless** M-Closed*** setting). This contrasts with model selection using Bayes’ Factors, which are consistent in the useless asymptotic regime. (Very into Nü-Metal. Very judgemental.)

Being that judge-y without explaining the context is probably not good practice, so let’s actually look at the famous case where model selection will not be consistent: nested models.

For a very simple example, let’s consider two potential models:

$\text{M1:}\; y_i \sim N(\mu, 1)$

$\text{M2:}\; y_i \sim N(\mu + \beta x_i, 1)$

The covariate $x_i$ can be anything, but for simplicity, let’s take it to be $x_i \sim N(0,1)$.

And to put us in an M-Closed setting, let’s assume the data that we are seeing is drawn from the first model (M1) with $\mu=0$. In this situation, model selection based on the LOO-expected log predictive density will be inconsistent.

### Spybreak!

To see this, we need to understand what the LOO methods are using to select models. It is the ability to predict a new data point coming from the (assumed iid) data generating mechanism. If two models asymptotically produce the same one point predictive distribution, then the LOO-elpd criterion will not be able to separate them. This is different to Bayes’ factors, which will always choose the simplest of the models that make the same predictions.

Let’s look at what happens asymptotically. (And now you see why I focussed on such simple models: I’m quite bad at maths.)

Because these models are regular and have finite-dimensional parameters, they both satisfy all of the conditions of the Bernstein-von Mises theorem (which I once wrote about in these pages during an epic panic attack), which means that we know in both cases that the posterior for the model parameters $\theta$ after observing $n$ data points will be $\theta_j^{(n)} = \theta_{(j)}^* + \mathcal{O}_p(n^{-1/2})$. Here:

• $\theta_j^{(n)}$ is the random variable distributed according to the posterior for model $j$ after $n$ observations,
• $\theta_{(j)}^*$ is the true parameter from model $j$ that would generate the data. In this case $\theta_{(1)}^*=0$ and $\theta_{(2)}^*=(0,0)^T$.
• And $\mathcal{O}_p(n^{-1/2})$ is a random variable with (finite) standard deviation that goes to zero like $n^{-1/2}$ as $n$ increases.
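As a sanity check on that rate: under M1 with a flat prior on $\mu$ and known unit observation variance (simplifying assumptions of mine, not anything claimed in the post), the posterior is available in closed form and you can watch it concentrate at exactly the $n^{-1/2}$ rate. A minimal sketch:

```python
import numpy as np

def posterior_for_mu(y):
    """Closed-form posterior for mu under M1: with a flat prior on mu and
    known unit observation variance, mu | y ~ N(mean(y), 1/n)."""
    n = len(y)
    return y.mean(), 1.0 / np.sqrt(n)

rng = np.random.default_rng(42)
for n in [100, 1000, 10000]:
    y = rng.normal(0.0, 1.0, size=n)  # data truly from M1 with mu = 0
    m, s = posterior_for_mu(y)
    print(f"n={n:6d}  posterior mean={m:+.4f}  posterior sd={s:.4f}")
```

Multiplying $n$ by 100 shrinks the posterior standard deviation by a factor of 10, which is the $\mathcal{O}_p(n^{-1/2})$ term made concrete.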

Arguing loosely (again: quite bad at maths), the LOO-elpd criterion is trying to compute**** $E_{\theta_j^{(n)}}\left[\log(p(y\mid\theta_j^{(n)}))\right]$, which asymptotically looks like $\log(p(y\mid\theta_j^*))+\mathcal{O}(n^{-1/2})$.

This means that, asymptotically, both of these models will give rise to the same posterior predictive distribution and hence LOO-elpd will not be able to tell between them.
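If you’d rather see this numerically than take my loose argument on faith: for these two Gaussian models the exact LOO predictive densities are available in closed form, provided you assume flat priors and known unit noise variance (simplifying assumptions for this sketch; in practice you’d use PSIS-LOO from the loo package rather than this brute-force loop). A rough version:

```python
import numpy as np
from scipy.stats import norm

def loo_elpd(y, X):
    """Exact LOO log predictive density for a Gaussian linear model with
    known unit noise variance and a flat prior on the coefficients."""
    n = len(y)
    lpd = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        Xi, yi = X[mask], y[mask]
        XtX_inv = np.linalg.inv(Xi.T @ Xi)
        beta_hat = XtX_inv @ Xi.T @ yi
        mu_pred = X[i] @ beta_hat                       # held-out predictive mean
        sd_pred = np.sqrt(1.0 + X[i] @ XtX_inv @ X[i])  # held-out predictive sd
        lpd += norm.logpdf(y[i], mu_pred, sd_pred)
    return lpd

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = rng.normal(size=n)                  # data truly from M1 (mu = 0)
X1 = np.ones((n, 1))                    # M1: intercept only
X2 = np.column_stack([np.ones(n), x])   # M2: intercept + slope
elpd1, elpd2 = loo_elpd(y, X1), loo_elpd(y, X2)
print(f"elpd(M1) = {elpd1:.2f}, elpd(M2) = {elpd2:.2f}, diff = {elpd1 - elpd2:.3f}")
```

The total elpd difference between the models stays small and noisy as $n$ grows rather than diverging in favour of the true model, which is the inconsistency in action.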

### Take a look around

LOO-elpd can’t tell them apart, but we sure can! The thing is, the argument of inconsistency in this case only really holds water if you never actually look at the parameter estimates. If you know that you have nested models (i.e. that one is a special case of the other), you should just look at the estimates to see if there’s any evidence for the more complex model. Or, if you want to do it more formally, consider the family of potential nested models as your M-Complete model class and use something like projpred to choose the simplest one.

All of which is to say that this inconsistency is mathematically a very real thing but should not cause practical problems unless you use model selection tools blindly and thoughtlessly.

For a bonus extra fact: this type of setup will also cause the stacking weights we (Yuling, Aki, Andrew, and I) proposed not to stabilize, because any convex combination will asymptotically give the same distribution. So be careful if you’re trying to interpret model stacking weights as posterior model probabilities.
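Here’s a hedged sketch of why the weights wander: compute exact pointwise LOO predictive densities for both models (flat priors and known unit variance are simplifying assumptions of mine, not anything from the stacking paper) and look at the stacking objective as a function of the weight on M1. It is essentially flat, so the maximizer is driven by noise.

```python
import numpy as np
from scipy.stats import norm

def loo_pred_dens(y, X):
    """Pointwise exact LOO predictive densities for a Gaussian linear model
    with known unit noise variance and a flat prior (a simplifying assumption)."""
    n = len(y)
    dens = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        Xi, yi = X[mask], y[mask]
        XtX_inv = np.linalg.inv(Xi.T @ Xi)
        beta_hat = XtX_inv @ Xi.T @ yi
        sd = np.sqrt(1.0 + X[i] @ XtX_inv @ X[i])
        dens[i] = norm.pdf(y[i], X[i] @ beta_hat, sd)
    return dens

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = rng.normal(size=n)  # data truly from M1
p1 = loo_pred_dens(y, np.ones((n, 1)))
p2 = loo_pred_dens(y, np.column_stack([np.ones(n), x]))

# The stacking objective as a function of the weight on M1 is nearly flat,
# so the optimal weight never settles down as n grows.
for w in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"w={w:.2f}  objective={np.sum(np.log(w * p1 + (1 - w) * p2)):.3f}")
```

Re-running this with different seeds moves the maximizing weight around the whole interval, which is the instability being described.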

### Have a cigar

But I said I was going to break things. And so far I’ve just propped up the method yet again.

The thing is, there is a much bigger problem with LOO-elpd. The problem is the assumption that leaving one observation out is enough to get a good approximation to the average value of the posterior log-predictive over a new data set. This is all fine when the data is iid draws from some model.

LOO-elpd can fail catastrophically and silently when the data cannot be assumed to be iid. A simple case where this happens is time-series data, where you should leave out the whole future instead. Or spatial data, where you should leave out large-enough spatial regions that the point you are predicting is effectively independent of all of the points that remain in the data set. Or when your data has multilevel structure, where you really should leave out whole strata.

In all of these cases, cross validation can be a useful tool, but it’s k-fold cross validation that’s needed rather than LOO-CV. Moreover, if your data is weird, it can be hard to design a cross-validation scheme that’s defensible. Worse still, while LOO is cheap (thanks to Aki and Jonah’s work on the loo package), k-fold CV requires re-fitting the model a lot of times, which can be extremely expensive.
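For the time-series case, at least the partitioning is easy to sketch: expanding-window (leave-future-out) folds, where each fold trains on an initial segment and predicts the block that immediately follows it, so the model never sees the future it is asked to predict. This is an illustrative sketch of the partitioning idea, not the loo package’s API:

```python
import numpy as np

def leave_future_out_folds(n, n_folds, min_train):
    """Expanding-window folds for time-ordered data: fold k trains on
    observations [0, train_end) and tests on the next contiguous block."""
    test_size = (n - min_train) // n_folds
    folds = []
    for k in range(n_folds):
        train_end = min_train + k * test_size
        train_idx = np.arange(0, train_end)
        test_idx = np.arange(train_end, min(train_end + test_size, n))
        folds.append((train_idx, test_idx))
    return folds

for train, test in leave_future_out_folds(n=100, n_folds=4, min_train=20):
    print(f"train on [0, {train[-1]}], test on [{test[0]}, {test[-1]}]")
```

The same shape of helper works for spatial or multilevel data if you swap the time index for spatial blocks or strata; the hard part is choosing blocks big enough that the held-out points are effectively independent of the training set.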

All of this is to say that if you want to avoid an accidental LOO cargo cult, you need to be very aware of the assumptions and limitations of the method and to use it wisely, rather than automatically. There is no such thing as an automatic statistician.

Notes:

* One of the most harrowing days of my childhood involved standing at the checkout of the Target in Buranda (a place that has not changed in 15 years, btw) and having to choose between buying the first Linkin Park album and the first Coldplay album. You’ll be pleased to know that I made the correct choice.

** When George Box said that “All models are wrong” he was saying that M-Closed is a useless assumption that is never fulfilled.

*** The three modelling scenarios (according to Bernardo and Smith):

• M-closed means the true data generating model is one of the candidate models $M_k$, although which one is unknown to researchers.
• M-complete refers to the situation where the true model exists (and we can specify the explicit form for it), but for some reason it is not in the list of candidate models.
• M-open refers to the situation in which we know the true model is not in the set of candidate models and we cannot specify its explicit form (this is the most common one).

**** A later edit: I forgot the logarithms in the expected log-densities, because by the time I finished this a drag queen had started talking and I knew it was time to push publish and finish my drink.

### 28 Comments

1. Andrew says:

Dan:

Yes, Aki and Jessy and I thought a lot about this when writing our first paper on Loo and Waic. Not all of this struggle made its way into the paper but there is section 3.5, and also this paragraph in section 3.8:

Cross-validation is like WAIC in that it requires data to be divided into disjoint, ideally conditionally independent, pieces. This represents a limitation of the approach when applied to structured models. In addition, cross-validation can be computationally expensive except in settings where shortcuts are available to approximate the distributions ppost(−i) without having to re-fit the model each time. For the examples in this article such shortcuts are available, but we used the brute force approach for clarity. If no shortcuts are available, common approach is to use k-fold cross-validation where data is partitioned in k sets. With moderate value of k, for example 10, computation time is reasonable in most applications.

• Dan Simpson says:

I think the thing we forget is that we sort of have to say it every time and foreground it more. But at the same time, I write blog posts about Nü-Metal, so I clearly am not king of scientific communication.

• Martha (Smith) says:

“I think the thing we forget is that we sort of have to say it every time and foreground it more.”

+1000

• Shravan says:

“Cross-validation is like WAIC in that it requires data to be divided into disjoint, ideally conditionally independent, pieces.”

Andrew, are you referring specifically to LOO-CV when you say cross-validation in the quote above?

2. Sam Clifford says:

I have seen some people in the same LOO-CV cargo cult, typically people who haven’t had a sufficient grounding in statistical theory or are only just dipping their toes into spatial statistics, particularly when they’re splitting their training/testing data sets completely at random for spatio-temporal modelling. Just no. You can’t just pass everything to caret to handle automatically and assume that everything’s good.

We seem to put a lot of faith in automatic decision making when it comes to model choice, whether it’s throwing a model to stepwise selection or doing cross-validation. Having a statistical ecologist to work with really forces you to defend every conclusion you come to, and I’m very thankful to Erin Peterson at QUT for pushing the reef and jaguar projects in the direction we went.

Oh, there appears to be a handful of LaTeX issues on the page here.

• Dan Simpson says:

LaTeX fixed! Thanks!

I agree with you on automatic decision making, although Aki and I have been talking a lot on how you could at least automate sensible partition selection for temporal, spatial, and spatiotemporal models. In these cases, the procedure for choosing the partitions is always approximately the same, so it should be possible to build in support (if not full automation).

• Sam Clifford says:

Making partitioning easier for cross-validation in spatio-temporal settings is a good idea; we should be making it as easy as possible to do the right thing so that we don’t force non-experts into doing bad statistics by saying that doing the right thing is trivially easy to do yourself.

3. Shravan says:

Can someone define M-closed for me? When I google it, I get an online clothing shop or something like that.

• Dan Simpson says:

It’s the case where the data is assumed to be generated from one of the models under consideration. The stacking paper (linked in the post) has the definitions. Or Bernardo and Smith.

• Dan Simpson says:

I’ve also added to the text

4. Anonymous says:

As a post pops up with weird titles and headings, i am sometimes finding myself checking whether they refer to songs/lyrics (as i think Dan Simpson has done in the past). This one didn’t disappoint in that regard, with 2 possible notable exceptions;

“I am the supercargo” – The Drones
“Break stuff” – Limp Bizkit
“Spybreak!” – Propellerheads
“Take a look around” – Limp Bizkit
“Have a cigar” – Pink Floyd

However the following 2 headings might not be song titles, but (part of) song lyrics:

1) “This isn’t meant to last, this is for right now” – could be a lyric instead of a song title (“Last” – Nine Inch Nails)

2) “One step closer to the edge” – could be (part of) a lyric instead of a song title (“One step closer” – Linkin Park)

If this is correct, i’d personally have preferred them all to be either (complete) song lyrics, or song titles. But that could just be me.

Regardless of the above: this post also reminded me of something i thought about recently. It occurred to me that many things can influence science and scientific writing. I recently had a distinct “feeling” after listening to several songs by a particular artist and music genre that i never really listened to on a loop for a few days, and it sort of “opened up” a new door in my mind, perhaps involving creativity and/or simply not giving a f#ck anymore and just writing what “flows” out naturally, which i don’t remember ever having had before. I don’t think that was the only thing that influenced my writing at that time, but i feel/think it really contributed to it. It may perhaps sound weird, but my point is that many things can positively influence scientific writing.

5. Anonymous says:

“One of the most harrowing days of my childhood involved standing that the check out of the Target in Buranda (a place that has not changed in 15 year, btw) and having to choose between buying the first Linkin Park album and the first Coldplay album. You’ll be pleased to know that I made the correct choice.”

Not a fan of Linkin Park myself, but if you didn’t (since then) buy Coldplay’s 1st album, i hope you bought (or at least listened to) their 2nd album: “A rush of blood to the head”. It even has a song titled “The Scientist” on it. My favorite song from that album is “Amsterdam”:

https://www.youtube.com/watch?v=vblNj75hUpM

6. takebakawashi says:

“LOO-elpd can fail catastrophically and silently when the data cannot be assumed to be iid.”

I have a sad tale to tell about this but I’d best not — suffice it to say that the very smart person who messed this up was not me.

7. Radford Neal says:

I have to admit to being too lazy to actually do simulations to tell for sure, but my intuition is that LOO works for this situation (i.e., selects the right model more than half the time, not all the time, of course). I don’t see anything in your post that would demonstrate that it doesn’t work. LOO is something one applies to a finite data set, so what happens asymptotically is not relevant. Or are you perhaps claiming that the fraction of time LOO selects the right model approaches 1/2 asymptotically? My intuition would argue against even that. A key point is that the MLE and the posterior distributions will not fix beta to zero, and this is going to make things worse for the second model in the LOO assessments, since it actually is zero (as is mu as well). Even if the amount by which the second model does worse than the first gets smaller and smaller asymptotically, that wouldn’t necessarily prevent LOO from usually choosing the first model, on the basis that it does a tiny bit better.

It’s also not clear what LOO method you’re talking about. One could do LOO based on MLE estimates, or based on posterior predictive distributions, and one could score the results by squared error (guessing the mean value), or by negative log probability, or by various other criteria.

• Dan Simpson says:

The specific LOO method I’m talking about is using LOO to approximate the expected log-posterior predictive density. This LOO approximation is consistent (under independence), so it reduces to a proper scoring rule, which means that it will asymptotically select the model with the best posterior predictive density.

So I think it does work in the sense that it will always return a method that has a good posterior predictive distribution. It’s just that in cases where there are nested models, it might not return the simpler model. That has been argued for a long time as a negative (this paper from Shao is a classic https://www.jstor.org/stable/2290328?seq=1#page_scan_tab_contents).

8. Keith O'Rourke says:

Was reading someone the other day who suggested the method of argument for most people unavoidably had to be argument by authority – argument by science is just too hard/costly for most and highly unstable for rulers and cliques.

So perhaps we are stuck with most statisticians arguing by authority :-(

Of course increasing the percentage who will think scientifically should be a very high priority.

It is a shame how discussions like this seem to have so little impact https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&q=How+do+we+choose+our+default+methods%3F&btnG= https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&q=Convincing+evidence+Gelman+O%27Rourke&btnG=

p.s. First time I have missed Toronto pride since we got the apartment at Church and Wellesley 6+ years ago – next year!

• Keith O'Rourke says:

OK for those who wish to know: “The method of authority will always govern the mass of mankind; and those who wield the various forms of organized force in the state will never be convinced that dangerous reasoning ought not to be suppressed in some way. If liberty of speech is to be untrammeled from the grosser forms of constraint, then uniformity of opinion will be secured by a moral terrorism to which the respectability of society will give its thorough approval.” http://www.bocc.ubi.pt/pag/peirce-charles-fixation-belief.pdf

Seems topical today and would explain why I was banned from Cochrane’s Statistical Methods Group.

The apparent upside “Following the method of authority is the path of peace”.

• Maybe we can get Sander Greenland back here, once he is free. I also think to widen the pool of participants would be enlivening b/c of sometimes the state of unknowingness is helpful

• Dan Simpson says:

I don’t mind people arguing by authority, it’s just that when it’s done it’s very important to sketch out the boundaries of the advice as clearly as possible. Doing it deep into section 3 of a paper is totally fine for communicating with other statisticians, but one of the things about LOO-elpd is that the Stan project has been strongly advocating its use (and providing a really nice interface for doing it). In this context, it’s less obvious to me what the best way to sketch the boundaries of the applicability and if it’s something we’re doing well enough. That article that Aki responded to suggests that maybe we’re not doing well enough…

• Keith O'Rourke says:

> very important to sketch out the boundaries of the advice

I agree, that was the motivation for this post “I think the statistical discipline needs to take more responsibility for the habits of inference they instill in others” http://statmodeling.stat.columbia.edu/2017/10/05/missing-will-paper-likely-lead-researchers-think/

My sense of the author’s response was that someone with adequate statistical understanding who reads the limitations stated in various places in the paper and interprets them correctly would be able to discern if the fixed effects model was appropriate for their purposes. But as you put it earlier “I think the thing we forget is that we sort of have to say it every time and foreground it more.”

Short summary of the main issue there: The insurmountable? uncertainty regarding fixed effectS is primarily knowing what is causing the extra-sampling variation being observed in repeated or similar studies (or even not being observed). Is it due to treatment variation with subject features (interaction), identifiable variation of treatments being actually given or study conduct and reporting quality (whatever it is that detracts from validity in research)?

If it is just treatment variation with subject features, then for a fixed mixture of patient features the average effect will be fixed and well defined. That will enable one to generalize to a population with the same mix or post-stratify to another mix. However, I believe the author failed to make it clear to readers that this certainty about there only being treatment variation with subject features was required.

Additionally I believe it is very rare in most research fields to be fairly sure that the extra sampling variation is primarily being driven by treatment variation with subject features.

• Keith,

I agree with your main premise. But I think that not everyone is cut out to be communication and analytical genius either. Besides it is in the informal conversations where I have been able to discern the logical fallacies and some measure of cognitive biases operating.

BTW is that article linked, by you and Sander Greenland, behind a paywall?

• Keith O’Rourke says:

This is the only journal article I published with him and this link says free from my access site https://academic.oup.com/biostatistics/article/2/4/463/321492

Here is the abstract in case that might suffice –
“Results from better quality studies should in some sense be more valid or more accurate than results from other studies, and as a consequence should tend to be distributed differently from results of other studies. To date, however, quality scores have been poor predictors of study results. We discuss possible reasons and remedies for this problem. It appears that ‘quality’ (whatever leads to more valid results) is of fairly high dimension and possibly non‐additive and nonlinear, and that quality dimensions are highly application‐specific and hard to measure from published information. Unfortunately, quality scores are often used to contrast, model, or modify meta‐analysis results without regard to the aforementioned problems, as when used to directly modify weights or contributions of individual studies in an ad hoc manner. Even if quality would be captured in one dimension, use of quality scores in summarization weights would produce biased estimates of effect. Only if this bias were more than offset by variance reduction would such use be justified. From this perspective, quality weighting should be evaluated against formal bias‐variance trade‐off methods such as hierarchical (random‐coefficient) meta‐regression. Because it is unlikely that a low‐dimensional appraisal will ever be adequate (especially over different applications), we argue that response‐surface estimation based on quality items is preferable to quality weighting. Quality scores may be useful in the second stage of a hierarchical response‐surface model, but only if the scores are reconstructed to maximize their correlation with bias.”

• Keith O’Rourke says:

> not everyone is cut out to be communication and analytical genius either.
Yes, but everyone in academia (who publishes a paper in a journal) should be open to criticism by others – even if it comes from someone who is not a communication and analytical genius but even awkward and uninformed.

As my former director ( https://en.wikipedia.org/wiki/Allan_S._Detsky ) put it – “if you get a letter to the editor on one of your papers that is uniformed and wrong that means you did not communicate something clearly enough in your paper. Fortunately, you now have a change to fit it. So do so.”

• Keith O’Rourke says:

> you now have a change to fit it.
Argh – you now have a chance to fix it.

• Me? I have to reflect on how, if indeed warranted. I think you all are way more qualified than I. At the very least it requires more thought on my part. LOL Being a newcomer means that sometimes I’m playing catchup. But sometimes I sprint and dance around. heee heee.

• Criticism is unavoidable regardless. And a necessary enterprise. Nevertheless, sociology of expertise is neglected. Sometimes it’s an enterprise to carve out a distinguishing unique position that is defensible career wise.

Not sure about your area as much as I am sure about the domain of int. relations & political science. In so far as communication and analytics, problems of lack of clarity are acute for everyone. I think though, as I have continued to propose since 2004 or so. Basic logic is eclipsed in favor of proposing more complex models which seemingly have the veneer of predictive import.

9. Aki Vehtari says:

I’ve been one week offline, so I’m a bit late commenting on this excellent post.

> M-complete refers to the situation where the true model exists (and
> we can specify the explicit form for it), but for some reason it is
> not in the list of candidate models.

Bernardo & Smith p. 385 write “M-completed view, corresponds to an individual acting as if {M_i, i \in I} simply constitute a range of specified models currently available for comparison, to be evaluated in the light of the individual’s separate actual belief model”.

Vehtari & Ojanen (2012) http://dx.doi.org/10.1214/12-SS102 write about the actual belief model: “When a rich enough model, describing well the knowledge about the modeling problem and capturing the essential prior uncertainties, is constructed and there are no substantial deficiencies found in model criticism phase, we follow (Bernardo & Smith, 1994) and call such a model the *actual belief model*, and denote it by M_*.”.

There is no need to assume that the true model exists or that we could specify the explicit form for it. It is enough that we have a model that is describing the future in a way best reflecting the individuals knowledge and uncertainties. This model is used to evaluate the other models. projpred is based on this idea.

I have objected to the term LOOIC, as all *ICs have most of the time been used as some magical number, and I am really worried that we’ll see LOOIC joining the *IC cargo cult.

We are working on making it easier to do predictive model comparison (M-open and M-completed) for different types of models. We’ll probably have something to tell before StanCon Helsinki.