The “Canadian lynx data” is one of the famous examples used in time series analysis, and the usual models that are fit to these data in the statistics time-series literature don’t work well. Cavan Reilly and Angelique Zeringue write:

Reilly and Zeringue then present their analysis. Their simple little predator-prey model with a weakly informative prior way outperforms the standard big-ass autoregression models. Check this out:

Or, to put it into numbers, when they fit their model to the first 80 years and predict to the next 34, their root mean square out-of-sample error is 1480 (see scale of data above). In contrast, the standard model fit to these data (the SETAR model of Tong, 1990) has more than twice as many parameters but gets a worse-performing root mean square error of 1600, even when that model is fit to the entire dataset. (If you fit the SETAR or any similar autoregressive model to the first 80 years and use it to predict the next 34, the predictions are a disaster—the predicted values quickly go toward the mean and can’t even attempt to track the curve.)
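To make the holdout comparison concrete, here is a minimal sketch in R using the `lynx` series that ships with the base `datasets` package. It fits a plain autoregression (order chosen by AIC), not Tong's SETAR model and not the Reilly and Zeringue model, so the numbers it prints are only illustrative of the "fit to 80 years, predict 34" setup. The `rmse` helper and the `order.max = 12` cap are assumptions made for this example.

```r
# Split the series: first 80 years for fitting, last 34 for testing.
train <- window(lynx, end = 1900)    # 1821-1900
test  <- window(lynx, start = 1901)  # 1901-1934

# Root mean square error between two equal-length (or recycled) vectors.
rmse <- function(x, y) sqrt(mean((x - y)^2))

# Plain AR model (Yule-Walker, order picked by AIC up to 12).
fit  <- ar(train, order.max = 12)
pred <- predict(fit, n.ahead = length(test))$pred

rmse(test, pred)         # out-of-sample error of the AR forecast
rmse(test, mean(train))  # baseline: predict the training mean everywhere
```

Plotting `pred` against `test` (for example with `ts.plot(test, pred, lty = 1:2)`) is a quick way to see how far ahead the forecast keeps tracking the cycle.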

As Reilly and Zeringue note, the above graph shows potential room for improvement in the model, but even as is, it shows the huge benefits that can be obtained by attempting to model the underlying process rather than simply fitting the data using a conventional family of models.

(It’s funny for me to emphasize this point, given how often I use conventional models such as linear and logistic regression.)

P.S. The title and text above have been modified to reflect comments below with reference to models fit to the lynx data in the ecology literature. There appears to be not enough communication between ecologists and statisticians. The statistical point above still holds—a simple model with some reasonable structure can outperform a generic data-fitting model such as an autoregression—but you should probably check out some of the references given in the comments if you’re interested in the lynx example or ecology models more generally.

At the bottom I find two questions:

1) what is a model?

2) what is a model good for?

I think Cox, but also Diggle and Lindsey give clear answers.

1) The simplest probabilistic model is the i.i.d. assumption.

2) Stat. inference is based on probabilistic models.

To build or to fit a physical or empirical model or to work without any model at all depends on the purpose of the analysis.

One quick question:

What would be the RMSE in predicting the last 30 years of data if I simply took the first 30 years and literally copy and pasted them at the end of the year 80?

I wonder if it beats all other approaches in a forecasting competition. Though obviously one could do better by matching the phase before pasting.

P.S. The forecast in the picture does well on amplitude and frequency, but it seems out of phase. Even so, getting those two parameters right is enough to do very well.

Hi John – the ‘lynx’ data set is distributed with R, so I just did a check of an even simpler baseline, just predict the mean:

> rmse <- function(x, y) sqrt(mean((x - y)^2))

> rmse(lynx, mean(lynx))

[1] 1578.873

> rmse(lynx[-(1:80)], mean(lynx[1:80]))

[1] 1825.286

Your suggestion indeed has a good RMSE:

> rmse(lynx[(1901:1934)-start(lynx)[1]], lynx[4:37])

[1] 1131.475

Of course, it’s not legal to “match the phase before pasting”, because those last 34 years are the test data.

In Reilly and Zeringue’s article, it’s weird to see no mention of Recursive Bayesian estimation, which is the general framework for what they are doing. How many times has Recursive Bayesian estimation been independently invented?

Autoregression provides no explanation in the sense of, say, David Deutsch’s _The Beginning of Infinity: Explanations That Transform the World_. It just presumes that the future will repeat patterns that we have observed in the past. (See also, for example, Judea Pearl’s criticism of Karl Pearson’s characterization of statistics, and more generally science, as contingency tables.)

To provide good time-series explanations (‘hard to vary, in which all of the details play a functional role’, Deutsch, p. 24), you need to model phenomena as dynamic, causal processes (e.g., lynxes and hares reproduce; lynxes eat hares). Statistics then participates deductively, by repeated application of Bayes’ rule. Successful predictions in turn arise because of the high quality of the explanation of what has already been observed, not just in terms of “R^2”, but more importantly in improving our (causal) understanding of what is going on.
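As a toy illustration of "repeated application of Bayes' rule", the sketch below updates a posterior for a single growth rate one observation at a time on a grid. Everything in it is hypothetical (the simulated data, `r_true`, `sigma`, and the grid), and it is not the procedure used by Reilly and Zeringue; it only shows the recursive prior-times-likelihood step.

```r
set.seed(1)
r_true <- 0.3   # hypothetical true log growth rate
sigma  <- 0.2   # hypothetical (known) noise sd

# Simulated log growth increments: log(N[t+1] / N[t]) ~ Normal(r, sigma).
y <- rnorm(20, mean = r_true, sd = sigma)

# Discretised parameter space with a flat prior.
r_grid    <- seq(-1, 1, by = 0.01)
posterior <- rep(1 / length(r_grid), length(r_grid))

# Recursive update: yesterday's posterior is today's prior.
for (obs in y) {
  posterior <- posterior * dnorm(obs, mean = r_grid, sd = sigma)
  posterior <- posterior / sum(posterior)
}

r_hat <- sum(r_grid * posterior)  # posterior mean of the growth rate
r_hat
```

Processing the observations one at a time gives the same posterior as multiplying all the likelihoods at once; the recursive form just makes the "update as data arrive" reading explicit.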

I’m having trouble reading that scanned plot, is the wavy line around 4,000 the hare line? (I was going to make a hareline joke here, but my question is real. :-)

The funny thing is, as an ecologist, I consider the model described in the article to be a very poor model of the underlying process; they assume Lotka-Volterra (LV) dynamics (that is, exponential prey growth, and linear predation rates). This ignores two important mechanisms: self-inhibition of hare growth by their own density, and predator saturation. They mention the fact that the LV model is structurally unstable, but brush that fact off as unimportant, and don’t bother to cite any of the existing ecological models of this time series that use realistic mechanisms (a good review of that literature came out in Peter Turchin’s “Complex population dynamics” the year before). They also ignore the significant measurement error problems inherent in harvest data: if the trappers are trying to increase their economic gains, they’ll likely exert more effort when the lynx population peaks, so the harvest record will exaggerate peaks and troughs.
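For readers who have not seen the two model structures side by side, here is a rough sketch in base R. The first function is the classic LV system (exponential prey growth, linear predation); the second adds the two mechanisms mentioned above, using logistic prey growth and a Holling type II predation term, which is one common way to write them and not necessarily the form used in the ecological papers on lynx. All parameter values and the small `rk4` integrator are made up for illustration.

```r
# Classic Lotka-Volterra: state = c(prey H, predator L).
lv <- function(state, p) {
  H <- state[1]; L <- state[2]
  c(p$r * H - p$a * H * L,        # exponential growth minus linear predation
    p$e * p$a * H * L - p$m * L)  # conversion of prey eaten minus mortality
}

# Variant with prey self-limitation (carrying capacity K) and
# predator saturation (handling time h).
lv_saturating <- function(state, p) {
  H <- state[1]; L <- state[2]
  eaten <- p$a * H / (1 + p$a * p$h * H) * L
  c(p$r * H * (1 - H / p$K) - eaten,
    p$e * eaten - p$m * L)
}

# Fixed-step fourth-order Runge-Kutta integrator; returns one row per step.
rk4 <- function(f, state, p, dt, n) {
  out <- matrix(NA_real_, n + 1, length(state))
  out[1, ] <- state
  for (i in seq_len(n)) {
    k1 <- f(state, p)
    k2 <- f(state + dt / 2 * k1, p)
    k3 <- f(state + dt / 2 * k2, p)
    k4 <- f(state + dt * k3, p)
    state <- state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    out[i + 1, ] <- state
  }
  out
}

# Hypothetical parameters, chosen only so the trajectories are well behaved.
p <- list(r = 0.5, a = 0.02, e = 0.5, m = 0.3, h = 0.1, K = 200)
lv_path  <- rk4(lv,            c(40, 20), p, dt = 0.01, n = 5000)
sat_path <- rk4(lv_saturating, c(40, 20), p, dt = 0.01, n = 5000)

range(lv_path[, 1])   # prey range under plain LV
range(sat_path[, 1])  # prey range with density dependence and saturation
```

Calling `matplot(lv_path, type = "l")` and `matplot(sat_path, type = "l")` shows the qualitative difference: plain LV keeps cycling at whatever amplitude the initial condition sets, while the second model's behaviour depends on the parameters.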

By ignoring important ecological mechanisms, any parameter estimates from their model are likely not terribly useful for ecologists or managers working on the system (not that auto-regressive coefficients would have been any handier). The statistical methods they use are very interesting, but it would have been a much stronger paper if they had bothered to bring an actual population ecologist on board.

Eric:

That all makes sense. But what should amaze you much more is that the standard method for fitting these data (which are indeed very well known among statisticians) is much much worse than the model in the linked paper.

In all seriousness, I strongly recommend that you write a paper fitting these data using a more sensible model. As noted above, the lynx data is famous in statistics, and I think you’d be making a very useful contribution by fitting a good model to these data.

Eric is right, the specific model they fit isn’t ecologically sensible. But IIRC more sensible models were fit to these data years ago. The lynx data aren’t just famous in statistics, they’re famous in ecology too, and many attempts have been made to model them.

There have been a lot of attempts indeed. Turchin in “Complex Population Dynamics” and Kendall et al., in “Why do populations cycle? A synthesis of statistical and mechanistic modeling approaches”, both cite Royama’s “Analytical Population Dynamics” (from 1992!) as the definitive statistical/mechanistic treatment of the Lynx cycle. I have yet to read Royama myself, so I can’t vouch for its statistical quality, but it should at least have been cited.

This leaves me asking: how do we increase the cross-talk between ecology and statistics (and I suspect, more broadly, between statistics and a lot of discipline-specific statistical work)? I know ecologists could benefit from attracting more applied statisticians (virtually every problem in field and experimental ecology I’d consider “interesting” involves this style of statistically fitting and testing mechanistic models); however, the best papers I’ve seen have been a result of applied statisticians working closely with the field workers and theorists who are able to say what models do and don’t make sense, and identify which patterns are useful in testing the model’s validity.

I just hope the statisticians who are active in the time series literature read this thread and the references you’ve provided.

What Eric said! Here are some further references for readers interested in the state of the art in ‘eco-statistical’ time series modeling. Not an exhaustive list, just some of my favorite papers off the top of my head:

Wood 2010 Nature 466:1102. Statistical inference for noisy nonlinear ecological dynamic systems. http://www.nature.com/nature/journal/v466/n7310/full/nature09319.html

Kendall et al. 2005 Ecological Monographs. Population cycles in the pine looper moth: dynamical tests of mechanistic hypotheses.

Wood 2001 Ecological Monographs 71:1. Partially specified ecological models.

Nelson et al. 2004 Ecology 85:889. Capturing dynamics with the correct rates: inverse problems using semiparametric approaches.

Ellner et al. 2002 Ecology 83:2256. Fitting population dynamic models to time-series data by gradient matching. http://www.esajournals.org/doi/abs/10.1890/0012-9658%282002%29083%5B2256%3AFPDMTT%5D2.0.CO%3B2?journalCode=ecol

Turchin et al. 2003 Ecology 84:1207. Dynamical effects of plant quality and parasitism on population cycles of the larch budmoth. http://www.esajournals.org/doi/abs/10.1890/0012-9658%282003%29084%5B1207%3ADEOPQA%5D2.0.CO%3B2?journalCode=ecol

Bjornstad et al. 2002 Ecological Monographs 72:169. Dynamics of measles epidemics: estimating scaling of transmission rates using a time series SIR model. http://asi23.ent.psu.edu/publ/bcf/bcf1.pdf

Rees and Ellner 2009 Ecological Monographs 79:575. Integral projection models for populations in temporally varying environments. http://www.esajournals.org/doi/abs/10.1890/08-1474.1?journalCode=emon

Harrison 1995 Ecology 76:357. Comparing predator-prey models to Luckinbill’s experiment with Didinium and Paramecium. http://www.jstor.org/pss/1941195

Lots of papers on disease dynamics from Aaron King (U Michigan) and his colleagues: see http://kinglab.eeb.lsa.umich.edu/lab/pubs. Of these, Ionides et al. 2006 PNAS is probably the most important: http://www.pnas.org/content/103/49/18438

Worth noting that not all these authors are ecologists; some, such as Simon Wood, are applied statisticians. The statisticians involved in this work are well-known in ecology; not sure how widely their work is known in statistics.

Also worth noting that this approach got a real trial by fire during the 2001 foot and mouth disease outbreak in Britain, when the British government asked teams of ecologists and statisticians to model an ongoing disease outbreak in real time and based its management decisions on their advice. See papers by Matt Keeling and others in Nature and Science.

Fitting biologically-motivated, mechanistic models to time series data is something ecologists have been doing for a while now. See work from Bruce Kendall, Ed McCauley, Bill Nelson, Simon Wood, Steve Ellner, and others, much of it published in Ecology. In disease ecology and epidemiology, this is pretty much the standard approach nowadays–see work from Brian Grenfell, Ottar Bjornstad, and other Penn State folks for instance. A range of interesting statistical approaches have been put through their paces, including ideas like “probe matching” (instead of fitting to the time series itself, fit to a multivariate vector of statistical features of the time series, aka “probes”) and “semi-parametric” or “partially-specified” models (these describe well-understood bits of biology with specified parametric functions and estimate their parameters from the data; they describe poorly-understood bits of biology with flexible non-parametric functions and estimate the form of the function from the data). Also lots of interest in fitting alternative models in order to test alternative hypotheses about the causes of population cycles, and in making effective use of prior knowledge (e.g., of plausible parameter values).
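To give a feel for what "probes" might look like in practice, here is a small base-R sketch that summarises a series by a handful of features and measures how far a simulated series is from the observed one in probe space. The particular probes (mean, sd, two autocorrelations, dominant period), the relative-error distance, and the sine-wave stand-in for a simulator are all choices made for this example, not the ones used in the papers cited in this thread.

```r
# Summarise a series by a few statistical features ("probes").
probes <- function(x) {
  x  <- as.numeric(x)
  a  <- acf(x, lag.max = 5, plot = FALSE)$acf        # a[1] is lag 0
  sp <- spec.pgram(ts(x), taper = 0, plot = FALSE)   # raw periodogram
  c(mean   = mean(x),
    sd     = sd(x),
    acf1   = a[2],
    acf5   = a[6],
    period = 1 / sp$freq[which.max(sp$spec)])        # dominant cycle length
}

# Distance between observed and simulated series in probe space,
# scaled by the observed probes (a crude, hypothetical weighting).
probe_distance <- function(obs, sim) {
  po <- probes(obs)
  ps <- probes(sim)
  sqrt(sum(((ps - po) / abs(po))^2))
}

# Stand-in for the output of a mechanistic simulator: a pure cycle with
# roughly the right mean and spread.
sim <- mean(lynx) + sqrt(2) * sd(lynx) * sin(2 * pi * seq_along(lynx) / 9.6)

probes(lynx)
probe_distance(lynx, sim)
```

In a real probe-matching fit, `sim` would come from the candidate model at a given parameter value, and an optimiser or sampler would search for parameters that make the distance small on average over simulations.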

I can’t seem to reply to your other comment (perhaps there’s a nesting limit?), but I wanted to say (1) “+1” to both you (nice examples and explication of the context) and to Eric Pedersen (there should be more cross-talk between subject-area experts and statisticians) and (2) I believe the true story about how effective the models actually were in guiding policy in the foot-and-mouth outbreak is considerably murkier. I have anecdotal evidence from several sources (including second-hand information going back to someone who actually interviewed some of the principals) that it was much more complicated than “triumph of the mechanistic models”. (My own anecdote in this area goes back to trying to fit autoregressive-type models to measles epidemic time series in the early 1990s, as a contrast to more statistically naive but mechanistic approaches — standard ARIMA models were terrible and required very high-order models, although SETAR etc. may have been better for all I know.)

Hi Ben, welcome to the party. ;-) After I posted my previous comment, I actually thought “Oh shoot, I forgot to give Ben a shout out.” ;-)

You and I should sit down over a beer at ESA so I can get your anecdotes on the foot and mouth story. I haven’t talked to the principals, or people who talked to the principals. But I was in Britain when the outbreak happened and the tv and newspaper investigative journalism I saw and read at the time certainly put the modelers in a more flattering light than the government vets. There are of course a number of related but distinct questions one could ask about the effectiveness of the advice the British government received. That the advice from the modelers was less than optimal (and also less good than it could or should have been in the circumstances?) need not imply that the advice from the vets was any better.

There is a still-developing literature on the outbreak, with people taking the time to build and parameterize better (or at least different) models in an attempt to understand what the initial modeling efforts got right and what they got wrong. But I haven’t dug into this literature and so I don’t know what the current consensus is, or even if there is a consensus.

In any case, my main reason for raising the anecdote on this thread was just to highlight it for interested readers who weren’t aware of it. It’s a very interesting case study of mechanistic time series modeling in action.

This great post plus discussion reinforces my opinion that as an applied/consulting statistician I need to be familiar with more “named” models (e.g. predator-prey) to complement my standard “generic” model fitting approaches.

Hands up, those arguing there should be more discussion between ecologists & statisticians. How many of you work on ecological statistics? Things are certainly not perfect, but there actually is a lot of discussion between the two groups. There are meetings like ISEC (in Norway this year), Ecology has a section on statistical methods, there’s Methods in Ecology and Evolution which also publishes a lot of statistical material, and other journals will happily publish statistical papers (e.g. data cloning was first published in Ecology Letters). In stats, Biometrics publishes ecology, and of course other stats journals will publish ecology too.

Fair enough Bob. But as you say, things clearly aren’t perfect. If they were, the paper Andrew linked to presumably would not have been written, and neither would this post! I note that most of the material you cite is aimed at ecologists, or at applied statisticians who specialize in ecology and evolution. It sounds like there’s a need to broaden awareness within statistics (or even just that bit of it concerned with time series analysis) of the striking successes that ecologists + applied statisticians have been having with their mechanistic models. As I’ve noted over at the Oikos blog (http://oikosjournal.wordpress.com/2012/01/30/statisticians-meet-ecologists/), even within ecology I don’t know that there’s sufficient awareness of this work.

So… I have an applied maths background, and now I’m a statistician. I’m a bit puzzled by this.

The first time I saw this result was years ago when I was a PhD student and it was a minor side point in an Honours thesis presentation. (It was a model of Measles epidemics in NZ, and hours of MCMC fit worse than a 1 minute parameter fitting in Matlab for an SIR model).

I mean, how isn’t this done (and very well known!) already? It’s painfully obvious that baseline time series methods can’t fit this model well, although a multivariate model would obviously do better (you can eyeball the negative correlation with some delay).

But the *most basic* model, which is a staple of every first/second year ODE course as something that’s too complex to solve in closed form but simple enough for all of the phase plane/steady state analysis, is something that I would’ve thought a lot of people would’ve looked at a *long* time ago.

Hello. I am a statistical neophyte who is utterly terrified of posting to this blog (read: be nice), but I did co-author this related paper with Mevin Hooten:

http://www.jstor.org/pss/20640583

Based on the title, you’d never guess that it involves species competition. We embed a Lotka-Volterra competition (not predator-prey) model in a hierarchical framework, well aware of LV’s disadvantages. (The point here was to simply create the model structure, not so much to create a “good” model.) We use a famous Paramecium data set (from Gause’s “The Struggle for Existence”) and assume an underlying continuous process model.

Hi,

I am not an expert on time series analysis, so I am writing this with some trepidation.

The Canadian lynx series was analysed by Moran, to propose the Moran effect, where there is synchrony in populations across large spatial distances due to correlations with environmental variables. Perhaps this may explain why the time series modeling is so difficult? My apologies if this is already known and familiar to people here.

The original paper by Moran:

http://www.publish.csiro.au/?paper=ZO9530163

A couple of others that may be of interest.

http://www.jstor.org/pss/3545809

http://www.personal.psu.edu/pjh18/downloads/88_Hudson_Cattadori_Moran_effect_TREE.pdf
